2025-11-28 19:14:43 +00:00
2025-11-28 20:57:38 +00:00
2025-11-28 20:57:38 +00:00
2025-11-28 19:14:43 +00:00
2025-11-28 20:57:38 +00:00

Rope Data Structure (C++ Implementation)

This library provides an efficient rope implementation in C++ for large, mutable text. It supports fast insertion, deletion, concatenation, iteration, and regex-based search. The design uses fixed-size leaf chunks and automatically maintains a balanced binary tree.

Features

  • Efficient handling of large text buffers
  • Log-time insertion, deletion, and concatenation
  • Optional tunable chunk sizes
  • Line, leaf, and byte iterators
  • Regex search using PCRE2 (DFA mode)
  • Regex using noose can also be done but it is much slower for now, I am working on making it faster.
  • Zero-copy leaf iteration (safe immutable reads)
  • Fully heap-allocated, manual memory management
  • Deterministic balancing strategy (AVL-style depth tracking)

Performance Benchmarks

Performance comparison between this Rope implementation and the standard C++ std::string.

Test Context:

  • Dataset: ~1 GiB generated text (1,073,741,856 bytes).
  • Goal: Simulating heavy text-editing operations.

Key Highlights

When handling large datasets, the structural advantages of the Rope become massive:

  • Appending Large Text: 2.4 Million x faster (0.001ms vs 1831ms)
  • Splitting Text: 162,000x faster
  • Insert (Middle): 87,000x faster
  • Erase: 15,000x faster

Numbers

Operation Rope Time String Time String to Rope Ratio Winner
Concat (Append Large) 0.001 ms 1831.666 ms 2,429,265x Rope
Split (Half) 0.008 ms 1353.086 ms 162,240x Rope
Insert (Middle) 0.011 ms 976.105 ms 87,206x Rope
Erase (5KB) 0.013 ms 207.760 ms 15,953x Rope
Iterate 1000 Lines 0.063 ms 5.723 ms 90x Rope
Concat (Append small) 0.002 ms 0.000 ms 0.1 String
Load / Create 1512.449 ms 871.330 ms 0.58x String
Free / Destruct 194.167 ms 153.365 ms 0.79x String
Read / Substr (1KiB) 0.008 ms 0.002 ms 0.24x String
Search (Regex) 6417.719 ms 1.526 ms ~0x String

Why is the Rope faster?
Standard strings require contiguous memory. When you insert text into the middle of a 1GB std::string, the CPU must shift all subsequent bytes in memory (O(n) complexity). This Rope implementation uses a tree structure, allowing insertions and deletions by simply modifying pointers (O(\log n) complexity).

Also for concatenation ropes have amortized constant time vs strings with linear time.
So smaller concatenation strings are faster but for any larger concatenation the rope is best.

Why is the String faster for Regex/Read?
std::string is just a flat array of bytes, which is extremely cache-friendly for linear scanning and regex engines. Ropes require tree traversal to read sequential data, resulting in higher overhead for read-only operations.

I have also tested against ropey the rust library and rust strings.
Ropey has much faster load times but is slightly slower than my implementation in other operations.
I am not including the tests here as my test might not be in the most optimal condition and so have different results.
The tests included here are the ones that I have done myself with clang++ compiling both tests in the same binary. (see: tests/benchmark.cpp)
Also on a side note rust strings are faster than c++ strings apart from reading.

Rope Node Structure

Each rope node (Knot) is either:

  • an internal node with left and right children, or
  • a leaf node containing text data of size chunk_size.

Metadata stored per node:

  • depth subtree height
  • chunk_size leaf capacity
  • line_count number of \n newline characters
  • char_count total byte length covered by the subtree
typedef struct Knot {
    Knot *left;
    Knot *right;
    uint8_t depth;
    uint32_t chunk_size;
    uint32_t line_count;
    uint32_t char_count;
    char data[];
} Knot;

Chunk Size

The library supports arbitrary positive chunk sizes, but provides:

uint32_t optimal_chunk_size(uint64_t length);

This suggests a chunk size based on the target input size. Valid range is:

  • MIN_CHUNK_SIZE: 64 bytes
  • MAX_CHUNK_SIZE: 8192 bytes

You may choose any value; all ropes participating in concat operations must use the same chunk size.

API Overview

Construction and Loading

load(char *str, uint32_t len, uint32_t chunk_size)

Builds a rope from a raw byte buffer. str is not consumed and may be freed after loading.

Structural Operations

Knot *concat(Knot *left, Knot *right)

Concatenates two ropes (must share chunk size). Both input roots are invalid after the call.

Knot *insert(Knot *node, uint32_t offset, char *str, uint32_t len)

Inserts text at a byte offset. Returns a new rope. The original node becomes invalid.

Knot *erase(Knot *node, uint32_t offset, uint32_t len)

Deletes a byte range. Returns a new rope. The original node becomes invalid.

void split(Knot *node, uint32_t offset, Knot **left, Knot **right)

Splits a rope into two ropes at the given byte offset. The original node becomes invalid.

char *read(Knot *root, uint32_t offset, uint32_t len)

Extracts a substring. Returns a null-terminated buffer; caller must free it.

Line & Byte Mapping

uint32_t byte_to_line(Knot *root, uint32_t offset)

Converts a byte offset into a line index.

uint32_t line_to_byte(Knot *root, uint32_t line, uint32_t *out_len)

Returns the byte offset of the beginning of a line and outputs that lines length.

Iterators

Line Iterator

LineIterator *begin_l_iter(Knot *root, uint32_t start_line);
char *next_line(LineIterator *it);  // caller frees result

Leaf Iterator

LeafIterator *begin_k_iter(Knot *root);
char *next_leaf(LeafIterator *it);  // DO NOT free result

Byte Iterator

ByteIterator *begin_b_iter(Knot *root);
char next_byte(ByteIterator *it);

All iterators must be freed by the caller after use.

For ByteIterator, ByteIterator.it must be freed before the iterator is freed.

Searching

std::vector<std::pair<size_t, size_t>> search_rope(Knot *root, const char *pattern)

Searches the rope using PCRE2 in DFA mode. Returns a vector of (start_offset, length) pairs. Only deterministic patterns are supported (no backtracking).

Memory Management

void free_rope(Knot *root)

Recursively frees all nodes in the rope. Must be called once when the rope is no longer needed.

Example Usage

uint32_t chunk_size = optimal_chunk_size(strlen(input));
Knot *r = load(input, strlen(input), chunk_size);

r = insert(r, 5, "hello", 5);
r = erase(r, 20, 3);

char *sub = read(r, 10, 25);
printf("%s\n", sub);
free(sub);

free_rope(r);
Description
A performant rope implementation and a regex searcher in c++.
Readme 109 KiB
Languages
C++ 100%