Files
knot/README.md

242 lines
6.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Rope Data Structure (C++ Implementation)
This library provides an efficient **rope** implementation in C++ for large, mutable text.
It supports fast insertion, deletion, concatenation, iteration, and regex-based search.
The design uses fixed-size leaf chunks and automatically maintains a balanced binary tree.
## Features
* Efficient handling of large text buffers
* Log-time insertion, deletion, and concatenation
* Optional tunable chunk sizes
* Line, leaf, and byte iterators
* Regex search using PCRE2 (DFA mode)
* Regex using `noose` can also be done but it is much slower for now, I am working on making it faster.
* Zero-copy leaf iteration (safe immutable reads)
* Fully heap-allocated, manual memory management
* Deterministic balancing strategy (AVL-style depth tracking)
## ⚡ Performance Benchmarks
Performance comparison between this Rope implementation and the standard C++ `std::string`.
**Test Context:**
* **Dataset:** ~1 GiB generated text (1,073,741,856 bytes).
* **Goal:** Simulating heavy text-editing operations.
### Key Highlights
When handling large datasets, the structural advantages of the Rope become massive:
* **Appending Large Text:** **2.4 Million x faster** (0.001ms vs 1831ms)
* **Splitting Text:** **162,000x faster**
* **Insert (Middle):** **87,000x faster**
* **Erase:** **15,000x faster**
### Numbers
| Operation | Rope Time | String Time | String to Rope Ratio | Winner |
| :--- | :--- | :--- | :--- | :--- |
| **Concat (Append Large)** | **0.001 ms** | 1831.666 ms | **2,429,265x** | ✅ **Rope** |
| **Split (Half)** | **0.008 ms** | 1353.086 ms | **162,240x** | ✅ **Rope** |
| **Insert (Middle)** | **0.011 ms** | 976.105 ms | **87,206x** | ✅ **Rope** |
| **Erase (5KB)** | **0.013 ms** | 207.760 ms | **15,953x** | ✅ **Rope** |
| **Iterate 1000 Lines** | **0.063 ms** | 5.723 ms | **90x** | ✅ **Rope** |
| **Concat (Append small)** | 0.002 ms | **0.000 ms** | 0.1 | ❌ String |
| **Load / Create** | 1512.449 ms | **871.330 ms** | 0.58x | ❌ String |
| **Free / Destruct** | 194.167 ms | **153.365 ms** | 0.79x | ❌ String |
| **Read / Substr (1KiB)** | 0.008 ms | **0.002 ms** | 0.24x | ❌ String |
| **Search (Regex)** | 6417.719 ms | **1.526 ms** | ~0x | ❌ String |
**Why is the Rope faster?**<br>
Standard strings require contiguous memory. When you insert text into the middle of a 1GB `std::string`, the CPU must shift all subsequent bytes in memory ($O(n)$ complexity). This Rope implementation uses a tree structure, allowing insertions and deletions by simply modifying pointers ($O(\log n)$ complexity).
Also for concatenation ropes have amortized constant time vs strings with linear time.<br>
So smaller concatenation strings are faster but for any larger concatenation the rope is best.<br>
**Why is the String faster for Regex/Read?**<br>
`std::string` is just a flat array of bytes, which is extremely cache-friendly for linear scanning and regex engines. Ropes require tree traversal to read sequential data, resulting in higher overhead for read-only operations.
I have also tested against `ropey` the rust library and rust strings.<br>
Ropey has much faster load times but is slightly slower than my implementation in other operations.<br>
I am not including the tests here as my test might not be in the most optimal condition and so have different results.<br>
The tests included here are the ones that I have done myself with clang++ compiling both tests in the same binary. (see: `tests/benchmark.cpp`)<br>
Also on a side note rust strings are faster than c++ strings apart from reading.
## Rope Node Structure
Each rope node (`Knot`) is either:
* an **internal node** with `left` and `right` children, or
* a **leaf node** containing text data of size `chunk_size`.
Metadata stored per node:
* `depth` subtree height
* `chunk_size` leaf capacity
* `line_count` number of `\n` newline characters
* `char_count` total byte length covered by the subtree
```c
typedef struct Knot {
Knot *left;
Knot *right;
uint8_t depth;
uint32_t chunk_size;
uint32_t line_count;
uint32_t char_count;
char data[];
} Knot;
```
## Chunk Size
The library supports arbitrary positive chunk sizes, but provides:
```c
uint32_t optimal_chunk_size(uint64_t length);
```
This suggests a chunk size based on the target input size.
Valid range is:
* **MIN_CHUNK_SIZE:** 64 bytes
* **MAX_CHUNK_SIZE:** 8192 bytes
You may choose any value; all ropes participating in concat operations **must use the same chunk size**.
# API Overview
## Construction and Loading
```c
load(char *str, uint32_t len, uint32_t chunk_size)
```
Builds a rope from a raw byte buffer.
`str` is not consumed and may be freed after loading.
## Structural Operations
```c
Knot *concat(Knot *left, Knot *right)
```
Concatenates two ropes (must share chunk size).
Both input roots are invalid after the call.
```c
Knot *insert(Knot *node, uint32_t offset, char *str, uint32_t len)
```
Inserts text at a byte offset.
Returns a new rope.
The original node becomes invalid.
```c
Knot *erase(Knot *node, uint32_t offset, uint32_t len)
```
Deletes a byte range.
Returns a new rope.
The original node becomes invalid.
```c
void split(Knot *node, uint32_t offset, Knot **left, Knot **right)
```
Splits a rope into two ropes at the given byte offset.
The original node becomes invalid.
```c
char *read(Knot *root, uint32_t offset, uint32_t len)
```
Extracts a substring.
Returns a null-terminated buffer; caller must free it.
## Line & Byte Mapping
```c
uint32_t byte_to_line(Knot *root, uint32_t offset)
```
Converts a byte offset into a line index.
```c
uint32_t line_to_byte(Knot *root, uint32_t line, uint32_t *out_len)
```
Returns the byte offset of the beginning of a line and outputs that lines length.
## Iterators
### Line Iterator
```c
LineIterator *begin_l_iter(Knot *root, uint32_t start_line);
char *next_line(LineIterator *it); // caller frees result
```
### Leaf Iterator
```c
LeafIterator *begin_k_iter(Knot *root);
char *next_leaf(LeafIterator *it); // DO NOT free result
```
### Byte Iterator
```c
ByteIterator *begin_b_iter(Knot *root);
char next_byte(ByteIterator *it);
```
All iterators must be freed by the caller after use.
> For `ByteIterator`, `ByteIterator.it` must be freed before the iterator is freed.
## Searching
```cpp
std::vector<std::pair<size_t, size_t>> search_rope(Knot *root, const char *pattern)
```
Searches the rope using PCRE2 in DFA mode.
Returns a vector of `(start_offset, length)` pairs.
Only deterministic patterns are supported (no backtracking).
## Memory Management
```c
void free_rope(Knot *root)
```
Recursively frees all nodes in the rope.
Must be called once when the rope is no longer needed.
# Example Usage
```c
uint32_t chunk_size = optimal_chunk_size(strlen(input));
Knot *r = load(input, strlen(input), chunk_size);
r = insert(r, 5, "hello", 5);
r = erase(r, 20, 3);
char *sub = read(r, 10, 25);
printf("%s\n", sub);
free(sub);
free_rope(r);
```