Octo Flow
- GitHub: https://github.com/writeonlycode/octo-flow
- Crates.io: https://crates.io/crates/octo-flow
Octo Flow is a high-performance command-line tool written in Rust for processing large streams of GitHub event data.
The Problem
GitHub Archive datasets are published as compressed NDJSON files, often reaching gigabytes in size. Processing this data efficiently is challenging:
- Loading the full dataset into memory is not feasible
- Traditional tools can be slow or memory-intensive
- Simple tools like grep lack structured JSON parsing
The Solution
Octo Flow processes NDJSON streams line-by-line, transforming raw event data into clean tabular output while maintaining a constant memory footprint.
It is designed for:
- data pipelines
- log processing
- analytics workflows
- ETL preprocessing
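The constant memory footprint described above comes from handling one line at a time. A minimal sketch of that idea, using only the standard library: `for_each_line` is a hypothetical helper, not a function from the octo-flow crate, and it reuses a single `String` buffer so memory use is bounded by the longest line rather than by the size of the input.

```rust
use std::io::{self, BufRead};

/// Calls `handle` once per non-empty line of `reader`, reusing one buffer.
/// Hypothetical helper for illustration; not part of the octo-flow API.
fn for_each_line<R: BufRead>(mut reader: R, mut handle: impl FnMut(&str)) -> io::Result<u64> {
    let mut buf = String::new();
    let mut count = 0u64;
    loop {
        // Keep the allocation, drop the previous line's contents.
        buf.clear();
        if reader.read_line(&mut buf)? == 0 {
            break; // end of input
        }
        // Strip the trailing newline (and carriage return, if any).
        let line = buf.trim_end_matches(&['\n', '\r'][..]);
        if !line.is_empty() {
            handle(line);
            count += 1;
        }
    }
    Ok(count)
}
```

Passing `std::io::stdin().lock()` as the reader would cover the piped `--input -` case, since the locked handle implements `BufRead`.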
Example
curl https://data.gharchive.org/2026-03-11-15.json.gz \
| zcat \
| octo-flow --input - --event WatchEvent
Example Output
2489651057 2015-01-01T15:00:03Z SametSisartenep visionmedia/debug WatchEvent
2489651078 2015-01-01T15:00:05Z comcxx11 phpsysinfo/phpsysinfo WatchEvent
2489651080 2015-01-01T15:00:05Z Soufien wasabeef/awesome-android-libraries WatchEvent
Architecture
The tool uses a streaming pipeline:
input stream
↓
BufReader
↓
line iterator
↓
serde_json parser
↓
event filter
↓
TSV output
This design allows multi-gigabyte datasets to be processed using only a few megabytes of memory.
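The stages above can be sketched end to end with the standard library alone. This is an illustration of the data flow, not the crate's actual code: `run` and `string_field` are hypothetical names, and `string_field` is a deliberately naive stand-in for the serde_json stage (it ignores escapes and nesting) so that the example compiles without external crates.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical stand-in for the serde_json stage: returns the value of the
/// first `"key":"value"` pair in `line`. Ignores escapes and nesting, so it
/// only illustrates the data flow and is not a real JSON parser.
fn string_field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\":\"", key);
    let start = line.find(needle.as_str())? + needle.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// input stream -> line iterator -> parse -> event filter -> TSV output.
/// Returns the number of rows written.
fn run<R: BufRead, W: Write>(reader: R, mut out: W, wanted: &str) -> io::Result<usize> {
    let mut written = 0;
    for line in reader.lines() {
        let line = line?; // propagate I/O errors
        // Event filter: keep only the requested event type.
        if string_field(&line, "type") != Some(wanted) {
            continue;
        }
        // Missing fields become empty columns in this sketch.
        let id = string_field(&line, "id").unwrap_or("");
        let created = string_field(&line, "created_at").unwrap_or("");
        let login = string_field(&line, "login").unwrap_or("");
        let repo = string_field(&line, "name").unwrap_or("");
        writeln!(out, "{}\t{}\t{}\t{}\t{}", id, created, login, repo, wanted)?;
        written += 1;
    }
    Ok(written)
}
```

Only one line is held in memory at a time, and the extracted fields are slices that borrow from that line rather than copies. The column order is assumed from the example output shown earlier.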
Technical Highlights
- Streaming JSON parsing with Serde
- Zero-copy deserialization using &str
- Buffered I/O with BufReader
- Iterator-based processing
- Structured error handling with thiserror
- CLI integration tests using assert_cmd
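To make the structured error handling item concrete, here is a hand-written sketch of the kind of code thiserror generates: a `Display` impl, an `Error` impl, and a `From` conversion. The `PipelineError` type and its variants are assumptions for illustration; the crate's real error type may look different.

```rust
use std::{error::Error, fmt, io};

/// Hypothetical error type; the real crate's variants may differ.
#[derive(Debug)]
enum PipelineError {
    Io(io::Error),
    Parse { line: u64, message: String },
}

// With thiserror, this impl would come from #[error("...")] attributes.
impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PipelineError::Io(e) => write!(f, "I/O error: {}", e),
            PipelineError::Parse { line, message } => {
                write!(f, "line {}: invalid JSON: {}", line, message)
            }
        }
    }
}

impl Error for PipelineError {
    // Expose the underlying cause so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            PipelineError::Io(e) => Some(e),
            PipelineError::Parse { .. } => None,
        }
    }
}

// With thiserror, this conversion would come from #[from]; it lets the
// `?` operator turn an io::Error into a PipelineError automatically.
impl From<io::Error> for PipelineError {
    fn from(e: io::Error) -> Self {
        PipelineError::Io(e)
    }
}
```

Carrying the line number in the parse variant means a malformed record in a multi-gigabyte stream can be reported precisely instead of aborting with a generic message.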
Performance
Benchmark on a ~9.5 MB dataset (~65k events):
| Tool | Time |
|---|---|
| jq | 0.40s |
| octo-flow | 0.053s |
| grep | 0.001s |
grep is fastest because it only matches raw text and performs no structured parsing.
Octo Flow runs about 7.5 times faster than jq in this benchmark while still parsing every event as JSON.
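The table can also be read as throughput. A small sketch of the arithmetic, using the approximate figures quoted above (9.5 MB, 65k events); `rate` and `speedup` are hypothetical helper names.

```rust
/// Units processed per second, e.g. megabytes or events.
fn rate(amount: f64, seconds: f64) -> f64 {
    amount / seconds
}

/// How many times faster the `fast` run is than the `slow` run (both in seconds).
fn speedup(slow: f64, fast: f64) -> f64 {
    slow / fast
}
```

With the octo-flow time of 0.053s, `rate(9.5, 0.053)` gives roughly 179 MB/s and `rate(65_000.0, 0.053)` roughly 1.2 million events per second. Both are estimates, since the dataset size and event count are themselves approximate.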
Outcome
Octo Flow demonstrates how to build efficient, streaming data pipelines in Rust, combining:
- predictable memory usage
- high throughput
- strong type safety
This approach is especially useful for systems that need to process large volumes of data reliably.