Octo Flow is a high-performance command-line tool written in Rust for processing large streams of GitHub event data.

The Problem

GitHub Archive datasets are published as compressed NDJSON files, often reaching gigabytes in size. Processing this data efficiently is challenging:

  • Loading the full dataset into memory is not feasible
  • Traditional tools can be slow or memory-intensive
  • Simple tools like grep lack structured JSON parsing

The Solution

Octo Flow processes NDJSON streams line-by-line, transforming raw event data into clean tabular output while maintaining a constant memory footprint.

It is designed for:

  • data pipelines
  • log processing
  • analytics workflows
  • ETL preprocessing

Example

curl https://data.gharchive.org/2015-01-01-15.json.gz \
| zcat \
| octo-flow --input - --event WatchEvent

Example Output

Each row is tab-separated: event ID, creation timestamp, actor login, repository name, event type.

2489651057	2015-01-01T15:00:03Z	SametSisartenep	visionmedia/debug	WatchEvent
2489651078	2015-01-01T15:00:05Z	comcxx11	phpsysinfo/phpsysinfo	WatchEvent
2489651080	2015-01-01T15:00:05Z	Soufien	wasabeef/awesome-android-libraries	WatchEvent

Architecture

The tool uses a streaming pipeline:

input stream
      ↓
BufReader
      ↓
line iterator
      ↓
serde_json parser
      ↓
event filter
      ↓
TSV output

This design allows multi-gigabyte datasets to be processed using only a few megabytes of memory.
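
As a rough illustration of this pipeline shape, the sketch below reads NDJSON from any BufRead, filters by event type, and writes TSV rows. It is not Octo Flow's actual source. To stay runnable with the standard library alone, the serde_json step is replaced by a naive string scan. The names string_field and filter_events are hypothetical, and the scan assumes compact JSON (no spaces around colons) with no escaped quotes in the extracted values.

```rust
use std::io::{self, BufRead, Write};

/// Naive stand-in for the serde_json step (an assumption, not the real parser):
/// returns the first string value stored under `key`, borrowed from `line`.
/// It ignores nesting and escape sequences.
fn string_field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\":\"", key);
    let start = line.find(&needle)? + needle.len();
    let len = line[start..].find('"')?;
    Some(&line[start..start + len])
}

/// Reads NDJSON from `input` one line at a time and writes one TSV row to
/// `output` for every event whose type equals `wanted`.
/// Returns the number of rows written.
fn filter_events<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    wanted: &str,
) -> io::Result<usize> {
    let mut rows = 0;
    for line in input.lines() {
        // Propagate read errors instead of silently skipping lines.
        let line = line?;

        // Event filter: skip anything that is not the requested type.
        if string_field(&line, "type") != Some(wanted) {
            continue;
        }

        // Missing fields become empty columns in this sketch.
        let id = string_field(&line, "id").unwrap_or("");
        let created = string_field(&line, "created_at").unwrap_or("");
        let login = string_field(&line, "login").unwrap_or("");
        let repo = string_field(&line, "name").unwrap_or("");

        // TSV output: one row per matching event.
        writeln!(output, "{}\t{}\t{}\t{}\t{}", id, created, login, repo, wanted)?;
        rows += 1;
    }
    Ok(rows)
}
```

In a binary, filter_events could be called with locked stdin and stdout handles and an event name such as "WatchEvent". Only one line is held at a time, so memory use depends on the longest line rather than on the size of the input.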

Technical Highlights

  • Streaming JSON parsing with Serde
  • Zero-copy deserialization using &str
  • Buffered I/O with BufReader
  • Iterator-based processing
  • Structured error handling with thiserror
  • CLI integration tests using assert_cmd
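
To make the structured error handling point concrete: thiserror generates Display, Error, and From implementations from a derive macro. The sketch below writes the equivalent implementations by hand using only the standard library, so it runs without extra crates. The type PipelineError and its variants are hypothetical and are not taken from Octo Flow's source.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type for a streaming pipeline; variant names are illustrative.
#[derive(Debug)]
enum PipelineError {
    /// Reading input or writing output failed.
    Io(io::Error),
    /// A line could not be interpreted as an event.
    MalformedLine { line_number: usize },
}

// thiserror would derive this from #[error("...")] attributes.
impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PipelineError::Io(err) => write!(f, "I/O error: {}", err),
            PipelineError::MalformedLine { line_number } => {
                write!(f, "malformed event on line {}", line_number)
            }
        }
    }
}

// Exposes the underlying cause so callers can walk the error chain.
impl Error for PipelineError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            PipelineError::Io(err) => Some(err),
            PipelineError::MalformedLine { .. } => None,
        }
    }
}

// Lets the `?` operator convert io::Error into PipelineError automatically.
impl From<io::Error> for PipelineError {
    fn from(err: io::Error) -> Self {
        PipelineError::Io(err)
    }
}
```

With the From implementation in place, a function returning Result<(), PipelineError> can use `?` directly on I/O calls, and each failure keeps a distinct, matchable variant.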

Performance

Benchmark on a ~9.5 MB dataset (~65k events):

Tool         Time
jq           0.40s
octo-flow    0.053s
grep         0.001s

grep is the fastest, but it only matches text and does not parse JSON. Octo Flow runs roughly 7.5 times faster than jq on this dataset while still parsing every event as structured JSON.

Outcome

Octo Flow demonstrates how to build efficient, streaming data pipelines in Rust, combining:

  • predictable memory usage
  • high throughput
  • strong type safety

This approach is especially useful for systems that need to process large volumes of data reliably.