
Apr 2026 • 14 min read

Rewriting wc in Rust

A Rust rewrite of GNU wc that uses AVX2 SIMD to count words at 7.5 GB/s -- 27x faster than GNU coreutils in the default mode. The result came out of 12 systematic experiments in which branchless scalar code was a dead end and SIMD was the only path past 1 GB/s.

Tags: benchmarks, rust, rewrite study, simd

Rewrite study: first benchmark pass

  • Project: GNU wc
  • Baseline: wc (GNU coreutils) 9.1
  • Rewrite: wc-rs 0.1.0
  • Language: Rust

Single-file throughput (MB/s) on a 100 MB generated text corpus across default, line-only, word-only, char-only, and byte-only modes. Multi-file throughput on 1,000 files (~53 MB total). Three-way comparison against GNU wc and cw (Freaky/cw).

  • Default mode: 27x faster than GNU wc
  • Parity: 13/13 tests
  • Word counting: 7,490 MB/s
  • Multi-file (1,000 files): 17x faster than GNU wc
  • vs cw (Rust): 13.5x faster on default mode
  • Corpus: 1 generated corpus, 1 machine

  • On the default mode (-lwc), wc-rs processes at 7,490 MB/s versus GNU wc at 278 MB/s (27x) and cw at 555 MB/s (13.5x).
  • Three branchless scalar experiments all regressed (-25% to -30%). The CPU branch predictor is highly effective on natural text, and branchless approaches create data dependency chains that prevent speculation.
  • SIMD was the only path past 1 GB/s. SSE2 (16 bytes at a time) gave 3.5x over scalar, AVX2 (32 bytes) doubled that, and fusing line+word counting into one AVX2 pass added another 36%.
  • Character counting (-m) is trivially SIMD-able: count non-continuation bytes with a single AVX2 comparison. Result: 6,168 MB/s (21x GNU wc).
  • Multi-file workloads benefit from rayon parallelism: 1,000 files in 11ms versus GNU wc at 191ms.
  • This was a narrow result: one generated corpus, one x86_64 machine, and a benchmark harness with 5-run medians. The generated corpus is pure ASCII; UTF-8-heavy or binary-heavy workloads could change the throughput profile.

I rewrote GNU coreutils wc in Rust and ran 12 systematic experiments to find the fastest counting strategy. The result processes the default mode (lines, words, bytes) at 7,490 MB/s -- 27x faster than GNU wc and 13.5x faster than cw, the existing Rust alternative.

The interesting part was not the final number. It was discovering that the textbook optimization (branchless word counting) made things slower, and that SIMD was the only path past 1 GB/s. Three branchless experiments all regressed by 25-30% before the first SIMD experiment jumped throughput from 628 MB/s to 2,231 MB/s in one step.

[Figure: horizontal bar chart comparing default-mode throughput -- wc-rs at 7,490 MB/s, cw at 555 MB/s, GNU wc at 278 MB/s.]
Default-mode throughput on a 100 MB generated text corpus. wc-rs uses AVX2 SIMD to classify 32 bytes at a time; GNU wc processes bytes individually through mbrtowc().

The result in one view

100 MB generated text corpus, single file, 5-run median, 12-core x86_64 machine:

Mode           | GNU wc     | cw (Freaky/cw) | wc-rs      | vs GNU wc
Default (-lwc) | 278 MB/s   | 555 MB/s       | 7,490 MB/s | 27x
Words (-w)     | 280 MB/s   | 561 MB/s       | 6,991 MB/s | 25x
Lines (-l)     | 8,738 MB/s | 6,991 MB/s     | 9,533 MB/s | 1.1x
Chars (-m)     | 294 MB/s   | N/A            | 6,168 MB/s | 21x
Bytes (-c)     | fstat      | fstat          | fstat      | tied

Multi-file (1,000 files, ~53 MB total):

GNU wc | cw     | wc-rs
191 ms | 102 ms | 11 ms

The key comparison is default mode, where wc spends most of its time on the word-counting state machine. Line counting is already near memory bandwidth for all tools. Byte counting is just fstat.
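Since the -c path never touches file contents, it can be sketched in a few lines (a minimal illustration, not the actual wc-rs code; the function name is hypothetical):

```rust
use std::fs;

// -c fast path: the byte count is the file size reported by metadata
// (fstat under the hood), so the file contents are never read.
fn count_bytes(path: &std::path::Path) -> std::io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("wc_rs_fstat_demo.txt");
    fs::write(&path, b"hello world\n")?;
    println!("{} bytes", count_bytes(&path)?); // prints "12 bytes"
    Ok(())
}
```

This is why every tool in the table ties on -c: the work is one syscall regardless of file size.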

What was rewritten

The Rust implementation replaces GNU wc's C code with 586 lines of Rust (counting core) and 234 lines (CLI). The architecture:

Aspect        | GNU wc (C)                               | wc-rs (Rust)
Word counting | Byte-by-byte with iswspace() / mbrtowc() | AVX2 SIMD: 32 bytes at a time with cmpgt + movemask + popcount
Line counting | Byte-by-byte with \n check               | Fused into the AVX2 word loop via cmpeq
Char counting | mbrtowc() per byte                       | AVX2: count non-continuation bytes (b > 0xBF signed)
Byte counting | fstat fast path                          | Same
I/O           | Buffered read (128K)                     | mmap for regular files, 256K buffered fallback for stdin/pipes
Multi-file    | Sequential                               | rayon::par_iter across all CPU cores
CLI           | Custom C argument parsing                | clap derive
Dependencies  | libc                                     | clap, memchr, memmap2, rayon

Parity tests verify that wc-rs output matches GNU wc exactly across 13 scenarios: empty files, single-line, multi-line, no trailing newline, whitespace-heavy, tabs, binary data, UTF-8, control characters, multi-file totals, and stdin.

The word-counting problem

Word counting is the bottleneck in wc. Lines are cheap (scan for \n). Bytes are free (fstat). But words require tracking a state machine: you need to know whether the previous byte was inside a word to decide whether the current byte starts a new one.

GNU wc handles this with locale-aware character classification using mbrtowc() and iswspace(). For UTF-8 locales, this means decoding multibyte sequences on every byte, even when the input is pure ASCII. The cost adds up: ~280 MB/s on this corpus.

The naive Rust implementation replaces the locale machinery with a 256-byte lookup table that classifies each byte as whitespace, word character, or "other" (control bytes, continuation bytes). Three categories instead of two because GNU wc treats control characters as neither starting nor ending words -- they pass through without changing the word state. This matters for binary-adjacent data parity.

// Flag values are illustrative; the source does not show the actual bit assignments.
const IS_WHITESPACE: u8 = 1 << 0;
const IS_NEWLINE: u8 = 1 << 1;
const IS_WORD: u8 = 1 << 2;

static BYTE_CLASS: [u8; 256] = {
    let mut table = [0u8; 256];
    table[9] = IS_WHITESPACE;               // tab
    table[10] = IS_WHITESPACE | IS_NEWLINE; // newline
    // ... other whitespace (vertical tab, form feed, carriage return)
    table[32] = IS_WHITESPACE;              // space
    let mut i = 0x21;
    while i <= 0x7E { table[i] = IS_WORD; i += 1; } // printable ASCII
    let mut i = 0xC2;
    while i <= 0xF4 { table[i] = IS_WORD; i += 1; } // UTF-8 lead bytes
    table
};

This baseline runs at 517 MB/s. A few incremental improvements -- larger buffers (256K), memchr for newline counting, mmap for file I/O -- pushed it to 628 MB/s. All straightforward, all keeping the same byte-by-byte word state machine.
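That byte-by-byte state machine is short enough to sketch in full (a simplified, self-contained version; the flag encoding and function names are assumptions, not the actual wc-rs source):

```rust
const IS_WHITESPACE: u8 = 1;
const IS_WORD: u8 = 2;

// Simplified stand-in for the 256-entry BYTE_CLASS table.
fn byte_class(b: u8) -> u8 {
    match b {
        b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ' => IS_WHITESPACE,
        0x21..=0x7E => IS_WORD, // printable ASCII
        0xC2..=0xF4 => IS_WORD, // UTF-8 lead bytes
        _ => 0,                 // "other": control and continuation bytes
    }
}

// Branch-based baseline: one classification and one or two branches per byte.
fn count_words_scalar(data: &[u8]) -> u64 {
    let mut words = 0u64;
    let mut in_word = false;
    for &b in data {
        let class = byte_class(b);
        if class & IS_WHITESPACE != 0 {
            in_word = false;
        } else if class & IS_WORD != 0 && !in_word {
            in_word = true;
            words += 1;
        }
        // "other" bytes fall through: the word state is unchanged
    }
    words
}

fn main() {
    assert_eq!(count_words_scalar(b"hello  world\n"), 2);
    assert_eq!(count_words_scalar(b""), 0);
}
```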

The dead end: branchless word counting

The word state machine has branches:

if whitespace            -> leave word
if word_byte && !in_word -> enter word, increment count
else                     -> no change (control chars, continuation bytes)

On every byte. The textbook move is to make this branchless: encode the state transitions in a lookup table indexed by (current_state, byte_class), extract the new state and word increment with bit masks, avoid conditional jumps entirely.

I tried three variants:

  1. Two-table lookup -- BYTE_CLASS[byte] for classification, then TRANSITION[state * 4 + class] for the state update. Two memory accesses per byte instead of branches.

  2. Fused 512-byte table -- TABLE[state * 256 + byte] directly encodes both the new state and the word increment. Single lookup per byte, no branches at all.

  3. Scalar bitmask building -- Build a 64-bit bitmask of "is this byte a word character?" for each 64-byte chunk, then detect word starts as not-word-to-word transitions using shifts and popcount.
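As a concrete illustration of variant 2, a fused-table sketch might look like this (the bit packing is an assumption; the real experiment's encoding may differ):

```rust
// Fused 512-byte branchless table: TABLE[state * 256 + byte] packs the
// next state in bit 0 and the word-count increment in bit 1.
fn build_table() -> [u8; 512] {
    let mut t = [0u8; 512];
    for state in 0..2usize {
        for b in 0..256usize {
            let is_ws = matches!(b as u8, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ');
            let is_word = matches!(b as u8, 0x21..=0x7E | 0xC2..=0xF4);
            // "other" bytes keep the current state; only entering a word counts.
            let next = if is_ws { 0 } else if is_word { 1 } else { state as u8 };
            let inc = (state == 0 && is_word) as u8;
            t[state * 256 + b] = next | (inc << 1);
        }
    }
    t
}

fn count_words_branchless(data: &[u8], table: &[u8; 512]) -> u64 {
    let mut state = 0usize;
    let mut words = 0u64;
    for &b in data {
        // The next lookup depends on this one's result: a serial
        // L1-latency chain with no branches for the CPU to speculate past.
        let entry = table[state * 256 + b as usize];
        state = (entry & 1) as usize;
        words += (entry >> 1) as u64;
    }
    words
}

fn main() {
    let table = build_table();
    assert_eq!(count_words_branchless(b"two words", &table), 2);
}
```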

All three regressed. Badly.

Experiment            | Throughput | Change
Branch-based baseline | 628 MB/s   | --
Two-table branchless  | 454 MB/s   | -28%
Fused 512-byte table  | 442 MB/s   | -30%
Scalar bitmask        | 472 MB/s   | -25%

The reason: the CPU branch predictor is highly effective on natural text. English prose has predictable word/space patterns -- words average ~5 characters, spaces ~1 character. The predictor learns "stay in current state" and gets it right roughly 84% of the time. Well-predicted branches are essentially free.

Branchless approaches replace those free branches with guaranteed data dependency chains. Every byte's result depends on the previous byte's state update. The CPU cannot speculate ahead; it must wait for the table lookup and bit extraction to complete before processing the next byte. The latency of a single L1 cache hit (~4 cycles) becomes the throughput bottleneck.

This is the core finding from the scalar phase: on predictable data, branches beat branchless. The optimization literature often treats "branchless" as universally faster, but that assumes high misprediction rates. For wc on text, the prediction rate is high enough that the branch predictor outperforms any lookup-table scheme.

The breakthrough: SIMD

If you cannot make each byte faster, process more bytes at once.

The insight: word counting reduces to detecting transitions in a bitmask. If I can classify 32 bytes as "word" or "not-word" simultaneously using SIMD, then word starts are just (~word_bits << 1) & word_bits -- positions where the current byte is a word byte and the previous was not. That is a shift, a NOT, an AND, and a popcount. No per-byte state machine at all.
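The bit trick is easy to verify on a plain integer before touching intrinsics (a sketch with a hypothetical helper):

```rust
// Word-start detection on a chunk's classification mask.
// Bit i is 1 iff byte i is a word byte; carry_in is 1 iff the byte
// just before this chunk was NOT a word byte.
fn word_starts(word_bits: u32, carry_in: u32) -> u32 {
    (((!word_bits) << 1) | carry_in) & word_bits
}

fn main() {
    // "ab cd" -> mask 0b11011 (bit 0 = first byte). Two word starts.
    assert_eq!(word_starts(0b11011, 1).count_ones(), 2);
    // A word continuing across a chunk boundary: carry_in = 0
    // suppresses the spurious start at bit 0.
    assert_eq!(word_starts(0b1, 0).count_ones(), 0);
}
```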

SSE2: 16 bytes at a time (628 -> 2,231 MB/s)

The first SIMD implementation uses SSE2 (baseline on all x86_64):

  1. Load 16 bytes into a 128-bit register
  2. Classify word bytes: cmpgt(v, 0x20) catches printable ASCII (0x21-0x7F), OR'd with a range check for UTF-8 lead bytes (0xC2-0xF4)
  3. movemask extracts a 16-bit integer where each bit indicates whether the corresponding byte is a word character
  4. Word starts: ((!word_bits) << 1 | carry) & word_bits
  5. popcount the result

A single carry bit propagates the word state across chunk boundaries. The entire loop body is a handful of SIMD instructions plus scalar bit manipulation -- no per-byte branches, no table lookups, no data dependency chains between bytes within the same chunk.
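The whole SSE2 loop can be modeled in portable scalar Rust -- same mask math, same carry, just without the intrinsics (a sketch; the real loop builds word_bits with cmpgt + movemask):

```rust
// Matches the article's SIMD classification: printable ASCII or a
// UTF-8 lead byte. Everything else is "not a word byte" in this model.
fn is_word_byte(b: u8) -> bool {
    matches!(b, 0x21..=0x7E | 0xC2..=0xF4)
}

fn count_words_chunked(data: &[u8]) -> u64 {
    let mut words = 0u64;
    let mut carry = 1u32; // before any input we are outside a word
    for chunk in data.chunks(16) {
        // Build the per-chunk mask in scalar code (SSE2 does this with
        // one cmpgt + movemask).
        let mut word_bits = 0u32;
        for (i, &b) in chunk.iter().enumerate() {
            word_bits |= (is_word_byte(b) as u32) << i;
        }
        let starts = (((!word_bits) << 1) | carry) & word_bits;
        words += starts.count_ones() as u64;
        // Carry 1 into the next chunk if this chunk ends outside a word.
        carry = 1 - ((word_bits >> (chunk.len() - 1)) & 1);
    }
    words
}

fn main() {
    // A 20-byte input forces a word to span the 16-byte chunk boundary.
    assert_eq!(count_words_chunked(b"aaaaaaaaaaaaaaaabbbb"), 1);
    assert_eq!(count_words_chunked(b"one two three"), 3);
}
```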

Result: 2,231 MB/s, a 3.5x jump over the scalar baseline.

AVX2: 32 bytes at a time (2,231 -> 5,519 MB/s)

Same algorithm, double the register width. With AVX2 (runtime-detected), the movemask produces a 32-bit integer, and the bit manipulation processes 32 bytes per iteration. The throughput improvement is nearly linear with register width.

Fused line+word counting (5,519 -> 7,490 MB/s)

At 5.5 GB/s, the bottleneck shifted from computation to memory bandwidth. The default mode was running two passes: memchr for newlines (SIMD internally) and the AVX2 word loop. Two reads of the same data.

Fusing them into a single AVX2 loop -- adding a cmpeq for newline detection alongside the word classification -- halved the memory bandwidth requirement and added 36%:

// Setup values below are reconstructed from the algorithm description;
// the tail (< 32 bytes) is handled by a scalar fallback, omitted here.
#[target_feature(enable = "avx2")]
unsafe fn count_words_lines_avx2(data: &[u8]) -> (u64, u64) {
    use std::arch::x86_64::*;

    let newline = _mm256_set1_epi8(b'\n' as i8);
    let x20 = _mm256_set1_epi8(0x20);          // word bytes compare greater (signed)
    let xc1 = _mm256_set1_epi8(0xC1u8 as i8);  // UTF-8 lead range: 0xC2..=0xF4
    let xf5 = _mm256_set1_epi8(0xF5u8 as i8);
    let (mut words, mut lines) = (0u64, 0u64);
    let mut prev_carry = 1u32; // before any input we are outside a word

    for chunk in data.chunks_exact(32) {
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);

        // Line counting: compare each byte against '\n'
        let nl_bits = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, newline)) as u32;
        lines += nl_bits.count_ones() as u64;

        // Word byte detection: printable ASCII OR UTF-8 lead bytes
        let gt_x20 = _mm256_cmpgt_epi8(v, x20);
        let utf8_lead = _mm256_and_si256(
            _mm256_cmpgt_epi8(v, xc1),
            _mm256_cmpgt_epi8(xf5, v),
        );
        let word_bits = _mm256_movemask_epi8(
            _mm256_or_si256(gt_x20, utf8_lead)
        ) as u32;

        // Word starts: previous byte was not a word byte, current is
        let word_starts = ((!word_bits) << 1 | prev_carry) & word_bits;
        words += word_starts.count_ones() as u64;
        // Carry into the next chunk: 1 if this chunk ends outside a word
        prev_carry = if word_bits & (1 << 31) != 0 { 0 } else { 1 };
    }
    (words, lines)
}

At 7,490 MB/s on a 100 MB file, this is approaching the practical memory bandwidth ceiling on this machine. Further loop unrolling (64 bytes per iteration, two AVX2 loads) showed no improvement, confirming the bottleneck is memory throughput rather than instruction count.

Character counting

GNU wc's -m mode counts Unicode codepoints using mbrtowc(). In UTF-8, codepoint counting is equivalent to counting bytes that are not continuation bytes (0x80-0xBF). The rest -- ASCII bytes and UTF-8 lead bytes -- each start one codepoint.

With AVX2, this is a single comparison per 32 bytes:

let threshold = _mm256_set1_epi8(-65i8); // 0xBF as signed
let non_cont = _mm256_cmpgt_epi8(v, threshold);
chars += (_mm256_movemask_epi8(non_cont) as u32).count_ones() as u64;

Bytes > 0xBF in signed representation are 0x00-0x7F (ASCII) and 0xC0-0xFF (lead bytes). One comparison, one movemask, one popcount. Result: 6,168 MB/s, or 21x GNU wc.
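The same trick is easy to check in portable Rust before reaching for intrinsics (a sketch; the function name is hypothetical):

```rust
// Count UTF-8 codepoints by counting non-continuation bytes: in signed
// i8 terms, continuation bytes 0x80-0xBF are -128..=-65, so every byte
// with signed value > -65 starts a codepoint.
fn count_chars(data: &[u8]) -> u64 {
    data.iter().filter(|&&b| (b as i8) > -65).count() as u64
}

fn main() {
    assert_eq!(count_chars("héllo".as_bytes()), 5);  // é is 2 bytes, 1 codepoint
    assert_eq!(count_chars("日本語".as_bytes()), 3); // 3 bytes per codepoint
    assert_eq!(count_chars(b"ascii"), 5);
}
```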

Multi-file parallelism

For wc *.log workloads, rayon distributes file counting across all CPU cores. Each file gets its own mmap + AVX2 counting pass, and results are collected in original order for sequential printing (matching GNU wc's output format). On 1,000 files (~53 MB total), this takes 11 ms versus 191 ms for GNU wc.

The implementation falls back to sequential processing when stdin (-) is in the file list, since stdin cannot be parallelized.
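The per-file fan-out can be sketched with std::thread::scope standing in for rayon (counting is reduced to byte length here to keep the sketch short; names are hypothetical):

```rust
use std::fs;
use std::thread;

// Order-preserving parallel fan-out: each file is processed
// independently, and joining in spawn order restores CLI argument
// order -- which is what makes sequential printing match GNU wc.
fn count_files(paths: &[String]) -> Vec<std::io::Result<u64>> {
    thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || fs::read(p).map(|d| d.len() as u64)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let dir = std::env::temp_dir();
    let a = dir.join("wc_rs_par_a.txt");
    let b = dir.join("wc_rs_par_b.txt");
    fs::write(&a, b"abc").unwrap();
    fs::write(&b, b"hello").unwrap();
    let results = count_files(&[a.display().to_string(), b.display().to_string()]);
    assert_eq!(*results[0].as_ref().unwrap(), 3);
    assert_eq!(*results[1].as_ref().unwrap(), 5);
}
```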

The experiment trajectory

All 12 experiments were run using an autoresearch loop: make one focused change, measure 5-run median throughput on the 100 MB corpus, run parity tests, keep improvements, revert regressions. Dead ends are recorded with hypotheses and rollback reasons so they are never revisited.

#  | Experiment                              | Throughput | Status
1  | Baseline: byte-by-byte, 64K buffer      | 517 MB/s   | keep
2  | 256-byte LUT for byte classification    | 555 MB/s   | keep (+7%)
3  | 256K buffers                            | 586 MB/s   | keep (+6%)
4  | memchr for newlines, separate word loop | 620 MB/s   | keep (+6%)
5  | mmap for regular files                  | 628 MB/s   | keep (+1%)
6  | Branchless: two-table transition        | 454 MB/s   | discard (-28%)
7  | Branchless: fused 512-byte table        | 442 MB/s   | discard (-30%)
8  | Scalar bitmask building                 | 472 MB/s   | discard (-25%)
9  | SSE2 SIMD word counting (16 bytes)      | 2,231 MB/s | keep (+255%)
10 | AVX2 upgrade (32 bytes)                 | 5,519 MB/s | keep (+147%)
11 | Fused line+word AVX2 pass               | 7,490 MB/s | keep (+36%)
12 | 64-byte loop unroll                     | 7,490 MB/s | discard (no gain)
[Figure: scatter plot of throughput across 12 experiments -- scalar phase (517-628 MB/s), branchless dead ends (442-472 MB/s), SIMD breakthrough (2,231-7,490 MB/s).]
Throughput trajectory across 12 experiments. Green dots are kept, red are discarded. The branchless valley (experiments 6-8) and the SIMD step function (9-11) are clearly visible.

The trajectory has a clear shape: incremental scalar gains (experiments 1-5), a dead-end valley (6-8), and a step-function SIMD breakthrough (9-11) that saturated at memory bandwidth (12). The autoresearch loop surfaced the branchless dead end quickly -- three experiments with clear rollback data -- rather than letting it consume days of manual optimization effort.

Limitations

[Figure: four cards summarizing the study limitations -- one generated ASCII corpus, one x86_64 machine, 5-run medians without confidence intervals, and GNU wc locale overhead contributing to the gap.]
The 27x result is real but bounded by corpus diversity, platform coverage, measurement depth, and the locale overhead difference between wc-rs and GNU wc.
  1. One corpus. All throughput numbers come from a single 100 MB generated text file of random ASCII words and spaces. Real-world text (UTF-8 heavy, binary-adjacent, tab-heavy) could shift the throughput profile. In particular, the -L (max line length) mode is still scalar because tab expansion creates a per-byte column dependency that SIMD cannot parallelize.

  2. One machine. All benchmarks ran on one 12-core x86_64 Linux machine with AVX2. ARM (no AVX2), older x86 (SSE2 fallback), or machines with different memory bandwidth would produce different absolute numbers. The SSE2 fallback path exists but was not separately benchmarked.

  3. Generated corpus. The benchmark corpus is pure ASCII with no multibyte characters. The SIMD word-counting logic handles UTF-8 lead bytes correctly (they start words) but continuation bytes are classified as "other" (no state change). This matches GNU wc behavior on valid UTF-8 but has not been stress-tested on malformed or heavily multibyte input.

  4. 5-run medians. Each data point is a 5-run median with filesystem cache warm. No confidence intervals are computed. The throughput numbers are reproducible across runs on this machine but formal statistical analysis was not performed.

  5. GNU wc locale overhead. GNU wc uses mbrtowc() and iswspace() in UTF-8 locales, which adds per-byte overhead that a byte-level classification table avoids. The 27x speedup partially reflects this locale overhead difference rather than pure algorithmic improvement. Running GNU wc under LC_ALL=C would narrow the gap.

Correctness

Output matches GNU wc on well-formed text inputs across 15 parity test scenarios (including stdin with each single flag and -m combinations). The tests compare full stdout output (formatting, column widths, filenames, totals) between GNU wc and wc-rs. Known limitation: -m counts non-continuation bytes, which differs from GNU wc's mbrtowc() on malformed UTF-8 sequences.

Category                         | Tests | Coverage
Empty file                       | 7     | Default, -l, -w, -c, -m, -L, -lwc
Single line with/without newline | 12    | All flag combinations
Multiple lines                   | 4     | Default, -l, -w, -lwc
Whitespace-heavy                 | 5     | Default, -l, -w, -c, -L
Tabs                             | 2     | Default, -L
Only newlines / only spaces      | 7     | Default, -l, -w, -c, -L
Binary/control chars             | 3     | Default, -l, -w
UTF-8                            | 4     | Default, -m, -c, -L
Large line (10K chars)           | 2     | Default, -L
Multi-file                       | 1     | Two-file output with total
Stdin                            | 1     | Piped input

Three-state word classification (whitespace / word byte / other) is necessary for correct binary parity. GNU wc in UTF-8 mode treats invalid byte sequences and control characters as neither whitespace nor word starters, and the LUT-based classification matches this behavior.

How this was built

The implementation was built using Factory with the autoresearch pattern. The optimization phase ran 12 experiments in a single session with structured keep/discard decisions, JSONL experiment logging, and automatic rollback on regressions.

The test-to-source ratio is roughly 0.2:1 by line count (164 test lines vs 820 source lines), but the 13 parity tests each invoke both GNU wc and wc-rs on the same input and compare full output, which gives broad behavioral coverage despite the low line count.

Reproducibility

git clone https://github.com/sagaragas/wc-rs.git
cd wc-rs
 
# Build
cargo build --release
 
# Run parity tests
cargo test --release
 
# Generate benchmark corpus and run benchmarks
bash bench/generate_corpus.sh
bash bench/benchmark.sh

The benchmark script measures 5-run medians for default, -l, -w, -c, -m, and -L modes, comparing wc-rs against GNU wc. Install cw (cargo install cw) for three-way comparison.
