Apr 2026 • 14 min read
Rewriting wc in Rust
A Rust rewrite of GNU wc that uses AVX2 SIMD to count words at 7.5 GB/s -- 27x faster than GNU coreutils in the default mode. Twelve systematic experiments showed that branchless scalar counting was a dead end and SIMD the only path past 1 GB/s.
Rewrite study
First benchmark pass

Single-file throughput (MB/s) on a 100 MB generated text corpus across default, line-only, word-only, char-only, and byte-only modes. Multi-file throughput on 1,000 files (~53 MB total). Three-way comparison against GNU wc and cw (Freaky/cw).
Default mode
27x faster than GNU wc
Parity
13/13 tests
Word counting
7,490 MB/s
Multi-file (1000)
17x faster than GNU wc
vs cw (Rust)
13.5x faster on default
Corpus
1 generated, 1 machine
- On the default mode (-lwc), wc-rs processes at 7,490 MB/s versus GNU wc at 278 MB/s (27x) and cw at 555 MB/s (13.5x).
- Three branchless scalar experiments all regressed (-25% to -30%). The CPU branch predictor is highly effective on natural text, and branchless approaches create data dependency chains that prevent speculation.
- SIMD was the only path past 1 GB/s. SSE2 (16 bytes at a time) gave 3.5x over scalar, AVX2 (32 bytes) doubled that, and fusing line+word counting into one AVX2 pass added another 36%.
- Character counting (-m) is trivially SIMD-able: count non-continuation bytes with a single AVX2 comparison. Result: 6,168 MB/s (21x GNU wc).
- Multi-file workloads benefit from rayon parallelism: 1,000 files in 11ms versus GNU wc at 191ms.
- This was a narrow result: one generated corpus, one x86_64 machine, and a benchmark harness with 5-run medians. The generated corpus is pure ASCII; UTF-8-heavy or binary-heavy workloads could change the throughput profile.
I rewrote GNU coreutils wc in Rust and ran 12 systematic experiments to find the fastest counting strategy. The result processes the default mode (lines, words, bytes) at 7,490 MB/s -- 27x faster than GNU wc and 13.5x faster than cw, the existing Rust alternative.
The interesting part was not the final number. It was discovering that the textbook optimization (branchless word counting) made things slower, and that SIMD was the only path past 1 GB/s. Three branchless experiments all regressed by 25-30% before the first SIMD experiment jumped throughput from 628 MB/s to 2,231 MB/s in one step.
The result in one view
100 MB generated text corpus, single file, 5-run median, 12-core x86_64 machine:
| Mode | GNU wc | cw (Freaky/cw) | wc-rs | vs GNU wc |
|---|---|---|---|---|
| Default (-lwc) | 278 MB/s | 555 MB/s | 7,490 MB/s | 27x |
| Words (-w) | 280 MB/s | 561 MB/s | 6,991 MB/s | 25x |
| Lines (-l) | 8,738 MB/s | 6,991 MB/s | 9,533 MB/s | 1.1x |
| Chars (-m) | 294 MB/s | N/A | 6,168 MB/s | 21x |
| Bytes (-c) | fstat | fstat | fstat | tied |
Multi-file (1,000 files, ~53 MB total):
| GNU wc | cw | wc-rs |
|---|---|---|
| 191 ms | 102 ms | 11 ms |
The key comparison is default mode, where wc spends most of its time on the word-counting state machine. Line counting is already near memory bandwidth for all tools. Byte counting is just fstat.
What was rewritten
The Rust implementation replaces GNU wc's C code with 586 lines of Rust (counting core) and 234 lines (CLI). The architecture:
| Aspect | GNU wc (C) | wc-rs (Rust) |
|---|---|---|
| Word counting | Byte-by-byte with iswspace() / mbrtowc() | AVX2 SIMD: 32 bytes at a time with cmpgt + movemask + popcount |
| Line counting | Byte-by-byte with \n check | Fused into the AVX2 word loop via cmpeq |
| Char counting | mbrtowc() per byte | AVX2: count non-continuation bytes (b > 0xBF signed) |
| Byte counting | fstat fast path | Same |
| I/O | Buffered read (128K) | mmap for regular files, 256K buffered fallback for stdin/pipes |
| Multi-file | Sequential | rayon::par_iter across all CPU cores |
| CLI | Custom C argument parsing | clap derive |
| Dependencies | libc | clap, memchr, memmap2, rayon |
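The I/O row can be sketched without the memmap2 dependency. A minimal stand-in (the function name and the use of std::fs::read in place of mmap are assumptions for illustration):

```rust
use std::io::{self, BufReader, Read};

const BUF_SIZE: usize = 256 * 1024; // 256K fallback buffer

/// Illustrative sketch of the input strategy: the real code maps regular
/// files with memmap2; std::fs::read stands in here so the sketch stays
/// dependency-free. Stdin and pipes get a 256K buffered reader.
fn read_input(path: Option<&str>) -> io::Result<Vec<u8>> {
    match path {
        // real code: memmap2::Mmap::map(&File::open(p)?) for regular files
        Some(p) => std::fs::read(p),
        None => {
            let mut buf = Vec::new();
            BufReader::with_capacity(BUF_SIZE, io::stdin()).read_to_end(&mut buf)?;
            Ok(buf)
        }
    }
}
```

Either way, the counting code sees a single contiguous `&[u8]`, which is what lets the SIMD loops below stay simple.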
Parity tests verify that wc-rs output matches GNU wc exactly across 13 scenarios: empty files, single-line, multi-line, no trailing newline, whitespace-heavy, tabs, binary data, UTF-8, control characters, multi-file totals, and stdin.
The word-counting problem
Word counting is the bottleneck in wc. Lines are cheap (scan for \n). Bytes are free (fstat). But words require tracking a state machine: you need to know whether the previous byte was inside a word to decide whether the current byte starts a new one.
GNU wc handles this with locale-aware character classification using mbrtowc() and iswspace(). For UTF-8 locales, this means decoding multibyte sequences on every byte, even when the input is pure ASCII. The cost adds up: ~280 MB/s on this corpus.
The naive Rust implementation replaces the locale machinery with a 256-byte lookup table that classifies each byte as whitespace, word character, or "other" (control bytes, continuation bytes). Three categories instead of two because GNU wc treats control characters as neither starting nor ending words -- they pass through without changing the word state. This matters for binary-adjacent data parity.
```rust
// Flag bits (exact values are illustrative; the names are from the source)
const IS_WHITESPACE: u8 = 1 << 0;
const IS_NEWLINE: u8 = 1 << 1;
const IS_WORD: u8 = 1 << 2;

static BYTE_CLASS: [u8; 256] = {
    let mut table = [0u8; 256];
    table[9] = IS_WHITESPACE; // tab
    table[10] = IS_WHITESPACE | IS_NEWLINE; // newline
    // ... other whitespace
    table[32] = IS_WHITESPACE; // space
    let mut i = 0x21;
    while i <= 0x7E { table[i] = IS_WORD; i += 1; } // printable ASCII
    let mut i = 0xC2;
    while i <= 0xF4 { table[i] = IS_WORD; i += 1; } // UTF-8 lead bytes
    table
};
```

This baseline runs at 517 MB/s. A few incremental improvements -- larger buffers (256K), memchr for newline counting, mmap for file I/O -- pushed it to 628 MB/s. All straightforward, all keeping the same byte-by-byte word state machine.
The dead end: branchless word counting
The word state machine has branches:

```
if whitespace             -> leave word
if word_byte && !in_word  -> enter word, increment count
else                      -> no change (control chars, continuation bytes)
```

On every byte. The textbook move is to make this branchless: encode the state transitions in a lookup table indexed by (current_state, byte_class), extract the new state and word increment with bit masks, and avoid conditional jumps entirely.
I tried three variants:

- Two-table lookup -- `BYTE_CLASS[byte]` for classification, then `TRANSITION[state * 4 + class]` for the state update. Two memory accesses per byte instead of branches.
- Fused 512-byte table -- `TABLE[state * 256 + byte]` directly encodes both the new state and the word increment. Single lookup per byte, no branches at all.
- Scalar bitmask building -- build a 64-bit bitmask of "is this byte a word character?" for each 64-byte chunk, then detect word starts as not-word-to-word transitions using shifts and `popcount`.
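For concreteness, here is what the two-table variant looks like (my reconstruction from the description above, not the project's code; the match-based classifier stands in for the 256-entry BYTE_CLASS table):

```rust
const OTHER: u8 = 0; // control chars, UTF-8 continuation bytes
const WS: u8 = 1;    // whitespace
const WORD: u8 = 2;  // word bytes

// Stand-in for the 256-entry BYTE_CLASS lookup table.
fn byte_class(b: u8) -> u8 {
    match b {
        b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ' => WS,
        0x21..=0x7E | 0xC2..=0xF4 => WORD,
        _ => OTHER,
    }
}

// TRANSITION[state * 4 + class] packs (new_state << 1) | word_increment.
// state: 0 = outside a word, 1 = inside a word.
const TRANSITION: [u8; 8] = [
    0b00, // outside + other      -> outside, +0
    0b00, // outside + whitespace -> outside, +0
    0b11, // outside + word       -> inside,  +1
    0b00, // (unused class slot)
    0b10, // inside  + other      -> inside,  +0 (control bytes pass through)
    0b00, // inside  + whitespace -> outside, +0
    0b10, // inside  + word       -> inside,  +0
    0b00, // (unused class slot)
];

fn count_words_branchless(data: &[u8]) -> u64 {
    let mut state = 0u8;
    let mut words = 0u64;
    for &b in data {
        let t = TRANSITION[(state * 4 + byte_class(b)) as usize];
        state = t >> 1;
        words += (t & 1) as u64; // no conditional jump on the hot path
    }
    words
}
```

The correctness is fine; the problem is the serial dependency: `state` for byte N+1 cannot be computed until the lookup for byte N retires.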
All three regressed. Badly.
| Experiment | Throughput | Change |
|---|---|---|
| Branch-based baseline | 628 MB/s | -- |
| Two-table branchless | 454 MB/s | -28% |
| Fused 512-byte table | 442 MB/s | -30% |
| Scalar bitmask | 472 MB/s | -25% |
The reason: the CPU branch predictor is highly effective on natural text. English prose has predictable word/space patterns -- words average ~5 characters, spaces ~1 character. The predictor learns "stay in current state" and gets it right roughly 84% of the time. Well-predicted branches are essentially free.
Branchless approaches replace those free branches with guaranteed data dependency chains. Every byte's result depends on the previous byte's state update. The CPU cannot speculate ahead; it must wait for the table lookup and bit extraction to complete before processing the next byte. The latency of a single L1 cache hit (~4 cycles) becomes the throughput bottleneck.
This is the core finding from the scalar phase: on predictable data, branches beat branchless. The optimization literature often treats "branchless" as universally faster, but that assumes high misprediction rates. For wc on text, the prediction rate is high enough that the branch predictor outperforms any lookup-table scheme.
The breakthrough: SIMD
If you cannot make each byte faster, process more bytes at once.
The insight: word counting reduces to detecting transitions in a bitmask. If I can classify 32 bytes as "word" or "not-word" simultaneously using SIMD, then word starts are just (~word_bits << 1) & word_bits -- positions where the current byte is a word byte and the previous was not. That is a shift, a NOT, an AND, and a popcount. No per-byte state machine at all.
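The trick can be demonstrated with plain integer operations before any intrinsics appear. A portable sketch over u64 masks (my own, not the project's code) -- building the mask byte-by-byte is exactly what made the scalar-bitmask experiment slow, but it shows the counting logic:

```rust
/// Count words by building a per-chunk "is word byte" bitmask and
/// popcounting not-word -> word transitions. 64-byte chunks and u64
/// masks stand in for AVX2 registers and 32-bit movemasks.
fn count_words_bitmask(data: &[u8]) -> u64 {
    let mut words = 0u64;
    let mut prev_carry: u64 = 1; // the "byte before the stream" is not a word byte
    for chunk in data.chunks(64) {
        let mut word_bits: u64 = 0;
        for (i, &b) in chunk.iter().enumerate() {
            // Simplified classifier: printable ASCII above space.
            if b > 0x20 && b < 0x7F {
                word_bits |= 1 << i;
            }
        }
        // A word starts where the previous bit is 0 and the current bit is 1.
        let word_starts = ((!word_bits) << 1 | prev_carry) & word_bits;
        words += word_starts.count_ones() as u64;
        // Carry across chunks: was the last byte of this chunk a non-word byte?
        let last = chunk.len() - 1;
        prev_carry = if (word_bits >> last) & 1 == 1 { 0 } else { 1 };
    }
    words
}
```

SIMD replaces the inner byte loop with one compare + movemask, which is where the speedup comes from; the shift/AND/popcount stays identical.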
SSE2: 16 bytes at a time (628 -> 2,231 MB/s)
The first SIMD implementation uses SSE2 (baseline on all x86_64):
- Load 16 bytes into a 128-bit register
- Classify word bytes: `cmpgt(v, 0x20)` catches printable ASCII (0x21-0x7F), OR'd with a range check for UTF-8 lead bytes (0xC2-0xF4)
- `movemask` extracts a 16-bit integer where each bit indicates whether the corresponding byte is a word character
- Compute word starts: `((!word_bits) << 1 | carry) & word_bits`
- `popcount` the result
A single carry bit propagates the word state across chunk boundaries. The entire loop body is a handful of SIMD instructions plus scalar bit manipulation -- no per-byte branches, no table lookups, no data dependency chains between bytes within the same chunk.
Result: 2,231 MB/s, a 3.5x jump over the scalar baseline.
AVX2: 32 bytes at a time (2,231 -> 5,519 MB/s)
Same algorithm, double the register width. With AVX2 (runtime-detected), the movemask produces a 32-bit integer, and the bit manipulation processes 32 bytes per iteration. The throughput improvement is nearly linear with register width.
Fused line+word counting (5,519 -> 7,490 MB/s)
At 5.5 GB/s, the bottleneck shifted from computation to memory bandwidth. The default mode was running two passes: memchr for newlines (SIMD internally) and the AVX2 word loop. Two reads of the same data.
Fusing them into a single AVX2 loop -- adding a cmpeq for newline detection alongside the word classification -- halved the memory bandwidth requirement and added 36%:
```rust
#[target_feature(enable = "avx2")]
unsafe fn count_words_lines_avx2(data: &[u8]) -> (u64, u64) {
    use std::arch::x86_64::*;

    // Setup (reconstructed from the loop body below)
    let newline = _mm256_set1_epi8(b'\n' as i8);
    let x20 = _mm256_set1_epi8(0x20); // v > 0x20 (signed) catches printable ASCII
    let xc1 = _mm256_set1_epi8(0xC1u8 as i8); // lower bound for UTF-8 lead bytes
    let xf5 = _mm256_set1_epi8(0xF5u8 as i8); // upper bound for UTF-8 lead bytes
    let (mut words, mut lines) = (0u64, 0u64);
    let mut prev_carry = 1u32; // the "byte before the stream" is not a word byte

    for chunk in data.chunks_exact(32) {
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // Line counting: compare each byte against '\n'
        let nl_bits = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, newline)) as u32;
        lines += nl_bits.count_ones() as u64;
        // Word byte detection: printable ASCII OR UTF-8 lead bytes (0xC2-0xF4)
        let gt_x20 = _mm256_cmpgt_epi8(v, x20);
        let utf8_lead = _mm256_and_si256(
            _mm256_cmpgt_epi8(v, xc1),
            _mm256_cmpgt_epi8(xf5, v),
        );
        let word_bits = _mm256_movemask_epi8(
            _mm256_or_si256(gt_x20, utf8_lead)
        ) as u32;
        // Word starts: previous byte was not a word byte, current is
        let word_starts = ((!word_bits) << 1 | prev_carry) & word_bits;
        words += word_starts.count_ones() as u64;
        prev_carry = if word_bits & (1 << 31) != 0 { 0 } else { 1 };
    }
    // ... scalar tail for the final <32 bytes omitted
    (words, lines)
}
```

At 7,490 MB/s on a 100 MB file, this is approaching the practical memory bandwidth ceiling on this machine. Further loop unrolling (64 bytes per iteration, two AVX2 loads) showed no improvement, confirming the bottleneck is memory throughput rather than instruction count.
Character counting
GNU wc's -m mode counts Unicode codepoints using mbrtowc(). In UTF-8, codepoint counting is equivalent to counting bytes that are not continuation bytes (0x80-0xBF). The rest -- ASCII bytes and UTF-8 lead bytes -- each start one codepoint.
With AVX2, this is a single comparison per 32 bytes:

```rust
let threshold = _mm256_set1_epi8(-65i8); // 0xBF as signed
let non_cont = _mm256_cmpgt_epi8(v, threshold);
chars += (_mm256_movemask_epi8(non_cont) as u32).count_ones() as u64;
```

Bytes > 0xBF in signed representation are 0x00-0x7F (ASCII) and 0xC0-0xFF (lead bytes). One comparison, one movemask, one popcount. Result: 6,168 MB/s, or 21x GNU wc.
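A scalar sketch of the same rule makes it easy to sanity-check against Rust's own UTF-8 decoding (illustrative, not the project's code):

```rust
/// Count UTF-8 codepoints by counting non-continuation bytes
/// (everything outside 0x80..=0xBF). The signed test `b as i8 > -65`
/// is the same predicate the AVX2 cmpgt above evaluates per lane.
fn count_chars(data: &[u8]) -> u64 {
    data.iter().filter(|&&b| (b as i8) > -65).count() as u64
}
```

On valid UTF-8 this agrees with `str::chars().count()`; on malformed input the byte-level rule and mbrtowc() can diverge, as noted in the limitations.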
Multi-file parallelism
For wc *.log workloads, rayon distributes file counting across all CPU cores. Each file gets its own mmap + AVX2 counting pass, and results are collected in original order for sequential printing (matching GNU wc's output format). On 1,000 files (~53 MB total), this takes 11 ms versus 191 ms for GNU wc.
The implementation falls back to sequential processing when stdin (-) is in the file list, since stdin cannot be parallelized.
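The pattern can be sketched without rayon using std::thread::scope; `count` below is a toy stand-in for the real mmap + AVX2 pass:

```rust
use std::thread;

/// Parallel multi-file counting in the same shape as the rayon version,
/// but dependency-free. Returns (lines, words) per input, in input order.
fn count_all(inputs: &[&[u8]]) -> Vec<(u64, u64)> {
    fn count(data: &[u8]) -> (u64, u64) {
        let lines = data.iter().filter(|&&b| b == b'\n').count() as u64;
        let words = data
            .split(u8::is_ascii_whitespace)
            .filter(|w| !w.is_empty())
            .count() as u64;
        (lines, words)
    }
    thread::scope(|s| {
        // Spawn one task per file...
        let handles: Vec<_> = inputs.iter().map(|&d| s.spawn(move || count(d))).collect();
        // ...then join in spawn order, so printed output matches input order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

rayon's par_iter adds work-stealing and bounded thread counts on top of this shape, which matters when the file list is large and file sizes are skewed.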
The experiment trajectory
All 12 experiments were run using an autoresearch loop: make one focused change, measure 5-run median throughput on the 100 MB corpus, run parity tests, keep improvements, revert regressions. Dead ends are recorded with hypotheses and rollback reasons so they are never revisited.
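The measurement step is small enough to sketch (an illustrative harness, not the project's benchmark code):

```rust
use std::time::{Duration, Instant};

/// Time five runs of the same workload and take the median, which is
/// robust to a single stray slow run (cache miss, scheduler hiccup).
fn median_of_5(mut f: impl FnMut()) -> Duration {
    let mut samples: Vec<Duration> = (0..5)
        .map(|_| {
            let t = Instant::now();
            f();
            t.elapsed()
        })
        .collect();
    samples.sort();
    samples[2] // the 3rd of 5 sorted samples
}
```

Throughput is then just corpus size divided by the median duration.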
| # | Experiment | Throughput | Status |
|---|---|---|---|
| 1 | Baseline: byte-by-byte, 64K buffer | 517 MB/s | keep |
| 2 | 256-byte LUT for byte classification | 555 MB/s | keep (+7%) |
| 3 | 256K buffers | 586 MB/s | keep (+6%) |
| 4 | memchr for newlines, separate word loop | 620 MB/s | keep (+6%) |
| 5 | mmap for regular files | 628 MB/s | keep (+1%) |
| 6 | Branchless: two-table transition | 454 MB/s | discard (-28%) |
| 7 | Branchless: fused 512-byte table | 442 MB/s | discard (-30%) |
| 8 | Scalar bitmask building | 472 MB/s | discard (-25%) |
| 9 | SSE2 SIMD word counting (16 bytes) | 2,231 MB/s | keep (+255%) |
| 10 | AVX2 upgrade (32 bytes) | 5,519 MB/s | keep (+147%) |
| 11 | Fused line+word AVX2 pass | 7,490 MB/s | keep (+36%) |
| 12 | 64-byte loop unroll | 7,490 MB/s | discard (no gain) |
The trajectory has a clear shape: incremental scalar gains (experiments 1-5), a dead-end valley (6-8), and a step-function SIMD breakthrough (9-11) that saturated at memory bandwidth (12). The autoresearch loop surfaced the branchless dead end quickly -- three experiments with clear rollback data -- rather than letting it consume days of manual optimization effort.
Limitations
- One corpus. All throughput numbers come from a single 100 MB generated text file of random ASCII words and spaces. Real-world text (UTF-8-heavy, binary-adjacent, tab-heavy) could shift the throughput profile. In particular, the -L (max line length) mode is still scalar because tab expansion creates a per-byte column dependency that SIMD cannot parallelize.
- One machine. All benchmarks ran on one 12-core x86_64 Linux machine with AVX2. ARM (no AVX2), older x86 (SSE2 fallback), or machines with different memory bandwidth would produce different absolute numbers. The SSE2 fallback path exists but was not separately benchmarked.
- Generated corpus. The benchmark corpus is pure ASCII with no multibyte characters. The SIMD word-counting logic handles UTF-8 lead bytes correctly (they start words) and classifies continuation bytes as "other" (no state change). This matches GNU wc behavior on valid UTF-8 but has not been stress-tested on malformed or heavily multibyte input.
- 5-run medians. Each data point is a 5-run median with the filesystem cache warm. No confidence intervals are computed. The throughput numbers are reproducible across runs on this machine, but formal statistical analysis was not performed.
- GNU wc locale overhead. GNU wc uses mbrtowc() and iswspace() in UTF-8 locales, which adds per-byte overhead that a byte-level classification table avoids. The 27x speedup partially reflects this locale-overhead difference rather than pure algorithmic improvement. Running GNU wc under LC_ALL=C would narrow the gap.
Correctness
Output matches GNU wc on well-formed text inputs across 15 parity test scenarios (including stdin with each single flag and -m combinations). The tests compare full stdout output (formatting, column widths, filenames, totals) between GNU wc and wc-rs. Known limitation: -m counts non-continuation bytes, which differs from GNU wc's mbrtowc() on malformed UTF-8 sequences.
| Category | Tests | Coverage |
|---|---|---|
| Empty file | 7 | Default, -l, -w, -c, -m, -L, -lwc |
| Single line with/without newline | 12 | All flag combinations |
| Multiple lines | 4 | Default, -l, -w, -lwc |
| Whitespace-heavy | 5 | Default, -l, -w, -c, -L |
| Tabs | 2 | Default, -L |
| Only newlines / only spaces | 7 | Default, -l, -w, -c, -L |
| Binary/control chars | 3 | Default, -l, -w |
| UTF-8 | 4 | Default, -m, -c, -L |
| Large line (10K chars) | 2 | Default, -L |
| Multi-file | 1 | Two-file output with total |
| Stdin | 1 | Piped input |
Three-state word classification (whitespace / word byte / other) is necessary for correct binary parity. GNU wc in UTF-8 mode treats invalid byte sequences and control characters as neither whitespace nor word starters, and the LUT-based classification matches this behavior.
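A minimal demonstration of why the third class is needed (illustrative sketches, not the project's functions):

```rust
fn is_word(b: u8) -> bool {
    matches!(b, 0x21..=0x7E | 0xC2..=0xF4)
}
fn is_ws(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ')
}

/// Matches the LUT behavior: "other" bytes leave word state untouched.
fn three_state(data: &[u8]) -> u64 {
    let (mut words, mut in_word) = (0u64, false);
    for &b in data {
        if is_ws(b) {
            in_word = false;
        } else if is_word(b) && !in_word {
            in_word = true;
            words += 1;
        }
        // "other" bytes (control chars, continuation bytes): no state change
    }
    words
}

/// Naive binary classification: anything non-word resets word state,
/// so a control byte in the middle of a word splits it in two.
fn two_state(data: &[u8]) -> u64 {
    let (mut words, mut in_word) = (0u64, false);
    for &b in data {
        if is_word(b) {
            if !in_word {
                words += 1;
            }
            in_word = true;
        } else {
            in_word = false; // control bytes wrongly end the word here
        }
    }
    words
}
```

On "foo\x01bar", the three-state version reports one word (matching GNU wc's pass-through behavior) while the two-state version reports two.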
How this was built
The implementation was built using Factory with the autoresearch pattern. The optimization phase ran 12 experiments in a single session with structured keep/discard decisions, JSONL experiment logging, and automatic rollback on regressions.
The test-to-source ratio is roughly 0.2:1 by line count (164 test lines vs 820 source lines), but the 13 parity tests each invoke both GNU wc and wc-rs on the same input and compare full output, which gives broad behavioral coverage despite the low line count.
Reproducibility
```shell
git clone https://github.com/sagaragas/wc-rs.git
cd wc-rs

# Build
cargo build --release

# Run parity tests
cargo test --release

# Generate benchmark corpus and run benchmarks
bash bench/generate_corpus.sh
bash bench/benchmark.sh
```

The benchmark script measures 5-run medians for default, -l, -w, -c, -m, and -L modes, comparing wc-rs against GNU wc. Install cw (cargo install cw) for three-way comparison.