Mar 2026 • 12 min read
Rewriting a Python web log parser in Go
I rewrote a Python web log parser as a Go HTTP service. On 1.89 million lines of real NASA access logs, the Go version parses 3.3x faster than the Python baseline. A parity harness verifies both implementations produce identical output before allowing performance claims.
Rewrite study
Published

Head-to-head on 1.89M lines of real NASA Kennedy Space Center access logs (July 1995). Go parses 3.3x faster than Python.
Speedup
3.3x faster
Go throughput
~485K lines/sec
Memory (RSS)
~723 MiB (Go) vs ~718 MiB (Py)
Dataset
1.89M lines (NASA KSC)
Parity
9/9 fields match
- Go parses 1.89M lines of real NASA access logs in 3.9s vs 13.0s for Python -- a 3.3x speedup.
- Both implementations use the same regex and produce identical workload counts (1,887,880 matched, 3,834 malformed).
- Peak RSS is comparable (~723 MiB Go vs ~718 MiB Python). The speedup comes from CPU, not memory.
- The benchmark harness gates performance claims behind field-level parity checks and verified fairness controls.
I rewrote a Python CLI log parser as a Go HTTP service and benchmarked the result on 1.89 million lines of real NASA access logs. Both implementations use the same named-group regex for Apache/Nginx combined log format. The Go version parses 3.3x faster than the Python baseline.
The more interesting part is how the benchmark enforces honesty. A cross-language parity harness compares nine output fields between Go and Python, and only unlocks performance claims when every field matches. The harness also verifies its own fairness controls -- cache posture, CPU pinning, iteration symmetry -- and sets claimable: false if any control cannot be proven.
Correctness came first
Before looking at speed, the harness checks that both implementations did the same work on the same input. The parity gate compares two categories of fields:
Workload accounting -- the raw counts that prove both parsers saw the same data:
- `input_bytes`: exact byte count from a counting reader wrapping the input stream
- `total_lines`, `matched_lines`, `filtered_lines`, `rejected_lines`: line-level disposition
- `row_count`: number of parsed records passed to the summary stage
Canonical summary -- the aggregated output that users actually see:
- `requests_total`: total matched requests
- `requests_per_sec`: derived from the timestamp span in the input data (first record to last), not wall-clock time
- `ranked_requests`: deterministic ranking by count descending, then path ascending, then method ascending
If any field drifts between baseline and rewrite, the harness sets performance_claims_allowed: false and the run fails. You cannot get a benchmark number out of this system without first proving correctness.
On the NASA dataset, both implementations agree exactly: 1,887,880 matched, 1 filtered, 3,834 malformed. The 3,834 malformed lines are 1995-era requests with unencoded spaces in the URL (e.g., GET /htbin/wais.pl?orbit sts71 HTTP/1.0), which both parsers reject because the regex requires \S+ for each request field.
The parser
Both implementations use the same regex pattern. The Go version compiles it once at package init:
```go
var combinedLogRegex = regexp.MustCompile(
	`^(?P<remote>\S+)\s+` +
		`(?P<ident>\S+)\s+` +
		`(?P<auth>\S+)\s+` +
		`\[(?P<timestamp>[^\]]+)\]\s+` +
		`"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)"\s+` +
		`(?P<status>\d+)\s+` +
		`(?P<size>\d+|-)`,
)
```

Go's regexp package uses a Thompson NFA, which guarantees linear time in the input length but does not support backreferences or lookahead. For this pattern that is fine -- combined log format is regular. The Python baseline uses re.compile with the same named groups on a backtracking engine, but the pattern has no pathological backtracking paths, so both engines behave linearly here.
The engine wraps a bufio.Scanner around a counting reader to track exact input bytes without off-by-one errors from newline counting:
```go
counter := &countingReader{reader: r}
scanner := bufio.NewScanner(counter)
for scanner.Scan() {
	line := scanner.Text()
	result.TotalLines++
	rec, err := e.parseLine(line)
	if err != nil {
		result.Malformed++
		continue
	}
	if rec == nil {
		result.Filtered++
		continue
	}
	result.Matched++
	result.Records = append(result.Records, *rec)
}
result.InputBytes = counter.count
```

Each parsed line produces a Record struct with five fields: Timestamp, Method, Path, Status, Size. In Go these are value types that live contiguously in the Records slice backing array. The Python baseline produces equivalent tuples, but each one is a separate heap allocation with pointer indirection. At 1.89 million lines, the allocation pressure difference is measurable.
Health check paths (/healthz, /readyz, /ping, /alive, /_health) are filtered and tracked separately. The filter list is intentionally short and uses exact-match plus prefix-match, not another regex.
Where the 3.3x comes from
The regex match itself is not where most of the time goes. Profiling the Go version shows two dominant costs:
Timestamp parsing. Each line's timestamp (27/Mar/2026:22:35:03 -0700) must be parsed into a structured time value. Go's time.Parse compiles the layout string into a sequence of fixed-width field extractors at call time. Python's datetime.strptime does the same conceptual work but through a more general code path with per-call format string interpretation. On 1.89 million calls, Go's approach is significantly cheaper.
Per-line allocation. Go's Record is a value type (struct with no pointer indirection for the fixed fields). The Records slice grows with append, which uses geometric doubling -- the allocator makes O(log n) allocation calls for n records. Python builds a list of tuples where each tuple is a separate heap object. CPython's reference-counting allocator handles this correctly but does more work per object.
The regex match cost is similar in both languages for this pattern. Go's regexp is not especially fast (it prioritizes worst-case guarantees over best-case throughput), and Python's re with a non-pathological pattern is adequate. The gap is in everything that happens after the match.
Peak RSS is roughly the same (~723 MiB Go vs ~718 MiB Python), which confirms this is a CPU-time win, not a memory-efficiency win. Both implementations hold all parsed records in memory, so the working set is dominated by the record slice/list.
The canonical summary
After parsing, the summary stage aggregates records by method+path and produces a deterministic ranking:
```go
sort.SliceStable(sum.RankedRequests, func(i, j int) bool {
	if sum.RankedRequests[i].Count != sum.RankedRequests[j].Count {
		return sum.RankedRequests[i].Count > sum.RankedRequests[j].Count
	}
	if sum.RankedRequests[i].Path != sum.RankedRequests[j].Path {
		return sum.RankedRequests[i].Path < sum.RankedRequests[j].Path
	}
	return sum.RankedRequests[i].Method < sum.RankedRequests[j].Method
})
```

The three-level sort (count descending, path ascending, method ascending) with SliceStable guarantees identical output for identical input. This matters for the parity gate: if the ranking were nondeterministic, you could not diff Go output against Python output and call it a correctness check.
requests_per_sec is derived from the timestamp span in the data (first record to last record), not from wall-clock parsing time. An earlier version used wall-clock duration, which made the rate change depending on machine speed and broke benchmark parity.
The same Summary struct flows into API responses, HTML reports, and benchmark parity checks. There is no separate rendering path that could quietly disagree with the canonical computation.
Benchmark methodology
The cross-language harness in internal/bench does not just time two programs. It enforces that they produce the same output before comparing their speed, and it verifies its own fairness controls.
Fairness controls
Five controls must be symmetric between baseline and rewrite:
| Control | What it enforces |
|---|---|
| `warmup_iterations` | Same number of discarded warmup runs |
| `measured_iterations` | Same number of timed runs |
| `cache_posture` | Both start from the same file cache state (cold or warm) |
| `concurrency` | Serialized execution; no parallel advantage |
| `max_procs` | CPU pinning via `taskset -c 0` to prevent migration |
Each control is not just declared but verified after execution. The harness checks that warmup and measured iteration counts match what was declared, that cache drops succeeded, that taskset was applied. If any control cannot be proven, claimable goes to false:
```go
report.Claimable = report.Symmetric
for _, evidence := range controlEvidence {
	if !evidence.Claimable {
		report.Claimable = false
		break
	}
}
```

Execution order
To prevent systematic ordering bias, the harness alternates which implementation runs first in each round. The starting order is determined by hashing the scenario ID:
```go
func scenarioStartsWithBaseline(scenarioID string) bool {
	sum := sha256.Sum256([]byte(strings.TrimSpace(scenarioID)))
	return sum[len(sum)-1]%2 == 0
}
```

Even rounds reverse the order. This means neither implementation consistently benefits from a warmer CPU cache or scheduler state.
Resource measurement
Wall time is measured with time.Since. CPU time and peak RSS come from getrusage via the process state after cmd.Run() completes:
```go
func processUsage(state *os.ProcessState) (float64, int64) {
	usage, ok := state.SysUsage().(*syscall.Rusage)
	if !ok {
		return 0, 0
	}
	cpu := durationFromTimeval(usage.Utime) + durationFromTimeval(usage.Stime)
	return cpu.Seconds() * 1000, int64(usage.Maxrss)
}
```

This captures user + system CPU time and the high-water RSS mark for each iteration, which is more precise than external timing tools for short-lived processes.
Results
Both parsers ran on the NASA Kennedy Space Center HTTP access logs from July 1995: 1,891,715 lines of real production web server traffic, freely redistributable from the Internet Traffic Archive.
Head-to-head
| Metric | Go 1.26 | Python 3.11 | Ratio |
|---|---|---|---|
| Mean wall time | 3.91s | 13.01s | 3.3x |
| Std dev | 0.07s | 0.10s | |
| Lines/sec | ~485,000 | ~145,000 | 3.3x |
| MB/sec | 52.5 | 15.0 | 3.5x |
| Peak RSS | 723 MiB | 718 MiB | ~1x |
Ten iterations per implementation. Both processed identical workloads: 1,887,880 matched, 1 filtered, 3,834 malformed. The parity gate passed on all nine fields. Fairness controls were verified and the run was marked claimable: true.
The reported MB/sec ratio (3.5x) is slightly higher than the lines/sec ratio (3.3x). Since both implementations process the same bytes and the same lines, the two ratios describe the same wall-time gap; the discrepancy is rounding in the reported figures rather than a real throughput difference.
What the numbers do not show
The 3.3x result measures end-to-end parsing of a specific format on a specific dataset on a specific machine. It does not generalize to:
- Other log formats. JSON logs, for example, would shift the bottleneck from regex + timestamp parsing to JSON decoding, where Python's `json` module (backed by C) is competitive.
- Smaller inputs. On the synthetic 5-line and 18-line corpora, the speedup was ~1.9x because process startup dominated measurement. The gap widens at scale because the per-line cost dominates.
- Optimized Python. A Python implementation using `re2` bindings, compiled Cython parsing, or even `mmap` with manual field extraction would close the gap. The 3.3x reflects CPython 3.11's `re.match` + `datetime.strptime` on this workload, not an optimized Python implementation.
Environment
All numbers from one machine: Intel i5-12500T (12 logical cores), 64 GB RAM, Debian 12, Linux 6.17, Go 1.26, Python 3.11.2. Both pinned to CPU 0 via taskset -c 0.
The HTTP service
The Go version is not just a parser -- it is an HTTP service that accepts log file uploads and returns structured results. The service is a single binary with zero dependencies outside the Go standard library.
POST /v1/analyses accepts multipart uploads. The handler validates content type, checks format support, rejects unsafe filenames, and enforces a size limit. If the queue is full, it returns 429 with a Retry-After header. Submissions during the startup window get 503 from the /readyz gate.
Jobs go through a state machine: queued -> running -> succeeded | failed | expired. Idempotent submissions are supported via an Idempotency-Key header. The idempotency check uses a mutex around the reservation path to prevent concurrent same-key requests from creating duplicates.
The report surface at /reports/{id} renders self-contained HTML with inline SVG charts. No CDN, no external fonts, no JavaScript fetches. The page works offline.
The design constraint that matters most: the canonical summary struct is the single source of truth. The same struct flows into /v1/analyses/{id}/summary (JSON), /reports/{id} (HTML), and the benchmark parity checker. There is no code path where the API response can disagree with the browser report.
Limitations
- Single host. All numbers come from one i5-12500T. Different CPUs, operating systems, and filesystems would give different numbers. ARM, for example, would change the relative cost of Go's `time.Parse` versus Python's `strptime`.
- Combined log format only. The parser handles Apache/Nginx combined format. JSON logs, Caddy format, or anything else would need a new parser function and a separate benchmark scenario.
- Vanilla CPython baseline. The 3.3x reflects CPython 3.11's standard library `re` and `datetime.strptime`. A Python rewrite using `google-re2`, Cython-compiled parsing, or even PyPy would produce a different ratio. This is a comparison against unoptimized Python, not against the best Python can do.
- No persistence. Jobs live in memory with a configurable retention period (`PARSERGO_RETENTION`, defaults to 24h). Restarting the service loses everything. This is adequate for a single-user tool, not for production deployment.
- Process-level benchmark. The cross-language harness runs each implementation as a subprocess and collects `getrusage` metrics. It does not use hardware performance counters or isolated cgroup namespaces. The fairness controls (CPU pinning, cache posture, serialized execution) mitigate but do not eliminate environmental noise.
Reproducibility
The repo includes Go-native benchmarks that anyone can run. A 10,000-line NASA sample is committed at benchmark/corpora/nasa/nasa_10k.log. For the full dataset:
```shell
curl -o /tmp/NASA_access_log_Jul95.gz \
  ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
gunzip /tmp/NASA_access_log_Jul95.gz
mv /tmp/NASA_access_log_Jul95 /tmp/nasa_jul95
go test -bench=BenchmarkParse_NASAFull \
  -benchmem ./internal/analysis/
```

For the cross-language parity harness:
```shell
BENCH_BASELINE_PYTHON=/path/to/python \
BENCH_LEGACY_REPO=/path/to/web-log-parser \
go run ./cmd/bench run --scenario synthetic-small
```

All test suites pass with `go test ./...`.
How this was built
I did not write this by hand. The entire project -- service, tests, benchmark harness -- was built using Factory mission mode. The mission system broke the work into five milestones (foundation, service slice, hardening, benchmark harness, publication), assigned worker sessions to each feature, and ran automated scrutiny reviews and user-testing validators after every implementation step.
Some things the scrutiny pass caught that I would have missed in manual review:
- The job store's `Update()` was replacing the entire record, zeroing out `CreatedAt` on state transitions. Fix: fetch-then-mutate.
- `requests_per_sec` was derived from wall-clock duration, making it nondeterministic across runs. Fix: derive from the timestamp span in the input data.
- The `/reports` index iterated a Go map, producing random ordering. Fix: explicit `sort.Slice` by `CreatedAt` descending.
- Non-file multipart fields had no size limit. Fix: `io.LimitReader` at 1 KiB.
- The idempotency check was a non-atomic lookup-then-store. Fix: mutex around the reservation path.
These are not exotic bugs. They are the kind of thing that survives code review because each one looks fine in isolation. The mission system caught them because it tested against specific behavioral contracts after every feature, automatically.
Total output: 38 feature sessions across 5 milestones, 20 source files plus 13 test files, 8 passing test suites, and a working HTTP service with a benchmark harness. One orchestrated run.
Source
The full source is at github.com/sagaragas/parser-go. Apache-2.0 licensed. 20 source files, 13 test files, ~12,000 lines of Go, zero external dependencies.