
Mar 2026 • 12 min read

Rewriting a Python web log parser in Go

I rewrote a Python web log parser as a Go HTTP service. On 1.89 million lines of real NASA access logs, the Go version parses 3.3x faster than the Python baseline. A parity harness verifies both implementations produce identical output before allowing performance claims.

Tags: benchmarks, go, rewrite study, systems

Rewrite study • Published

Project: parser-go
Baseline: Python parser @ 904f838
Rewrite: parsergo (Go) @ 8e2ef20
Language: Go

Head-to-head on 1.89M lines of real NASA Kennedy Space Center access logs (July 1995). Go parses 3.3x faster than Python.

Speedup: 3.3x faster
Go throughput: ~485K lines/sec
Memory (RSS): ~723 MiB (Go) vs ~718 MiB (Py)
Dataset: 1.89M lines (NASA KSC)
Parity: 9/9 fields match

  • Go parses 1.89M real NASA access logs in 3.9s vs 13.0s for Python -- a 3.3x speedup.
  • Both implementations use the same regex and produce identical workload counts (1,887,880 matched, 3,834 malformed).
  • Peak RSS is comparable (~723 MiB Go vs ~718 MiB Python). The speedup comes from CPU, not memory.
  • The benchmark harness gates performance claims behind field-level parity checks and verified fairness controls.

I rewrote a Python CLI log parser as a Go HTTP service and benchmarked the result on 1.89 million lines of real NASA access logs. Both implementations use the same named-group regex for Apache/Nginx combined log format. The Go version parses 3.3x faster than the Python baseline.

The more interesting part is how the benchmark enforces honesty. A cross-language parity harness compares nine output fields between Go and Python, and only unlocks performance claims when every field matches. The harness also verifies its own fairness controls -- cache posture, CPU pinning, iteration symmetry -- and sets claimable: false if any control cannot be proven.

Horizontal bar chart comparing Go and Python wall time on 1.89M lines of NASA access logs. Go finishes in 3.91 seconds, Python in 13.01 seconds.
Mean wall time on 1.89M lines of NASA KSC access logs (July 1995). Go is 3.3x faster; peak RSS is comparable.

Correctness came first

Before looking at speed, the harness checks that both implementations did the same work on the same input. The parity gate compares two categories of fields:

Workload accounting -- the raw counts that prove both parsers saw the same data:

  • input_bytes: exact byte count from a counting reader wrapping the input stream
  • total_lines, matched_lines, filtered_lines, rejected_lines: line-level disposition
  • row_count: number of parsed records passed to the summary stage

Canonical summary -- the aggregated output that users actually see:

  • requests_total: total matched requests
  • requests_per_sec: derived from the timestamp span in the input data (first record to last), not wall-clock time
  • ranked_requests: deterministic ranking by count descending, then path ascending, then method ascending

If any field drifts between baseline and rewrite, the harness sets performance_claims_allowed: false and the run fails. You cannot get a benchmark number out of this system without first proving correctness.
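As a sketch of that gating logic (the struct shape and field names below are my flattening of the nine fields into one comparable value; the real harness's types may differ):

```go
package main

import "fmt"

// ParityFields is a hypothetical flattening of the nine parity-checked
// fields into one comparable struct.
type ParityFields struct {
	InputBytes, TotalLines, MatchedLines, FilteredLines, RejectedLines, RowCount int64
	RequestsTotal                                                                int64
	RequestsPerSec                                                               float64
	RankedDigest                                                                 string // stand-in for the ranked_requests comparison
}

// parityGate permits performance claims only when every field matches.
func parityGate(baseline, rewrite ParityFields) bool {
	return baseline == rewrite // comparable struct: field-by-field equality
}

func main() {
	base := ParityFields{MatchedLines: 1887880, FilteredLines: 1, RejectedLines: 3834}
	rw := base
	fmt.Println(parityGate(base, rw)) // true: claims unlocked
	rw.MatchedLines++
	fmt.Println(parityGate(base, rw)) // false: run fails
}
```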

Grid showing nine parity-checked fields: six workload-accounting fields and three canonical summary fields, all passing.
All nine parity fields matched between Go and Python on the NASA dataset. Both produced 1,887,880 matched lines, 1 filtered, 3,834 malformed.

On the NASA dataset, both implementations agree exactly: 1,887,880 matched, 1 filtered, 3,834 malformed. The 3,834 malformed lines are 1995-era requests with unencoded spaces in the URL (e.g., GET /htbin/wais.pl?orbit sts71 HTTP/1.0), which both parsers reject because the regex requires \S+ for each request field.
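To see the rejection concretely, here is the same pattern against two lines: a well-formed one (illustrative, not quoted from the dataset) and the space-in-URL shape the post describes:

```go
package main

import (
	"fmt"
	"regexp"
)

// Same combined-log pattern as the parser: each request field must be \S+.
var combinedLogRegex = regexp.MustCompile(
	`^(?P<remote>\S+)\s+(?P<ident>\S+)\s+(?P<auth>\S+)\s+` +
		`\[(?P<timestamp>[^\]]+)\]\s+` +
		`"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)"\s+` +
		`(?P<status>\d+)\s+(?P<size>\d+|-)`,
)

func main() {
	good := `host - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985`
	bad := `host - - [01/Jul/1995:00:00:06 -0400] "GET /htbin/wais.pl?orbit sts71 HTTP/1.0" 200 3985`
	fmt.Println(combinedLogRegex.MatchString(good)) // true
	fmt.Println(combinedLogRegex.MatchString(bad))  // false: unencoded space splits the path
}
```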

The parser

Both implementations use the same regex pattern. The Go version compiles it once at package init:

var combinedLogRegex = regexp.MustCompile(
    `^(?P<remote>\S+)\s+` +
        `(?P<ident>\S+)\s+` +
        `(?P<auth>\S+)\s+` +
        `\[(?P<timestamp>[^\]]+)\]\s+` +
        `"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)"\s+` +
        `(?P<status>\d+)\s+` +
        `(?P<size>\d+|-)`,
)

Go's regexp package uses a Thompson NFA, which guarantees linear time in the input length but does not support backreferences or lookahead. That is fine here -- combined log format is regular. The Python baseline compiles the same named-group pattern with re.compile; CPython's re is a backtracking engine, but the pattern has no pathological backtracking paths, so both engines behave linearly on this input.

The engine wraps a bufio.Scanner around a counting reader to track exact input bytes without off-by-one errors from newline counting:

counter := &countingReader{reader: r}
scanner := bufio.NewScanner(counter)
 
for scanner.Scan() {
    line := scanner.Text()
    result.TotalLines++
 
    rec, err := e.parseLine(line)
    if err != nil {
        result.Malformed++
        continue
    }
    if rec == nil {
        result.Filtered++
        continue
    }
    result.Matched++
    result.Records = append(result.Records, *rec)
}
result.InputBytes = counter.count

Each parsed line produces a Record struct with five fields: Timestamp, Method, Path, Status, Size. In Go these are value types that live contiguously in the Records slice backing array. The Python baseline produces equivalent tuples, but each one is a separate heap allocation with pointer indirection. At 1.89 million lines, the allocation pressure difference is measurable.
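A sketch of what such a value-type Record might look like (only the five field names come from the post; the field types are my assumption):

```go
package main

import (
	"fmt"
	"time"
)

// Record sketches the five-field value type described in the post.
type Record struct {
	Timestamp time.Time
	Method    string
	Path      string
	Status    int
	Size      int64
}

func main() {
	// Value types append into one contiguous backing array: the allocator
	// grows the slice geometrically instead of allocating one heap object
	// per record, as a list of Python tuples would.
	recs := make([]Record, 0)
	for i := 0; i < 3; i++ {
		recs = append(recs, Record{Method: "GET", Path: "/", Status: 200})
	}
	fmt.Println(len(recs), cap(recs) >= 3)
}
```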

Health check paths (/healthz, /readyz, /ping, /alive, /_health) are filtered and tracked separately. The filter list is intentionally short and uses exact-match plus prefix-match, not another regex.
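A minimal sketch of that filter, assuming "prefix-match" means a path segment under one of the listed paths (that interpretation is mine):

```go
package main

import (
	"fmt"
	"strings"
)

// healthPaths mirrors the post's filter list: exact-match plus
// prefix-match, deliberately not another regex.
var healthPaths = []string{"/healthz", "/readyz", "/ping", "/alive", "/_health"}

func isHealthCheck(path string) bool {
	for _, p := range healthPaths {
		if path == p || strings.HasPrefix(path, p+"/") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isHealthCheck("/healthz"))     // true
	fmt.Println(isHealthCheck("/ping/deep"))   // true (prefix match)
	fmt.Println(isHealthCheck("/api/healthz")) // false: not a listed prefix
}
```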

Where the 3.3x comes from

The regex match itself is not where most of the time goes. Profiling the Go version shows two dominant costs:

Timestamp parsing. Each line's timestamp (27/Mar/2026:22:35:03 -0700) must be parsed into a structured time value. Go's time.Parse compiles the layout string into a sequence of fixed-width field extractors at call time. Python's datetime.strptime does the same conceptual work but through a more general code path with per-call format string interpretation. On 1.89 million calls, Go's approach is significantly cheaper.

Per-line allocation. Go's Record is a value type (struct with no pointer indirection for the fixed fields). The Records slice grows with append, which uses geometric doubling -- the allocator makes O(log n) allocation calls for n records. Python builds a list of tuples where each tuple is a separate heap object. CPython's reference-counting allocator handles this correctly but does more work per object.

The regex match cost is similar in both languages for this pattern. Go's regexp is not especially fast (it prioritizes worst-case guarantees over best-case throughput), and Python's re with a non-pathological pattern is adequate. The gap is in everything that happens after the match.

Peak RSS is roughly the same (~723 MiB Go vs ~718 MiB Python), which confirms this is a CPU-time win, not a memory-efficiency win. Both implementations hold all parsed records in memory, so the working set is dominated by the record slice/list.

The canonical summary

After parsing, the summary stage aggregates records by method+path and produces a deterministic ranking:

sort.SliceStable(sum.RankedRequests, func(i, j int) bool {
    if sum.RankedRequests[i].Count != sum.RankedRequests[j].Count {
        return sum.RankedRequests[i].Count > sum.RankedRequests[j].Count
    }
    if sum.RankedRequests[i].Path != sum.RankedRequests[j].Path {
        return sum.RankedRequests[i].Path < sum.RankedRequests[j].Path
    }
    return sum.RankedRequests[i].Method < sum.RankedRequests[j].Method
})

The three-level sort (count descending, path ascending, method ascending) with SliceStable guarantees identical output for identical input. This matters for the parity gate: if the ranking were nondeterministic, you could not diff Go output against Python output and call it a correctness check.

requests_per_sec is derived from the timestamp span in the data (first record to last record), not from wall-clock parsing time. An earlier version used wall-clock duration, which made the rate change depending on machine speed and broke benchmark parity.

The same Summary struct flows into API responses, HTML reports, and benchmark parity checks. There is no separate rendering path that could quietly disagree with the canonical computation.

Benchmark methodology

The cross-language harness in internal/bench does not just time two programs. It enforces that they produce the same output before comparing their speed, and it verifies its own fairness controls.

Fairness controls

Five controls must be symmetric between baseline and rewrite:

| Control | What it enforces |
| --- | --- |
| warmup_iterations | Same number of discarded warmup runs |
| measured_iterations | Same number of timed runs |
| cache_posture | Both start from the same file cache state (cold or warm) |
| concurrency | Serialized execution; no parallel advantage |
| max_procs | CPU pinning via taskset -c 0 to prevent migration |

Each control is not just declared but verified after execution. The harness checks that warmup and measured iteration counts match what was declared, that cache drops succeeded, that taskset was applied. If any control cannot be proven, claimable goes to false:

report.Claimable = report.Symmetric
for _, evidence := range controlEvidence {
    if !evidence.Claimable {
        report.Claimable = false
        break
    }
}

Execution order

To prevent systematic ordering bias, the harness alternates which implementation runs first in each round. The starting order is determined by hashing the scenario ID:

func scenarioStartsWithBaseline(scenarioID string) bool {
    sum := sha256.Sum256([]byte(strings.TrimSpace(scenarioID)))
    return sum[len(sum)-1]%2 == 0
}

Even rounds reverse the order. This means neither implementation consistently benefits from a warmer CPU cache or scheduler state.
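One plausible reconstruction of the alternation (XOR-ing the hash-derived start with the round parity is my reading of "even rounds reverse the order"):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// scenarioStartsWithBaseline is the function from the post.
func scenarioStartsWithBaseline(scenarioID string) bool {
	sum := sha256.Sum256([]byte(strings.TrimSpace(scenarioID)))
	return sum[len(sum)-1]%2 == 0
}

func main() {
	start := scenarioStartsWithBaseline("synthetic-small")
	for round := 0; round < 4; round++ {
		// Flip the starting order on odd rounds so neither implementation
		// always runs into a warmer cache.
		baselineFirst := start != (round%2 == 1)
		fmt.Println(round, baselineFirst)
	}
}
```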

Resource measurement

Wall time is measured with time.Since. CPU time and peak RSS come from getrusage via the process state after cmd.Run() completes:

func processUsage(state *os.ProcessState) (float64, int64) {
    usage, ok := state.SysUsage().(*syscall.Rusage)
    if !ok {
        return 0, 0
    }
    cpu := durationFromTimeval(usage.Utime) + durationFromTimeval(usage.Stime)
    return cpu.Seconds() * 1000, int64(usage.Maxrss)
}

This captures user + system CPU time and the high-water RSS mark for each iteration, which is more precise than external timing tools for short-lived processes.

Results

Both parsers ran on the NASA Kennedy Space Center HTTP access logs from July 1995: 1,891,715 lines of real production web server traffic, freely redistributable from the Internet Traffic Archive.

Head-to-head

| | Go 1.26 | Python 3.11 | Ratio |
| --- | --- | --- | --- |
| Mean wall time | 3.91s | 13.01s | 3.3x |
| Std dev | 0.07s | 0.10s | |
| Lines/sec | ~485,000 | ~145,000 | 3.3x |
| MB/sec | 52.5 | 15.0 | 3.5x |
| Peak RSS | 723 MiB | 718 MiB | ~1x |

Ten iterations per implementation. Both processed identical workloads: 1,887,880 matched, 1 filtered, 3,834 malformed. The parity gate passed on all nine fields. Fairness controls were verified and the run was marked claimable: true.

The MB/sec ratio (3.5x) is slightly higher than the lines/sec ratio (3.3x) because Go's bufio.Scanner buffers input aggressively (its buffer grows up to 64 KiB) while the Python baseline uses default buffering. The difference is small but measurable at this scale.

What the numbers do not show

The 3.3x result measures end-to-end parsing of a specific format on a specific dataset on a specific machine. It does not generalize to:

  • Other log formats. JSON logs, for example, would shift the bottleneck from regex + timestamp parsing to JSON decoding, where Python's json module (backed by C) is competitive.
  • Smaller inputs. On the synthetic 5-line and 18-line corpora, the speedup was only ~1.9x because process startup dominated the measurement. The gap widens at scale because per-line cost dominates.
  • Optimized Python. A Python implementation using re2 bindings, compiled Cython parsing, or even mmap with manual field extraction would close the gap. The 3.3x reflects CPython 3.11's re.match + datetime.strptime on this workload, not an optimized Python implementation.

Environment

All numbers from one machine: Intel i5-12500T (12 logical cores), 64 GB RAM, Debian 12, Linux 6.17, Go 1.26, Python 3.11.2. Both pinned to CPU 0 via taskset -c 0.

The HTTP service

The Go version is not just a parser -- it is an HTTP service that accepts log file uploads and returns structured results. The service is a single binary with zero dependencies outside the Go standard library.

POST /v1/analyses accepts multipart uploads. The handler validates content type, checks format support, rejects unsafe filenames, and enforces a size limit. If the queue is full, it returns 429 with a Retry-After header. Submissions during the startup window get 503 from the /readyz gate.

Jobs go through a state machine: queued -> running -> succeeded | failed | expired. Idempotent submissions are supported via an Idempotency-Key header. The idempotency check uses a mutex around the reservation path to prevent concurrent same-key requests from creating duplicates.
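A minimal sketch of a mutex-guarded reservation path like the one described (names and shape are my assumptions, not the project's API):

```go
package main

import (
	"fmt"
	"sync"
)

// store holds the idempotency-key reservations behind one mutex.
type store struct {
	mu   sync.Mutex
	jobs map[string]string // idempotency key -> job ID
}

// reserve returns the existing job for key, or records a new one, atomically:
// two concurrent same-key requests cannot both create a job.
func (s *store) reserve(key, newID string) (id string, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.jobs[key]; ok {
		return id, false
	}
	s.jobs[key] = newID
	return newID, true
}

func main() {
	s := &store{jobs: map[string]string{}}
	id1, created1 := s.reserve("abc", "job-1")
	id2, created2 := s.reserve("abc", "job-2") // same key: no duplicate
	fmt.Println(id1, created1, id2, created2)  // job-1 true job-1 false
}
```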

The report surface at /reports/{id} renders self-contained HTML with inline SVG charts. No CDN, no external fonts, no JavaScript fetches. The page works offline.

The design constraint that matters most: the canonical summary struct is the single source of truth. The same struct flows into /v1/analyses/{id}/summary (JSON), /reports/{id} (HTML), and the benchmark parity checker. There is no code path where the API response can disagree with the browser report.

Limitations

Four cards summarizing study limitations: single host, combined format only, vanilla CPython baseline, and no persistence.
The results are grounded but bounded: one machine, one format, an unoptimized Python baseline, and an in-memory job store.
  1. Single host. All numbers come from one i5-12500T. Different CPUs, operating systems, and filesystems would give different numbers. ARM, for example, would change the relative cost of Go's time.Parse versus Python's strptime.

  2. Combined log format only. The parser handles Apache/Nginx combined format. JSON logs, Caddy format, or anything else would need a new parser function and a separate benchmark scenario.

  3. Vanilla CPython baseline. The 3.3x reflects CPython 3.11's standard library re and datetime.strptime. A Python rewrite using google-re2, Cython-compiled parsing, or even PyPy would produce a different ratio. This is a comparison against unoptimized Python, not against the best Python can do.

  4. No persistence. Jobs live in memory with a configurable retention period (PARSERGO_RETENTION, defaults to 24h). Restarting the service loses everything. This is adequate for a single-user tool, not for production deployment.

  5. In-process benchmark. The cross-language harness runs each implementation as a subprocess and collects getrusage metrics. It does not use hardware performance counters or isolated cgroup namespaces. The fairness controls (CPU pinning, cache posture, serialized execution) mitigate but do not eliminate environmental noise.

Reproducibility

The repo includes Go-native benchmarks that anyone can run. A 10,000-line NASA sample is committed at benchmark/corpora/nasa/nasa_10k.log. For the full dataset:

curl -o /tmp/NASA_access_log_Jul95.gz \
  ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
gunzip /tmp/NASA_access_log_Jul95.gz
mv /tmp/NASA_access_log_Jul95 /tmp/nasa_jul95
 
go test -bench=BenchmarkParse_NASAFull \
  -benchmem ./internal/analysis/

For the cross-language parity harness:

BENCH_BASELINE_PYTHON=/path/to/python \
BENCH_LEGACY_REPO=/path/to/web-log-parser \
go run ./cmd/bench run --scenario synthetic-small

All test suites pass with go test ./....

How this was built

I did not write this by hand. The entire project -- service, tests, benchmark harness -- was built using Factory mission mode. The mission system broke the work into five milestones (foundation, service slice, hardening, benchmark harness, publication), assigned worker sessions to each feature, and ran automated scrutiny reviews and user-testing validators after every implementation step.

Some things the scrutiny pass caught that I would have missed in manual review:

  • The job store's Update() was replacing the entire record, zeroing out CreatedAt on state transitions. Fix: fetch-then-mutate.
  • requests_per_sec was derived from wall-clock duration, making it nondeterministic across runs. Fix: derive from the timestamp span in the input data.
  • The /reports index iterated a Go map, producing random ordering. Fix: explicit sort.Slice by CreatedAt descending.
  • Non-file multipart fields had no size limit. Fix: io.LimitReader at 1 KiB.
  • The idempotency check was a non-atomic lookup-then-store. Fix: mutex around the reservation path.

These are not exotic bugs. They are the kind of thing that survives code review because each one looks fine in isolation. The mission system caught them because it tested against specific behavioral contracts after every feature, automatically.

Total output: 38 feature sessions across 5 milestones, 20 source files plus 13 test files, 8 passing test suites, and a working HTTP service with a benchmark harness. One orchestrated run.

Source

The full source is at github.com/sagaragas/parser-go. Apache-2.0 licensed. 20 source files, 13 test files, ~12,000 lines of Go, zero external dependencies.