OpenAI Parameter Golf Research Sprint

Project information

  • Category: ML Compression, Systems Engineering, Research Tooling
  • Role: Solo researcher and engineer
  • Project date: March 2026
  • Stack: Python, PyTorch, PowerShell, WPF, Runpod, RTX 3090, H100
  • Repository: GitHub project link

What this project was

OpenAI's Parameter Golf challenge was a model-compression contest: fit a language model into a 16,000,000-byte artifact, keep record-track training under 10 minutes on 8x H100s, and optimize the final tokenizer-agnostic bits-per-byte score after the required quantized roundtrip.

I treated this like a solo research sprint. That meant building the experimentation loop, ranking ideas honestly, killing dead ends quickly, and then validating the strongest branch on rented H100 hardware. The full code and tooling used for the project live in the public repository linked above.

Challenge brief

The hard part of this challenge was that model quality was not enough on its own. The actual objective was the post-roundtrip artifact: train a model that still scores well after int8 export and zlib compression, while staying under the size cap and within the wallclock budget.

  • Artifact cap: The final submission had to stay under 16,000,000 bytes.
  • Compute cap: Record-track runs had to finish inside a 10-minute training budget on 8x H100s.
  • Metric: Lower bits-per-byte on FineWeb validation was better.
  • Trap: Great pre-quant numbers could still fail once the model was actually exported.
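These constraints can be made concrete in a few lines. The sketch below is my own framing, not the contest's official harness; the function names and the assumption that the scored artifact is the zlib-compressed int8 export are taken from the brief as described above.

```python
import math
import zlib

ARTIFACT_CAP_BYTES = 16_000_000  # hard size limit from the brief

def bits_per_byte(total_nll_nats: float, n_text_bytes: int) -> float:
    """Tokenizer-agnostic score: total negative log-likelihood over the
    validation text, converted from nats to bits and normalized by the
    raw byte length of that text (lower is better)."""
    return total_nll_nats / (math.log(2) * n_text_bytes)

def artifact_is_legal(exported_weights: bytes) -> bool:
    """The cap applies to the compressed export, not the in-memory
    model, so a model can look small and still blow the budget."""
    return len(zlib.compress(exported_weights, level=9)) < ARTIFACT_CAP_BYTES
```

Normalizing by raw bytes rather than tokens is what makes the metric tokenizer-agnostic: a bigger vocabulary cannot game the denominator.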

What I built

This was not just model tweaking. I ended up building a small research platform around the challenge so I could move faster without lying to myself.

  • Local experimentation harness: PowerShell launchers for fast sweeps, smoke tests, fixed-step runs, and remote pullback.
  • Training Run Monitor: A WPF desktop monitor that parsed logs, showed run phases, surfaced ETAs, and made long local runs easier to manage.
  • Export analysis tools: Scripts for tensor sensitivity, MLP permutation experiments, and targeted residual allocation.
  • Remote workflow: Runpod launch, sync, logging, and mandatory local artifact recovery after paid runs.

Training Run Monitor showing phase progression with ETA.
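The export analysis tools are easiest to explain with a sketch. Targeted residual allocation spends whatever byte budget remains under the cap on the tensors the int8 roundtrip damages most. The helper names and the greedy heuristic below are my illustration of the idea, not the exact scripts from the repository:

```python
def int8_mse(weights):
    """Mean squared error a symmetric int8 roundtrip adds to one tensor."""
    scale = (max(abs(w) for w in weights) or 1e-12) / 127.0
    err = 0.0
    for w in weights:
        q = max(-127, min(127, round(w / scale)))
        err += (q * scale - w) ** 2
    return err / len(weights)

def allocate_residuals(named_tensors, budget_bytes, bytes_per_value=2):
    """Greedy targeted residual allocation: rank tensors by roundtrip
    damage, then keep higher-precision residuals for the worst ones
    while the leftover byte budget lasts."""
    ranked = sorted(named_tensors.items(),
                    key=lambda kv: int8_mse(kv[1]), reverse=True)
    chosen, spent = [], 0
    for name, weights in ranked:
        cost = len(weights) * bytes_per_value
        if spent + cost <= budget_bytes:
            chosen.append(name)
            spent += cost
    return chosen
```

The greedy pass is deliberately simple: sensitivity ranking does the heavy lifting, and the budget check keeps every allocation decision cap-aware.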

Iterative process

Stage 1: build a trustworthy local ruler. Early 3090 runs were noisy and easy to misread, so I shifted from a loose wallclock loop to a fixed-step exact-roundtrip evaluation path. That single change improved the research quality more than any one architecture tweak, because it stopped me from ranking ideas on noise.
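The exact-roundtrip idea is simple: every candidate is scored on the weights that survive the export path, never on its fp32 training weights. A minimal sketch of a symmetric per-tensor int8 roundtrip, in pure Python for clarity (the real path runs on PyTorch tensors, and the precise quantization scheme here is my assumption):

```python
def int8_roundtrip(weights):
    """Symmetric per-tensor int8 quantize/dequantize: scale so the
    largest magnitude maps to 127, round to integers, then map back.
    Evaluating after this step scores what a judge would actually see."""
    scale = (max(abs(w) for w in weights) or 1e-12) / 127.0
    def quantize(w):
        return max(-127, min(127, round(w / scale)))
    return [quantize(w) * scale for w in weights]
```

Ranking candidates on post-roundtrip loss at a fixed step count is what removes the trap from the brief: a model with great pre-quant numbers that degrades badly after export simply never tops the local leaderboard.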

Stage 2: pressure test ideas instead of falling in love with them. I tried compression-aware training, sidecar eval ideas, shared-block recurrence, sparse attention, ternary shaping, residual-budget tuning, export-side heuristics, and tokenizer changes. Most of the flashy ideas lost once they were measured cleanly.

Stage 3: spend bytes where they actually mattered. The local story changed once I stopped treating a tiny under-cap model as the frontier. Near-cap dense models plus better export discipline mattered more than many of the clever micro-tricks.

Stage 4: validate the branch remotely. I used H100 runs to separate local research signals from ideas that genuinely held up on the published full-data track.

Key findings

  • Compression-aware training was a real first-order win locally, but it did not transfer cleanly to the smaller legal full-data H100 branch.
  • SP4096 plus near-cap dense scaling was the strongest local research direction, reaching a fixed-step local leader of 1.89258040 on the retokenized subset.
  • Export-side targeted residual allocation transferred better than expected. On full-data 1x H100, it slightly beat the plain control while staying under cap.
  • The bottleneck was not always raw model quality. An 8x H100 run hit 1.19816494, but the artifact exploded to 30.9 MB, making it invalid despite the strong score.
  • Process discipline matters in paid-cloud research. After losing a remote run to low credits, I made local pullback of logs and artifacts a non-negotiable rule.

Research Milestones

Lower bits-per-byte is better. Artifact size must stay under the 16,000,000 byte cap.

  • Best Local Subset Research Leader: SP4096 Near-Cap Dense. Fixed-step 3090 research winner used to rank ideas before remote validation. BPB 1.89258040, artifact 15,906,874 bytes.
  • Best 8x H100 Score (invalid size): SP1024 14x576 Compression/Grid. Strongest recovered score overall, but it failed the artifact cap badly. BPB 1.19816494, artifact 30,904,580 bytes.

Results snapshot

These numbers mattered because they showed three different things: a strong local research branch, a legal remote control, and a high-quality oversized run that exposed the next bottleneck.

  • Best local subset result: 1.89258040 bpb on a near-cap SP4096 branch, useful as a research signal but not directly submission-ready.
  • Best legal full-data 1x H100 result: 1.31661720 bpb at 14,912,837 bytes on SP1024 9x512 with targeted residual export.
  • Best recovered 8x H100 score: 1.19816494 bpb, but invalid because the artifact landed at 30,904,580 bytes.

Branch Leaderboard

Best measured result from each major branch I explored. Lower bits-per-byte is better.

  1. SP4096 Near-Cap Dense + Targeted Residual (1.8926): best local research branch on the fixed-step trusted track.
  2. SP4096 Local Iso-Byte Dense (1.8933): tokenization pivot that made the biggest local frontier jump.
  3. Compression-Aware SP1024 Local Winner (2.0609): best earlier local branch before the tokenizer and near-cap pivots.
  4. Sidecar Eval Branch (2.0613): interesting early near-win that became unstable under cleaner measurement.
  5. Sparse Attention Probe (2.0700): beat the plain matched baseline, but not the dense compression-aware branch.
  6. Recurrent / Shared-Block Sweep (2.2545): most exciting architectural idea on paper, but clearly not competitive in this tested regime.

Artifact Size vs Final BPB

This is the core tension of the challenge: score quality mattered, but only if the artifact stayed legal.

Chart: artifact size (x-axis, 0 to 32 MB) against final BPB (y-axis, lower is better), with a red line marking the 16 MB cap. Plotted points: the local SP4096 winner, the legal 1x H100 run, and the oversized 8x H100 run.

Why this project matters in a portfolio

This case study shows the kind of work I enjoy most: ambiguous technical constraints, incomplete information, a lot of dead-end space, and the need to build systems around the problem rather than only attack the most obvious surface-level fix.

It combined research thinking, engineering discipline, automation, metrics literacy, and a willingness to throw away ideas that did not survive clean measurement. Even without a final challenge submission, it produced a real body of evidence, tooling, and conclusions that could immediately guide a better-funded next round.

What I would do next

With a larger compute budget, the next move would be to keep the legal full-data SP1024 9x512 branch as the remote control, then reintroduce only the export-side changes that have actually transferred cleanly. The goal would be to convert the oversized 8x H100 quality signal into a cap-legal submission path, instead of starting another wide search from scratch.

Embedded Research Log

This is the working research tracker used during the challenge sprint, rendered directly from the markdown source.
