2026-05-09 · hardware · verilog · tinytapeout · fpga · asic · sky130 · parsing

Building a byte-level parser on Tiny Tapeout: a design walkthrough

Tiny Tapeout lets you put a small piece of custom silicon onto a public SkyWater 130 nm shuttle for a few hundred dollars. Most projects on it are LED blinkers, simple counters, or visual-logic-block toys. This post walks through a more ambitious example: a JSON-grammar coprocessor that fits in two tiles, and the design decisions that made it fit.

The repo is at github.com/plawanrath/grammartile, which contains the source, tests, and a complete TT submission. This walkthrough focuses on three design patterns that generalize beyond JSON, plus the parts of the TT submission flow that aren't well documented.

What the chip does

Stream bytes in over SPI, get one bit out telling you whether the bytes form a valid JSON value. The grammar, which includes objects, arrays, strings, numbers, and the literals true/false/null, is baked into the silicon at synthesis time.
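That contract is easy to pin down with a host-side golden model. A sketch of the kind of oracle a cocotb test can compare the accept bit against (this uses Python's json module as the reference and is not code from the repo):

```python
import json

def golden_accept(data: bytes) -> bool:
    """Reference oracle: do these bytes form one valid JSON value?
    Mirrors what QUERY_ACCEPT should report after streaming the bytes.
    Note: Python's json is slightly laxer than RFC 8259 (it accepts NaN
    and Infinity), so a fully strict model would reject those too."""
    try:
        json.loads(data.decode("utf-8"))
        return True
    except (ValueError, UnicodeDecodeError):
        return False
```

A test can then stream each byte of a candidate via ADVANCE, read the accept bit, and assert it matches `golden_accept`.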

Internally:

  • A 64-bit one-hot register holds the NFA state.
  • A 16-deep × 4-bit pushdown stack tracks {} / [] nesting.
  • A 5-bit (29-class) character encoder maps incoming bytes to NFA-readable classes.
  • A mode-3 SPI slave decodes six commands: RESET, LOAD_STATE, ADVANCE, QUERY_ACCEPT, CHECKPOINT, RESTORE.

About 2,000 standard cells and ~270 flip-flops, fitting comfortably in two TT tiles (167 × 108 µm each).

Pattern 1: a bit-parallel Glushkov NFA

The classical "one-FF-per-state" hardware NFA encoding (Sidhu & Prasanna, FCCM 2001) holds the entire state in a single wide register. Each bit corresponds to one state, and multiple bits can be active at once, which captures NFA non-determinism directly in hardware. The transition function becomes an OR of "for each currently-active source state, set the destination bits implied by this byte."

Why this matters in a JSON parser: after seeing 1, the chip is both in S11_NUM_INTNZ ("we're parsing a multi-digit integer, more digits ok") and S63_ACCEPT ("a value just completed"). On the next byte:

  • If it's a comma, only S63 has a transition; S11 drops out.
  • If it's another digit, only S11 has a transition; S63 drops out as a source but gets re-asserted as a destination, because the longer number is also a complete value.

Both interpretations live in the same register; the next byte arbitrates which survives. No rollback, no backtracking, no second pass.
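A software model of the idea, shrunk to just the number/accept fragment above (the state names and two-class alphabet are simplifications; the real chip has 64 states and 29 classes):

```python
# States as bit positions in one parallel register; several can be live at once.
S_START, S_NUM, S_ACC = 1 << 0, 1 << 1, 1 << 2

# Transition table: (source state bit, char class) -> destination mask.
# Classes here are just 'd' (digit) and ',' (comma).
T = {
    (S_START, 'd'): S_NUM | S_ACC,  # a digit starts a number, which is already a value
    (S_NUM,   'd'): S_NUM | S_ACC,  # more digits extend it; still accepting
    (S_ACC,   ','): S_START,        # a completed value may be followed by a comma
}

def advance(state: int, cls: str) -> int:
    """One NFA step: OR the destinations of every currently-active source."""
    nxt = 0
    for (src, c), dst in T.items():
        if state & src and c == cls:
            nxt |= dst
    return nxt

s = advance(S_START, 'd')                 # after "1": both S_NUM and S_ACC active
assert s == S_NUM | S_ACC
assert advance(s, 'd') == S_NUM | S_ACC   # "12": S_ACC had no digit edge, but is re-asserted
assert advance(s, ',') == S_START         # "1,": S_NUM drops out, S_ACC's comma edge fires
```

The hardware version is this loop unrolled in space: every `(src, class)` pair becomes a few gates evaluated in parallel each cycle.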

In Verilog this looks like one always @* block with one if (state[X]) per state, each setting bits in a next_state register. The synthesizer collapses the whole thing into the right netlist. See src/nfa_engine.v.

When to use this pattern: any time you're putting a regular-language matcher (regex, lexer, simple grammar) onto silicon and the state count is ≤ your datapath width.

Pattern 2: character-class encoding is a real cost

The first version of this design used a 4-bit (16-class) character encoder. Plenty for structural punctuation and digits. It falls down on literals.

JSON's true, false, null are case-strict, so truE is invalid. That means the NFA's "after tru, expect e" transition has to fire on lowercase e only. But the number exponent in 1.5E10 accepts both e and E interchangeably. Same byte, different semantics depending on which state you're in.

The fix: split into two classes, EE_LO (lowercase only) and EE_UP (uppercase only). Plus a class per literal letter (t r u f a l s n b). Total: 29 classes, encoder moved to 5-bit.
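A sketch of the split in Python (the class IDs and names are illustrative; the real table lives in the RTL encoder):

```python
# Illustrative class IDs; the real encoder emits 5-bit codes for 29 classes.
EE_LO, EE_UP, DIGIT, OTHER = 0, 1, 2, 3

def char_class(byte: int) -> int:
    """Map a byte to a character class. 'e' and 'E' get distinct classes:
    the literal path (tru -> e) fires only on EE_LO, while the exponent
    path (1.5e10 / 1.5E10) fires on either EE_LO or EE_UP."""
    ch = chr(byte)
    if ch == "e":
        return EE_LO
    if ch == "E":
        return EE_UP
    if ch.isdigit():
        return DIGIT
    return OTHER
```

With a single shared e/E class, the two paths would be indistinguishable at the NFA's input, which is exactly how truE slipped through the 16-class version.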

Cost: roughly 50 extra gates in the encoder. Benefit: the chip rejects truE correctly instead of pushing the validation back to the host. The test that proves this is test_literal_case_strict in test/test.py.

The lesson: character-class encoders look like bookkeeping until you realize the whole NFA's correctness depends on them. Spend an hour on the class table before you write a single transition.

Pattern 3: shadow registers for snapshot/restore

The chip exposes two SPI commands that don't drive parsing directly: CHECKPOINT and RESTORE. They optimize for one workflow: testing many candidate byte sequences against the same starting state.

The naive way: LOAD_STATE (17 bytes over SPI: 8 bytes NFA + 1 byte stack pointer + 8 bytes stack) before each candidate. The fast way: a shadow register file on chip, and two single-byte SPI commands to copy active → shadow (CHECKPOINT) and shadow → active (RESTORE). 17 bytes of SPI traffic per candidate amortizes down to 1.
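The amortization is easy to quantify, counting state-transfer overhead only (the candidate bytes themselves cost the same either way):

```python
def spi_overhead_naive(candidates: int) -> int:
    # LOAD_STATE resends the full 17-byte state before every candidate.
    return 17 * candidates

def spi_overhead_shadow(candidates: int) -> int:
    # CHECKPOINT once (1 byte), then one 1-byte RESTORE per candidate.
    return 1 + candidates
```

For 1,000 candidates that is 17,000 bytes of overhead versus 1,001.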

Hardware cost: one duplicate of the entire state register file, totaling 64 NFA bits + 64 stack bits + a 5-bit pointer (about 130 extra flip-flops). On Sky130 that's roughly 10% of one tile.

RESTORE also clears the sticky error latch, so a candidate that fails (e.g., one that pushed into an unbalanced bracket) doesn't poison the next candidate's evaluation. One line of Verilog, much nicer host protocol.
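A behavioral model of the active/shadow pair, including the error-latch clear (field names are illustrative; see the RTL for the real register file):

```python
from copy import deepcopy

class StateFile:
    """Active/shadow state pair: 64 NFA bits, a 16 x 4-bit stack, a 5-bit
    stack pointer, and a sticky error latch on the active side."""
    def __init__(self):
        self.active = {"nfa": 1, "stack": [0] * 16, "sp": 0, "err": False}
        self.shadow = None

    def checkpoint(self):
        # CHECKPOINT: copy active -> shadow (one SPI byte on the wire).
        self.shadow = deepcopy(self.active)

    def restore(self):
        # RESTORE: copy shadow -> active, and clear the sticky error latch
        # so a failed candidate can't poison the next one.
        self.active = deepcopy(self.shadow)
        self.active["err"] = False
```

A failed candidate then costs one byte to undo: checkpoint once, mutate freely, restore.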

When to use this pattern: any time the host needs to evaluate many speculative inputs against a saved state. Shadow registers are nearly free in flip-flops compared to round-tripping state through I/O.

Submitting to Tiny Tapeout: what the docs don't cover

The TT documentation is good for the easy path (clone the template, add Verilog, push). It's quiet on the failure modes. Several show up reliably for any non-template project.

1. Verilator runs with warnings-as-errors. The TT GDS action invokes verilator -Wall inside its harden pipeline. Six benign warnings (unused upper bits of a state vector, plus unused class-ID parameters kept to document the encoder's class table) were enough to fail the GDS step. Remediation: drain unused signals into a single sink (wire _unused = ena | (|some_unused_bus);) and wrap doc-only parameters in /* verilator lint_off UNUSEDPARAM */.

2. src/config.json is mandatory and not autogenerated. It's the LibreLane configuration: clock period, placement density, layout overrides. The canonical one lives in the TT verilog template. Without it, the action's --create-user-config step crashes with Could not file configuration file src/config.{json} (sic).

3. The viewer job needs GitHub Pages set to "GitHub Actions", not "Deploy from a branch". On a private repo it 404s silently. The trap: TT's submission checker reads workflow-level status, not job-level, so a red viewer blocks the submission even when the gds job succeeded and tt_submission exists. Make the repo public, set Pages source to "GitHub Actions".

4. "Re-run failed jobs" causes duplicate-artifact errors. If only viewer is red, GitHub's "Re-run failed jobs" button reruns it in isolation, and the new github-pages artifact collides with the leftover from the previous run. Use "Re-run all jobs" or push an empty commit (git commit --allow-empty -m "ci: trigger fresh run").

Scope of this design

Worth being explicit about what V1 doesn't do, because the TT tile budget forces tradeoffs:

  • No \uXXXX precise hex validation. Any 4 bytes after \u are accepted. Adding precise hex would need 4 more states and a 2-bit counter.
  • No runtime-loadable grammar. The transition matrix is baked into the netlist. A V2 with a writable transition RAM is +2–4 tiles.
  • No nesting past 16. The stack is 16 entries × 4 bits.
  • One grammar per chip. One tapeout, one grammar.

These are scoping decisions, not bugs. Each is straightforward to lift in a follow-up; you just pay for it in tiles.
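As a sense of how small these lifts are in behavioral terms, the hex check V1 skips amounts to this (a sketch, not repo code):

```python
import string

def strict_hex4(chunk: bytes) -> bool:
    """What precise \\uXXXX validation would enforce: exactly four hex
    digits after the escape. In hardware this is roughly four extra NFA
    states gated by a 2-bit counter; V1 accepts any 4 bytes instead."""
    return len(chunk) == 4 and all(chr(b) in string.hexdigits for b in chunk)
```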

Repo

The full Verilog source and the cocotb test suite (26 tests, passing under cocotb 2.0 + Icarus 13): github.com/plawanrath/grammartile.
