Cookpit v3.2 — Canonical Active-Number-Sequence Normalisation

This document defines the canonical normalisation that rules.md K3 and K4 reference as cookpit-active-number-sequence-v3.2.0. It publishes the exact tokenisation rules that turn a source recipe's text into the dash-separated active-number sequence stored at cookpit.quantitativeFingerprint.sequence.

Implementations MUST follow this profile exactly. Two independent implementations that apply this profile to the same source recipe MUST produce byte-identical sequence strings and therefore identical SHA-256 hashes. The validator's V-FINGERPRINT-B hard criterion checks the sequence by recomputing it from the source text and comparing the result with the stored value.

Stage scoping. This is the stage-1 source fingerprint. The AI Chef populates it at generation; the validator re-checks it at stage 2 as the source-faithfulness gate. It is computed from the source recipe, not from the cooking file. It is distinct from the file fingerprint at cookpit.attestation.fileFingerprint, which is the stage-3 file-content hash issued by the validator at attestation time and verified by stage-4 consumers. See rules.md A0.6 for the lifecycle separation between the two fingerprints.

1. What the fingerprint is

The strict quantitative fingerprint records every cooking-relevant number stated by the source recipe, in the order the source states it. The fingerprint exists so that:

Two cooking files claiming to be derived from the same source produce the same fingerprint (a determinism check).
A cooking file's source can be audited from its fingerprint (a content check).
Library indexing can group recipes by source numeric skeleton regardless of tone, format, or chef-detective rewrite.

The fingerprint is NOT a hash of the file's planned tasks; it is a hash of the SOURCE recipe's stated numbers. The chef-detective's deductive expansion does not change the fingerprint.

2. The pipeline

source PDF / text
  │
  │  STAGE A — extraction
  ▼
plain UTF-8 text
  │
  │  STAGE B — segmentation
  ▼
ordered list of source segments (header / ingredient block / method / tips)
  │
  │  STAGE C — filtering
  ▼
in-scope segments only (header excluded; tips excluded; chrome filtered)
  │
  │  STAGE D — number tokenisation
  ▼
ordered list of active numbers
  │
  │  STAGE E — sequence rendering
  ▼
dash-joined string `<n1>-<n2>-…-<nk>`
  │
  │  STAGE F — hashing
  ▼
SHA-256 hex digest

Each stage is mechanically defined below.

3. STAGE A — text extraction

Implementations MAY extract source text from PDF, web page, plain text, or other carriers. Whatever the carrier, the extraction MUST produce a single UTF-8 string with the following normalisations:

Ligatures expanded. PDF ligatures ﬁ, ﬂ, ﬀ, ﬃ, ﬄ, ﬅ, ﬆ (U+FB00..U+FB06) are expanded to their component letters (fi, fl, ff, ffi, ffl, st, st).
Curly quotes folded. ‘ (U+2018), ’ (U+2019) → ' (U+0027); “ (U+201C), ” (U+201D) → " (U+0022). Source apostrophes become straight quotes for tokenisation; the original may be preserved in recipeInstructions[] schema.org pass-through.
Soft hyphens removed. U+00AD soft hyphens are stripped; the text on either side is concatenated.
Whitespace normalised. All Unicode whitespace classes collapse to single ASCII space (U+0020); leading/trailing whitespace per segment trimmed.
Diacritics preserved. Characters with diacritics (pâté, Gruyère, purée) are NOT stripped. Only the active-number tokeniser ignores letters; diacritics in identifying text matter for stage B segmentation.

The extraction is a one-shot operation per source. Re-extraction must produce byte-identical results given the same input.
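The normalisations above can be sketched as a single pass over one piece of extracted text. This is a minimal illustration rather than the reference implementation: the function name `normalise_extracted_text` is hypothetical, and the sketch treats its argument as a single segment, leaving it to the caller to decide whether that is a line or a whole segment. It avoids Unicode NFKC normalisation on purpose, because NFKC would also rewrite fraction characters such as ½ that stage D needs intact.

```python
# Translation table for the character-level stage A rules.
_CHAR_MAP = str.maketrans({
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
    "\ufb05": "st", "\ufb06": "st",   # long-s-t and s-t ligatures
    "\u2018": "'", "\u2019": "'",     # curly single quotes
    "\u201c": '"', "\u201d": '"',     # curly double quotes
    "\u00ad": None,                   # soft hyphen: deleted
})


def normalise_extracted_text(text: str) -> str:
    """Apply the stage A normalisations to one piece of extracted text."""
    text = text.translate(_CHAR_MAP)
    # str.split() with no argument splits on runs of any Unicode whitespace,
    # so re-joining with a single space both collapses and trims.
    # Diacritics are left untouched.
    return " ".join(text.split())
```

Because the function is a pure mapping of its input, running it twice on the same extracted text gives the same string, which is what the byte-identical re-extraction requirement needs from this step.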

4. STAGE B — segmentation

The extracted text is partitioned into ordered segments of four kinds:

Segment kind	Definition
`header`	Title block plus author, prep/cook/total/serve/dietary metadata
`ingredient-block`	The recipe's bulleted or otherwise enumerated ingredient list
`method-block`	The numbered or otherwise sequenced cooking instructions
`tips-block`	Any "Recipe tips", "Notes", "Variations" section after the method

Heuristics for segmentation:

The header is everything from the start of the source to (but not including) the first ingredient line. Ingredient lines are identified by patterns like <quantity><unit> of <noun> or <quantity> <noun> matching the source's bulleted form.
The ingredient-block ends at the first method-numbering pattern (Method, 1., Step 1, etc.) or at a bold/heading divider.
The method-block ends at the first tips-style heading (Recipe tips, Notes, Variations, Tips) or at the end of the text.
The tips-block includes everything after that heading until the end of the text.

Source-content categorisation (filtered in stage C) operates on segments individually, so segmentation accuracy matters.
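The heuristics above amount to a small state machine over the extracted lines. The sketch below is illustrative only: the three regular expressions and the function name `segment` are assumptions made for this example, since the profile describes the patterns informally, and the refinements in §10.1 and §10.2 are not applied here.

```python
import re

# Hypothetical patterns standing in for the informal descriptions in §4.
INGREDIENT_LINE = re.compile(r"^\d+\s*[A-Za-z]")
METHOD_MARKER = re.compile(r"^(?:Method\b|Step\s+\d+|\d+\.\s)")
TIPS_HEADING = re.compile(r"^(?:Recipe tips|Notes|Variations|Tips)\b", re.IGNORECASE)


def segment(lines):
    """Partition lines into the four segment kinds, in source order."""
    segments = {
        "header": [],
        "ingredient-block": [],
        "method-block": [],
        "tips-block": [],
    }
    state = "header"
    for line in lines:
        # Each boundary can only be crossed once, in order.
        if state == "header" and INGREDIENT_LINE.match(line):
            state = "ingredient-block"
        elif state == "ingredient-block" and METHOD_MARKER.match(line):
            state = "method-block"
        elif state == "method-block" and TIPS_HEADING.match(line):
            state = "tips-block"
        segments[state].append(line)
    return segments
```

In this sketch the boundary line itself (for example the Method heading) is stored in the segment it opens; an implementation could equally drop it, since headings carry no active numbers.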

5. STAGE C — filtering

Only in-scope segments contribute to the active-number sequence:

Segment kind	In scope?
`header`	NO — yield, prep time and cook time are metadata, not active cooking
`ingredient-block`	YES
`method-block`	YES
`tips-block`	NO — tips are out-of-band guidance, not active cooking

Within in-scope segments, additional filters:

Sponsored content is filtered. Lines that read as advertising (e.g. BECOMEAMEMBER, Try our app, Finish Ultimate Plus, product placement) are removed before tokenisation. Implementations SHOULD maintain a configurable blocklist of brand-name patterns; the default list at v3.2.0 publication is:
```
BECOMEAMEMBER
Try our app
You have <N> remaining read(s) today
Finish Ultimate Plus
```
Future revisions may add patterns; the version identifier cookpit-active-number-sequence-v3.2.0 freezes the v3.2.0 list above.
Paywall chrome is filtered. Lines like You have two remaining reads today (and similar paywall markers from web-extracted PDFs) are removed.
Source typos in numeric content are NOT silently corrected. If the source says 1000ml of pork stock twice in the ingredient block, both 1000 instances enter the sequence. The chef-detective corrects typos in tasks[].action text but the fingerprint sees the source's numbers as written.
Source typos in non-numeric content (carbonara's "plump garlic", chow mein's "in the work") have no effect on the sequence — they are not numbers.
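A line filter for the default v3.2.0 list might look like the following. This is a sketch under assumptions: the names `DEFAULT_CHROME_PATTERNS` and `filter_chrome` are hypothetical, and reading `<N>` as "any single word or number" is an interpretation made for this example.

```python
import re

# The default v3.2.0 list, written as regular expressions.
# Assumption: <N> may be a digit or a word such as "two".
DEFAULT_CHROME_PATTERNS = [
    re.compile(r"BECOMEAMEMBER"),
    re.compile(r"Try our app"),
    re.compile(r"You have \S+ remaining reads? today"),
    re.compile(r"Finish Ultimate Plus"),
]


def filter_chrome(lines, patterns=DEFAULT_CHROME_PATTERNS):
    """Drop sponsored and paywall lines; keep everything else verbatim."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in patterns)
    ]
```

The filter removes whole lines and never edits the lines it keeps, so a quantity that the source repeats still reaches stage D as many times as it was written.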

6. STAGE D — number tokenisation

Within filtered in-scope text, the tokeniser walks left-to-right and emits an active-number token for each match of one of the following patterns. Patterns are matched in priority order (top to bottom); a match consumes its source span and the walk resumes immediately after it.

6.1 Patterns (priority order)

#	Pattern	Tokenisation	Example
1	`<int>g/<int>lb <int>oz`	three tokens: g-int, lb-int, oz-int
2	`<int>(\.<int>)?kg/<int>lb <int>oz`	kg whole + decimal digits (if present), then lb-int, oz-int	`1.6kg/3lb 8oz` → `1`, `6`, `3`, `8`
3	`<int>(\.<int>)?(kg|g|ml|l|cl)/<int><frac>?(oz|fl oz|pint|in|cm)`	metric whole + fractional digits, then imperial whole + fractional digits	`40g/1½oz` → `40`, `1`, `5`
4	`<int><frac>` (whole + Unicode fraction)	two tokens: whole, fraction-digit	`1½` → `1`, `5`
5	`<int>/<int>` (literal fraction)	two tokens: numerator, denominator	`1/4` → `1`, `4`
6	`<int>(\.<int>)`	two tokens: whole, decimal-fraction-digits	`1.5` → `1`, `5`
7	`<int>-<int>` (range)	two tokens: range-low, range-high	`2-3 minutes` → `2`, `3`
8	`<int>` (bare integer in cooking context)	one token: int	`5 minutes` → `5`
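The priority walk can be illustrated with a reduced tokeniser that covers patterns #4 to #8 only. This is a sketch, not the published implementation: the names `PATTERNS` and `tokenise` are hypothetical, the fraction map holds just three symbols, the compound metric/imperial patterns #1 to #3 are omitted, and no cooking-context check is applied to bare integers.

```python
import re

# Subset of the Unicode fraction table, enough for the illustration.
FRACTION_DIGITS = {"¼": "25", "½": "5", "¾": "75"}
_FRAC = "[" + "".join(FRACTION_DIGITS) + "]"


def _both_groups(match):
    return list(match.groups())


# (compiled pattern, token emitter), in priority order.
PATTERNS = [
    (re.compile(rf"(\d+)({_FRAC})"),                      # 4: whole + fraction
     lambda m: [m.group(1), FRACTION_DIGITS[m.group(2)]]),
    (re.compile(r"(\d+)/(\d+)"), _both_groups),           # 5: literal fraction
    (re.compile(r"(\d+)\.(\d+)"), _both_groups),          # 6: decimal
    (re.compile(r"(\d+)-(\d+)"), _both_groups),           # 7: range
    (re.compile(r"(\d+)"), _both_groups),                 # 8: bare integer
]


def tokenise(text):
    """Walk left to right, trying patterns in priority order at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        for pattern, emit in PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.extend(emit(match))
                pos = match.end()   # the match consumes its span
                break
        else:
            pos += 1                # no pattern starts here; move on
    return tokens
```

Because the bare-integer pattern always matches at a digit, the walk never resumes in the middle of a number; the higher-priority patterns simply get the first chance to claim a longer span.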

6.2 Unicode fraction normalisation

Symbol	Tokenises as
`¼`	`25`
`½`	`5`
`¾`	`75`
`⅓`	`33`
`⅔`	`66`
`⅕`	`2`
`⅖`	`4`
`⅗`	`6`
`⅘`	`8`
`⅙`	`16`
`⅚`	`83`
`⅛`	`125`
`⅜`	`375`
`⅝`	`625`
`⅞`	`875`

Rule: take the decimal expansion of the fraction, drop the leading 0., and emit the remaining digits with trailing zeros removed. Repeating expansions are truncated, not rounded, to two digits (⅔ → 66, ⅙ → 16). The table above is exhaustive for fractions appearing in the source corpus; fractions outside this table are tokenised as <numerator>, <denominator> per pattern #5.
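The table can be reproduced mechanically, which is a useful cross-check on a hand-typed lookup. The sketch below is an illustration under assumptions: `fraction_digits` is a hypothetical helper, it truncates repeating expansions to two digits as the table entries for ⅓, ⅔, ⅙ and ⅚ imply, and it is only meant for the fifteen symbols listed above.

```python
import unicodedata
from fractions import Fraction


def fraction_digits(symbol: str) -> str:
    """Return the digits after '0.' for a Unicode vulgar fraction."""
    # unicodedata.numeric gives a float; recover the exact small fraction.
    frac = Fraction(unicodedata.numeric(symbol)).limit_denominator(10)
    # Terminating expansions: find the shortest exact digit string.
    for places in range(1, 4):
        scaled = frac * 10 ** places
        if scaled.denominator == 1:
            return str(scaled.numerator)
    # Repeating expansions: truncate (do not round) to two digits.
    return str(frac.numerator * 100 // frac.denominator)
```

For example, `fraction_digits("⅜")` yields `"375"` and `fraction_digits("⅚")` yields `"83"`, matching the table rows.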

6.3 Cooking context

A bare integer (pattern #8) is tokenised as an active number IF and ONLY IF it sits within a cooking context. A cooking context is any of:

An ingredient line (in ingredient-block).
A method sentence in method-block.
A duration phrase (<n> minutes, <n> hours, <n> seconds).
A temperature phrase (<n>°C, <n>C, <n>°F, <n>F, gas mark <n>).
A multiplicity phrase (<n> batches, <n> times, <n>×, <n> sides).
A fraction-of phrase (a third, half the, quarter of, three-quarters).
A size phrase (<n>cm, <n>in).

Numbers in non-cooking context (e.g. 4.9 ratings, 27 ratings, Page 1 of 3, ISBN/SKU references) are not tokenised. Most page metadata appears in header or after the method, both already filtered in stage C.

6.4 Yield handling

Source's Serves N, Serves N-M, Makes N patterns are in the header segment and therefore filtered in stage C. Yield numbers do NOT enter the active-number sequence.

6.5 Source typo handling for numbers

Per stage C rule 3: source numeric typos are preserved verbatim. If the source has 100g/3½oz in the ingredient block AND 100g pancetta in the method (carbonara pattern), both 100s are tokenised: the fingerprint records every occurrence of a number in the source, repeats included.

7. STAGE E — sequence rendering

The ordered list of tokens from stage D is rendered as a dash-separated string:

sequence = "<token1>-<token2>-…-<tokenN>"

Tokens are decimal integer strings, no leading zeros (except for literal zero, which is not expected to appear in the corpus — fractions handle the leading-zero cases).

The sequence string MUST match the regex ^[0-9]+(-[0-9]+)*$ (rule K4 / V-FINGERPRINT-A schema check).
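Rendering and the shape check fit in a few lines. This is a minimal sketch; `render_sequence` is a hypothetical name, and raising `ValueError` on a malformed sequence is a choice made for the example rather than behaviour the profile prescribes.

```python
import re

SEQUENCE_RE = re.compile(r"[0-9]+(-[0-9]+)*")


def render_sequence(tokens):
    """Join stage D tokens with dashes and check the result's shape."""
    sequence = "-".join(tokens)
    # fullmatch anchors at both ends, like ^...$ in the published regex.
    if not SEQUENCE_RE.fullmatch(sequence):
        raise ValueError(f"not a valid active-number sequence: {sequence!r}")
    return sequence
```

An empty token list renders as the empty string, which fails the shape check, so a source with no active numbers surfaces as an error in this sketch instead of producing an empty fingerprint.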

8. STAGE F — hashing

The sequence string is encoded as UTF-8 (no BOM, no trailing newline) and hashed with SHA-256 (FIPS 180-4). The 64-character lowercase hex digest is the value of cookpit.quantitativeFingerprint.hash.value.
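With Python's standard library the hashing step is a one-liner. The function name `fingerprint_hash` is hypothetical; the calls themselves are plain `hashlib`.

```python
import hashlib


def fingerprint_hash(sequence: str) -> str:
    """SHA-256 of the sequence string, as 64 lowercase hex characters."""
    # str.encode("utf-8") emits no BOM, and nothing is appended to the
    # string, so there is no trailing newline in the hashed bytes.
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()
```

A common way to get a different digest by accident is to hash a string read from a file or shell pipeline that still carries its trailing newline; stripping it before hashing avoids the mismatch.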

9. Worked example: `recipes/spaghetti_carbonara_recipe.pdf`

Source: Angela Nilsen, BBC Good Food, "Ultimate spaghetti carbonara".

Stage A — extraction

After ligature/quote/whitespace normalisation, the in-scope source text contains the ingredient block:

100g pancetta
50g pecorino cheese
50g parmesan
3 large eggs
350g spaghetti
2 plump garlic cloves, peeled and left whole
50g unsalted butter
sea salt and freshly ground black pepper

and the method block (Steps 1-12, abridged here for the parts that contribute numbers):

Step 2: Finely chop the 100g pancetta ... Finely grate 50g pecorino
        cheese and 50g parmesan ...
Step 3: Beat the 3 large eggs ...
Step 4: Add 1 tsp salt to the boiling water, add 350g spaghetti and
        when the water comes back to the boil, cook at a constant
        simmer, covered, for 10 minutes or until al dente ...
Step 5: Squash 2 peeled plump garlic cloves ...
Step 6: ... Drop 50g unsalted butter into a large frying pan or wok ...
Step 7: Leave to cook on a medium heat for about 5 minutes ...

Stage B — segmentation

Segment kind	Lines
header	"Ultimate spaghetti carbonara recipe / Angela Nilsen / Serves 4 Easy / Prep: 15 mins - 20 mins Cook: 15 mins / Discover how to make traditional…"
ingredient-block	the 8 ingredient lines above
method-block	Steps 1-12
tips-block	(none in this source)

Stage C — filtering

Header and tips removed. In-scope = ingredient-block + method-block.

Stage D — tokenisation

Source span	Pattern	Tokens
`100g pancetta` (ingredient line)	#8	`100`
`50g pecorino cheese` (ingredient line)	#8	`50`
`50g parmesan` (ingredient line)	#8	`50`
`3 large eggs` (ingredient line)	#8	`3`
`350g spaghetti` (ingredient line)	#8	`350`
`2 plump garlic cloves` (ingredient line)	#8	`2`
`50g unsalted butter` (ingredient line)	#8	`50`
(sea salt and pepper: no numbers)	-	-
`100g pancetta` (Step 2 method)	#8	`100`
`50g pecorino` (Step 2 method)	#8	`50`
`50g parmesan` (Step 2 method)	#8	`50`
`3 large eggs` (Step 3 method)	#8	`3`
`1 tsp salt` (Step 4 method)	#8	`1`
`350g spaghetti` (Step 4 method)	#8	`350`
`10 minutes` (Step 4 method)	#8	`10`
`2 peeled plump garlic cloves` (Step 5)	#8	`2`
`50g unsalted butter` (Step 6 method)	#8	`50`
`5 minutes` (Step 7 method)	#8	`5`

Stage E — sequence rendering

100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5

17 tokens, matching the regex ^[0-9]+(-[0-9]+)*$.

Stage F — hashing

sha256("100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5") =
5d1ce74bfc00489e677ab7e321a818eade01ea64fe46d0aaf67d506f7d1ceda8

This matches the value at cookpit.quantitativeFingerprint.hash.value in the published spaghetti_carbonara.v3.2.cpt.A.jsonld example at https://cookchow.com/recipes/3.2/.

10. Worked clarifications surfaced by productisation

The following clarifications were surfaced when the executable implementation in scripts/lib/source_tokeniser.py was first run against the corpus. Each is a spec tightening that resolves an ambiguity in §3–§6 without changing the published intent. The §9 worked example and the §11 self-test corpus remain canonical; this section documents the edge cases the productised tokeniser must handle to satisfy them.

10.1 Stage B — heading-vs-line-shape priority for the ingredient-block boundary

The two segmentation strategies in §4 — explicit Ingredients heading vs. line-shape detection of the first ingredient line — can disagree when a source's PDF carries an Ingredients heading that appears AFTER the first method marker (Method, Step 1, etc.) due to layout-driven extraction order. BBC Good Food's carbonara PDF is the canonical example: pypdf.extract_text reproduces the visual block order, in which Ingredients prints at the bottom of page 1 AFTER the method heading and Step 1 banner have appeared.

Rule: the Ingredients heading marks the ingredient-block boundary IF AND ONLY IF it appears earlier in the extracted text than the first method marker. When the heading appears after the first method marker, treat the heading as a layout artefact (NOT a boundary) and fall back to line-shape detection of the first ingredient line.
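The rule reduces to comparing two positions in the extracted text. The sketch below is illustrative: the regular expressions and the names `first_index` and `ingredient_boundary_strategy` are assumptions, as is the choice to trust the heading when the text contains no method marker at all, a case the rule does not address.

```python
import re

# Hypothetical patterns for the two kinds of marker.
INGREDIENTS_HEADING = re.compile(r"^Ingredients\b", re.IGNORECASE)
METHOD_MARKER = re.compile(r"^(?:Method\b|Step\s+\d+)")


def first_index(lines, pattern):
    """Index of the first line matching pattern, or None."""
    return next((i for i, line in enumerate(lines) if pattern.match(line)), None)


def ingredient_boundary_strategy(lines):
    """Return 'heading' or 'line-shape' per the priority rule."""
    heading = first_index(lines, INGREDIENTS_HEADING)
    method = first_index(lines, METHOD_MARKER)
    # Assumption: with no method marker, a heading is taken at face value.
    if heading is not None and (method is None or heading < method):
        return "heading"
    # Heading missing, or printed after the method started: layout artefact.
    return "line-shape"
```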

This clarification preserves both interpretations of §4: the heading is preferred when present and unambiguously placed; the line-shape detector is the fallback. The corpus's §9 carbonara worked example resolves correctly under this rule.

10.2 Stage B — line-shape detector requires an explicit unit

Header content can sometimes match a permissive ingredient-line regex. Jamie Oliver's spatchcock chicken header reads 1 HR NOT TOO TRICKY SERVES 4: two integers in non-ingredient context (a stylised time-and-difficulty banner). A line-shape detector that accepts <digit>+ <noun> would misclassify this as an ingredient line and pull 1 and 4 into the active-number sequence.

Rule: the line-shape ingredient detector REQUIRES an explicit metric or imperial unit on the line (kg, g, ml, l, cl, tbsp, tsp, oz, lb, pint, cm, in, mm). A line matching ^\s*<quantity>\s+<noun> without a unit does NOT count as an ingredient line under the line-shape fallback. Sources whose ingredient lines carry no unit (rare but possible — e.g. 3 large eggs if printed without a leading Ingredients heading) must publish the heading; line-shape alone cannot reliably distinguish these from header content.

The corpus's §9 carbonara and §11 self-test entries all satisfy this rule.
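A unit-requiring line-shape detector could be written as below. This is a sketch with stated assumptions: the name `looks_like_ingredient_line` is hypothetical, the unit is required to follow the quantity directly (the rule only says it must be on the line), leading Unicode fractions are not handled, and matching is case-sensitive so that an upper-case banner is not mistaken for a unit.

```python
import re

# Unit list taken from the rule above; matched case-sensitively.
_UNITS = r"(?:kg|g|ml|l|cl|tbsp|tsp|oz|lb|pint|cm|in|mm)"
UNIT_INGREDIENT_LINE = re.compile(rf"^\s*\d+(?:\.\d+)?\s*{_UNITS}\b")


def looks_like_ingredient_line(line: str) -> bool:
    """True if the line opens with a quantity followed by a known unit."""
    return UNIT_INGREDIENT_LINE.match(line) is not None
```

The trailing word boundary matters: it keeps the unit `l` from matching the first letter of "large" in 3 large eggs, and `in` from matching the start of "ingredients".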

10.3 Stage C — bare-digit step trailers in the method-block

Some sources print step numbers as headings preceding each paragraph (Step 1, Step 2, …); others print them as trailers following each paragraph (a single digit on its own line, used as a paragraph-end marker). Jamie Oliver's spatchcock chicken uses the trailer form. Both forms are method-block structural scaffolding and contribute no active numbers; the §9 worked example confirms this implicitly (carbonara has 12 method steps, and none of the step numbers appear in the 17-token sequence).

Rule: within the method-block ONLY, lines containing exactly one or two digits with no other content are step-trailer markers and are stripped before Stage D tokenisation. The width restriction (1–2 digits) is deliberate: a quantity like 350 on its own line is part of the BBC Good Food stacked ingredient layout and must NOT be stripped.

The width restriction is sufficient for the corpus's range of recipe sizes (no recipe has 100 or more steps). Future sources with more steps would need either heading-form numbering (Step 100, caught by the existing _STEP_MARKER regex) or numbered-list form (100. <text>, caught by _NUMBERED_STEP_MARKER). Bare three-digit trailers are not currently anticipated; the spec can extend the rule if a real source emerges.

10.4 Step-marker stripping is method-block-scoped

The strip-step-markers operation runs on the method-block ONLY, not on the joined ingredient + method text. The distinction matters because the BBC Good Food stacked ingredient layout places quantities (100, 50, 350, etc.) on their own lines, and a global strip-bare-digit-trailers pass would consume those quantities.

Rule: the step-marker stripping (heading form, numbered-list form, and bare-digit trailer form) applies to the method-block segment as part of Stage C filtering. The ingredient-block segment is NOT subject to step-marker stripping; its bare-digit lines are preserved for Stage D tokenisation.
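The scoping in §10.3 and §10.4 can be shown with the bare-digit trailer form alone. The sketch is illustrative: `BARE_STEP_TRAILER` and `strip_step_trailers` are hypothetical names, the segments are assumed to be lists of lines keyed by segment kind, and the heading and numbered-list forms are left out.

```python
import re

# One or two digits and nothing else on the line.
BARE_STEP_TRAILER = re.compile(r"^\s*[0-9]{1,2}\s*$")


def strip_step_trailers(segments):
    """Remove bare-digit trailers from the method-block only."""
    cleaned = dict(segments)   # other segments are passed through unchanged
    cleaned["method-block"] = [
        line for line in segments["method-block"]
        if not BARE_STEP_TRAILER.match(line)
    ]
    return cleaned
```

Applied to a stacked layout, a two-digit quantity such as 50 on its own line survives in the ingredient-block, while the same text in the method-block would be treated as a step trailer.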

11. Tokeniser self-test corpus — source PDFs

This table records the tokeniser's expected output across nine source-recipe PDFs in /recipes/. It is a self-test for implementations of cookpit-active-number-sequence-v3.2.0, not a list of authored cooking files. An implementation that produces matching sequences for all nine MUST also produce matching hashes.

Of the nine source PDFs below, three have authenticated cooking files published at https://cookchow.com/recipes/3.2/: spaghetti_carbonara, perfect_boeuf_bourguignon and pork_three_ways (authored as pork-fillet-braised-cheeks-and-pork-belly). Two further cooking files published there, authentic_hungarian_goulash and roast-chicken-with-cider-and-sage, have no row in this table. The remaining six (rhubarb_apple_crumble, fish_pie_cheese_mash, smoked_salmon_watercress_pate, lemon_pavlova, ham_hock_yellow_bean_sauce_chow_mein, salmon_pasta_bake) exist as source PDFs only and are retained here as tokeniser test coverage.

Source PDF	Sequence length	First 8 tokens
spaghetti_carbonara	17	`100-50-50-3-350-2-50-100`
rhubarb_apple_crumble	16	`450-3-350-3-1-120-200-1`
fish_pie_cheese_mash	53	`400-14-1-2-500-1-2-40`
smoked_salmon_watercress_pate	29	`100-3-5-4-150-5-5-350`
lemon_pavlova	26	`6-375-13-2-5-2-300-10`
perfect_boeuf_bourguignon	54	`1-6-3-8-4-5-200-7`
ham_hock_yellow_bean_sauce_chow_mein	48	`800-1-12-5-5-2-1-2`
salmon_pasta_bake	35	`750-1-25-120-4-5-50-1`
pork_three_ways	54	`4-9-45-1-4-1000-4-2`

For source PDFs with an active cooking file, the full sequence and hash strings are at cookpit.quantitativeFingerprint.{sequence, hash.value} in the file.
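The self-test table lends itself to a quick partial check of an implementation's output. In this sketch the table values are copied from above, while the names `EXPECTED` and `matches_corpus_entry` are hypothetical. Since the table publishes only the length and the first eight tokens, a pass here is necessary but not sufficient for a matching hash.

```python
# (sequence length, first 8 tokens) per source PDF, from the table above.
EXPECTED = {
    "spaghetti_carbonara": (17, "100-50-50-3-350-2-50-100"),
    "rhubarb_apple_crumble": (16, "450-3-350-3-1-120-200-1"),
    "fish_pie_cheese_mash": (53, "400-14-1-2-500-1-2-40"),
    "smoked_salmon_watercress_pate": (29, "100-3-5-4-150-5-5-350"),
    "lemon_pavlova": (26, "6-375-13-2-5-2-300-10"),
    "perfect_boeuf_bourguignon": (54, "1-6-3-8-4-5-200-7"),
    "ham_hock_yellow_bean_sauce_chow_mein": (48, "800-1-12-5-5-2-1-2"),
    "salmon_pasta_bake": (35, "750-1-25-120-4-5-50-1"),
    "pork_three_ways": (54, "4-9-45-1-4-1000-4-2"),
}


def matches_corpus_entry(name: str, sequence: str) -> bool:
    """Compare a computed sequence with the published length and prefix."""
    expected_length, expected_prefix = EXPECTED[name]
    tokens = sequence.split("-")
    return (
        len(tokens) == expected_length
        and "-".join(tokens[:8]) == expected_prefix
    )
```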

12. Conformance

A v3.2 file conforms to cookpit-active-number-sequence-v3.2.0 if and only if:

Its cookpit.quantitativeFingerprint.normalization field equals cookpit-active-number-sequence-v3.2.0.
Its sequence value matches the tokens produced by stages A-E applied to the source recipe.
Its hash.value is the lowercase SHA-256 hex digest of the sequence string per stage F.

The validator's V-FINGERPRINT-A checks shape (and self-consistency of hash against sequence). V-FINGERPRINT-B re-extracts the sequence from the source recipe per this profile and compares.
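The shape and self-consistency part of conformance needs no access to the source recipe, so it can be sketched on its own. This is an illustration, not the validator: `check_fingerprint_shape` is a hypothetical name, the argument is assumed to be the already-parsed `cookpit.quantitativeFingerprint` object as a dict of strings, and the source-recomputation step of V-FINGERPRINT-B is out of scope.

```python
import hashlib
import re

PROFILE = "cookpit-active-number-sequence-v3.2.0"
SEQUENCE_RE = re.compile(r"[0-9]+(-[0-9]+)*")


def check_fingerprint_shape(fingerprint: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape checks pass."""
    problems = []
    if fingerprint.get("normalization") != PROFILE:
        problems.append("normalization does not name this profile")
    sequence = fingerprint.get("sequence", "")
    if not SEQUENCE_RE.fullmatch(sequence):
        problems.append("sequence does not match the required shape")
    # The stored hash must be the lowercase SHA-256 of the stored sequence.
    expected = hashlib.sha256(sequence.encode("utf-8")).hexdigest()
    if fingerprint.get("hash", {}).get("value") != expected:
        problems.append("hash.value is not the SHA-256 of sequence")
    return problems
```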
