Cookpit v3.2 — Canonical Active-Number-Sequence Normalisation

This is the canonical normalisation referenced by rules.md K3 and K4 as cookpit-active-number-sequence-v3.2.0. It publishes the exact tokenisation rules that turn a source recipe's text into the dash-separated active-number sequence stored at cookpit.quantitativeFingerprint.sequence.

Implementations MUST follow this profile exactly. Two independent implementations using this profile against the same source recipe MUST produce the byte-identical sequence string and therefore the byte-identical SHA-256 hash. The validator's V-FINGERPRINT-B hard criterion checks the sequence by recomputing it from the source text and comparing.

Stage scoping. This is the stage-1 source fingerprint. The AI Chef populates it at generation; the validator re-checks it at stage 2 as the source-faithfulness gate. It is computed from the source recipe, not from the cooking file. It is distinct from the file fingerprint at cookpit.attestation.fileFingerprint, which is the stage-3 file-content hash issued by the validator at attestation time and verified by stage-4 consumers. See rules.md A0.6 for the lifecycle separation between the two fingerprints.


1. What the fingerprint is

The strict quantitative fingerprint records every cooking-relevant number stated by the source recipe, in the order the source states it. The fingerprint exists so that:

  1. Two cooking files claiming to be derived from the same source produce the same fingerprint (a determinism check).
  2. A cooking file's source can be audited from its fingerprint (a content check).
  3. Library indexing can group recipes by source numeric skeleton regardless of tone, format, or chef-detective rewrite.

The fingerprint is NOT a hash of the file's planned tasks; it is a hash of the SOURCE recipe's stated numbers. The chef-detective's deductive expansion does not change the fingerprint.


2. The pipeline

source PDF / text

│ STAGE A — extraction

plain UTF-8 text

│ STAGE B — segmentation

ordered list of source segments (header / ingredient block / method / tips)

│ STAGE C — filtering

in-scope segments only (header excluded; tips excluded; chrome filtered)

│ STAGE D — number tokenisation

ordered list of active numbers

│ STAGE E — sequence rendering

dash-joined string `<n1>-<n2>-…-<nk>`

│ STAGE F — hashing

SHA-256 hex digest

Each stage is mechanically defined below.


3. STAGE A — text extraction

Implementations MAY extract source text from PDF, web page, plain text, or other carriers. Whatever the carrier, the extraction MUST produce a single UTF-8 string with the following normalisations:

  1. Ligatures expanded. PDF ligatures ﬀ, ﬁ, ﬂ, ﬃ, ﬄ, ﬅ, ﬆ (U+FB00..U+FB06) are expanded to their component letters (ff, fi, fl, ffi, ffl, st, st).
  2. Curly quotes folded. ‘ (U+2018), ’ (U+2019) → ' (U+0027); “ (U+201C), ” (U+201D) → " (U+0022). Source apostrophes become straight quotes for tokenisation; the original may be preserved in recipeInstructions[] schema.org pass-through.
  3. Soft hyphens removed. U+00AD soft hyphens are stripped; the text on either side is concatenated.
  4. Whitespace normalised. All Unicode whitespace classes collapse to single ASCII space (U+0020); leading/trailing whitespace per segment trimmed.
  5. Diacritics preserved. Characters with diacritics (pâté, Gruyère, purée) are NOT stripped. Only the active-number tokeniser ignores letters; diacritics in identifying text matter for stage B segmentation.

The extraction is a one-shot operation per source. Re-extraction must produce byte-identical results given the same input.
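The five normalisations above could be sketched in Python roughly as follows. This is a minimal illustration, not the reference implementation: the names normalise_line and normalise_extracted_text are hypothetical, and the sketch assumes that whitespace is collapsed within each line (and empty lines dropped) so that line boundaries survive for stage B.

```python
# Hypothetical sketch of the stage A normalisations; names are illustrative only.
_CHAR_MAP = str.maketrans({
    # 1. Ligatures U+FB00..U+FB06 expanded to component letters.
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi",
    "\ufb04": "ffl", "\ufb05": "st", "\ufb06": "st",
    # 2. Curly quotes folded to straight quotes.
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    # 3. Soft hyphens removed (mapping to None deletes the character).
    "\u00ad": None,
})


def normalise_line(line: str) -> str:
    """Apply the character folds, then collapse whitespace runs and trim."""
    # str.split() with no argument splits on any Unicode whitespace.
    return " ".join(line.translate(_CHAR_MAP).split())


def normalise_extracted_text(raw: str) -> str:
    """Normalise each line; diacritics are left untouched (rule 5).

    Assumption: line breaks are kept and empty lines dropped, so that the
    line-oriented segmentation in stage B still has lines to work with.
    """
    lines = (normalise_line(line) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

Because the function is pure and table-driven, running it twice on the same input yields the same string, which is the property the re-extraction requirement asks for.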


4. STAGE B — segmentation

The extracted text is partitioned into ordered segments of four kinds:

| Segment kind | Definition |
| --- | --- |
| header | Title block plus author, prep/cook/total/serve/dietary metadata |
| ingredient-block | The recipe's bulleted or otherwise enumerated ingredient list |
| method-block | The numbered or otherwise sequenced cooking instructions |
| tips-block | Any "Recipe tips", "Notes", "Variations" section after the method |

Heuristics for segmentation:

  1. The header is everything from the start of the source to (but not including) the first ingredient line. Ingredient lines are identified by patterns like <quantity><unit> of <noun> or <quantity> <noun> matching the source's bulleted form.
  2. The ingredient-block ends at the first method-numbering pattern (Method, 1., Step 1, etc.) or at a bold/heading divider.
  3. The method-block ends at the first tips-style heading (Recipe tips, Notes, Variations, Tips) or at the end of the text.
  4. The tips-block includes everything after that heading until the end of the text.

Stage C filtering operates on each segment individually, so a segmentation error changes which numbers reach the sequence.
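The four heuristics can be read as a small state machine over the extracted lines. The sketch below is one possible reading, not the published algorithm: the three regexes and the segment function are hypothetical, the ingredient-line test requires an explicit unit (as §10.2 later specifies), and bold/heading dividers are not modelled.

```python
import re

# Hypothetical patterns; a real implementation may use different ones.
INGREDIENT_LINE = re.compile(
    r"^\d+(\.\d+)?\s?(kg|g|ml|l|cl|tbsp|tsp|oz|lb|pint|cm|in|mm)\b", re.I
)
METHOD_MARKER = re.compile(r"^(method\b|step\s+\d+|\d+\.\s)", re.I)
TIPS_HEADING = re.compile(r"^(recipe tips|notes|variations|tips)\b", re.I)


def segment(lines):
    """Partition lines into the four segment kinds, in source order."""
    segments = {
        "header": [],
        "ingredient-block": [],
        "method-block": [],
        "tips-block": [],
    }
    kind = "header"
    for line in lines:
        # Each transition can fire once, and only from the preceding kind.
        if kind == "header" and INGREDIENT_LINE.match(line):
            kind = "ingredient-block"
        elif kind == "ingredient-block" and METHOD_MARKER.match(line):
            kind = "method-block"
        elif kind == "method-block" and TIPS_HEADING.match(line):
            kind = "tips-block"
        segments[kind].append(line)
    return segments
```

Once the ingredient-block has started, unit-less lines such as 3 large eggs stay inside it because only a method marker can end the block.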


5. STAGE C — filtering

Only in-scope segments contribute to the active-number sequence:

| Segment kind | In scope? |
| --- | --- |
| header | NO (yield, prep time and cook time are metadata, not active cooking) |
| ingredient-block | YES |
| method-block | YES |
| tips-block | NO (tips are out-of-band guidance, not active cooking) |

Within in-scope segments, additional filters:

  1. Sponsored content is filtered. Lines that read as advertising (e.g. BECOMEAMEMBER, Try our app, Finish Ultimate Plus, product placement) are removed before tokenisation. Implementations SHOULD maintain a configurable blocklist of brand-name patterns; the default list at v3.2.0 publication is:

    BECOMEAMEMBER
    Try our app
    You have <N> remaining read(s) today
    Finish Ultimate Plus

    Future revisions may add patterns; the version identifier cookpit-active-number-sequence-v3.2.0 freezes the v3.2.0 list above.

  2. Paywall chrome is filtered. Lines like You have two remaining reads today (and similar paywall markers from web-extracted PDFs) are removed.

  3. Source typos in numeric content are NOT silently corrected. If the source says 1000ml of pork stock twice in the ingredient block, both 1000 instances enter the sequence. The chef-detective corrects typos in tasks[].action text but the fingerprint sees the source's numbers as written.

  4. Source typos in non-numeric content (carbonara's "plump garlic", chow mein's "in the work") have no effect on the sequence — they are not numbers.
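A filtering pass along these lines might look like the sketch below. It assumes segments arrive as a dict keyed by segment kind, that a line is dropped whenever a default-list pattern occurs anywhere in it, and that <N> may be a word or a digit; filter_segments and CHROME_PATTERNS are hypothetical names.

```python
import re

# The v3.2.0 default list, expressed as hypothetical regexes.
CHROME_PATTERNS = [
    re.compile(r"BECOMEAMEMBER"),
    re.compile(r"Try our app"),
    re.compile(r"You have \w+ remaining reads? today"),
    re.compile(r"Finish Ultimate Plus"),
]

# Header and tips-block are out of scope and never reach the tokeniser.
IN_SCOPE = ("ingredient-block", "method-block")


def filter_segments(segments: dict) -> list:
    """Return in-scope lines in source order with chrome lines removed."""
    kept = []
    for kind in IN_SCOPE:
        for line in segments.get(kind, []):
            if any(p.search(line) for p in CHROME_PATTERNS):
                continue  # sponsored content or paywall chrome
            # Numeric typos and duplicates are kept exactly as written.
            kept.append(line)
    return kept
```

Note that the duplicated 1000ml line from rule 3 would pass through twice; the filter removes chrome, never source numbers.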


6. STAGE D — number tokenisation

Within filtered in-scope text, the tokeniser walks left-to-right and emits an active-number token for each match of one of the following patterns. Patterns are matched in priority order (top to bottom); a match consumes its source span and the walk resumes immediately after it.

6.1 Patterns (priority order)

| # | Pattern | Tokenisation | Example |
| --- | --- | --- | --- |
| 1 | <int>g/<int>lb <int>oz | three tokens: int, lb-int, oz-int | 1.6kg/3lb 8oz → 1, 6, 3, 8 |
| 2 | <int>kg/<int>lb <int>oz | three tokens | |
| 3 | <int>(\.<int>)?(kg\|g\|ml\|l\|cl)/<int><frac>?(oz\|fl oz\|pint\|in\|cm) | metric whole + fractional digits, then imperial whole + fractional digits | 40g/1½oz → 40, 1, 5 |
| 4 | <int><frac> (whole + Unicode fraction) | two tokens: whole, fraction-digit | 1½ → 1, 5 |
| 5 | <int>/<int> (literal fraction) | two tokens: numerator, denominator | 1/4 → 1, 4 |
| 6 | <int>(\.<int>) | two tokens: whole, decimal-fraction-digits | 1.5 → 1, 5 |
| 7 | <int>-<int> (range) | two tokens: range-low, range-high | 2-3 minutes → 2, 3 |
| 8 | <int> (bare integer in cooking context) | one token: int | 5 minutes → 5 |

6.2 Unicode fraction normalisation

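| Symbol | Tokenises as |
| --- | --- |
| ¼ | 25 |
| ½ | 5 |
| ¾ | 75 |
| ⅓ | 33 |
| ⅔ | 66 |
| ⅕ | 2 |
| ⅖ | 4 |
| ⅗ | 6 |
| ⅘ | 8 |
| ⅙ | 16 |
| ⅚ | 83 |
| ⅛ | 125 |
| ⅜ | 375 |
| ⅝ | 625 |
| ⅞ | 875 |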

Rule: take the decimal expansion of the fraction, drop the leading 0., and emit the trailing digits without trailing zeros. Repeating expansions are truncated, not rounded, to two digits in the table (1/3 → 33, 2/3 → 66, 1/6 → 16, 5/6 → 83). The above table is exhaustive for fractions appearing in the source corpus; fractions outside this table are tokenised as <numerator>, <denominator> per pattern #5.
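The rule can be checked against the table with a few lines of long division. The sketch below is illustrative only: fraction_digits is a hypothetical helper, and the two-digit cut-off for repeating expansions is an assumption inferred from the table entries 33, 66, 16 and 83 rather than something the rule states.

```python
def fraction_digits(numerator: int, denominator: int,
                    repeat_width: int = 2, max_digits: int = 12) -> str:
    """Digits after '0.' for a proper fraction, per the table's convention.

    Terminating expansions are emitted in full (1/8 -> '125'); long division
    stops at remainder zero, so no trailing zeros are ever produced.
    Expansions that have not terminated after max_digits are treated as
    repeating and truncated to repeat_width digits (assumed from the table).
    """
    digits = []
    remainder = numerator % denominator
    while remainder and len(digits) < max_digits:
        remainder *= 10
        digits.append(str(remainder // denominator))
        remainder %= denominator
    if remainder:  # never reached zero: repeating expansion
        digits = digits[:repeat_width]
    return "".join(digits)
```

For example, fraction_digits(3, 8) walks 30/8, 60/8, 40/8 and stops with the digits 375, while fraction_digits(5, 6) never terminates and is cut to 83.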

6.3 Cooking context

A bare integer (pattern #8) is tokenised as an active number IF and ONLY IF it sits within a cooking context. A cooking context is any of:

  • An ingredient line (in ingredient-block).
  • A method sentence in method-block.
  • A duration phrase (<n> minutes, <n> hours, <n> seconds).
  • A temperature phrase (<n>°C, <n>C, <n>°F, <n>F, gas mark <n>).
  • A multiplicity phrase (<n> batches, <n> times, <n>×, <n> sides).
  • A fraction-of phrase (a third, half the, quarter of, three-quarters).
  • A size phrase (<n>cm, <n>in).

Numbers in non-cooking context (e.g. 4.9 ratings, 27 ratings, Page 1 of 3, ISBN/SKU references) are not tokenised. Most page metadata appears in header or after the method, both already filtered in stage C.
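A priority-ordered walk over the §6.1 patterns might be sketched as follows. Several assumptions apply: the regexes are a literal transcription of the pattern table (patterns #1 and #2 merged into one expression), the input is a line that stage C has already passed as in scope, so every bare integer is taken to sit in a cooking context, and lone Unicode fractions with no leading integer are not handled because the table does not list them. The names tokenise, PATTERNS and FRACTION_DIGITS are hypothetical.

```python
import re

# Unicode fraction table from section 6.2, written as escapes.
FRACTION_DIGITS = {
    "\u00bc": "25", "\u00bd": "5", "\u00be": "75",
    "\u2153": "33", "\u2154": "66",
    "\u2155": "2", "\u2156": "4", "\u2157": "6", "\u2158": "8",
    "\u2159": "16", "\u215a": "83",
    "\u215b": "125", "\u215c": "375", "\u215d": "625", "\u215e": "875",
}
FRAC = "[" + "".join(FRACTION_DIGITS) + "]"

# Priority order, top to bottom, as in the pattern table.
PATTERNS = [
    re.compile(r"(\d+)(?:kg|g)/(\d+)lb (\d+)oz"),                  # 1 and 2
    re.compile(r"(\d+)(?:\.(\d+))?(?:kg|g|ml|l|cl)/(\d+)("
               + FRAC + r")?(?:oz|fl oz|pint|in|cm)"),             # 3
    re.compile(r"(\d+)(" + FRAC + r")"),                           # 4
    re.compile(r"(\d+)/(\d+)"),                                    # 5
    re.compile(r"(\d+)\.(\d+)"),                                   # 6
    re.compile(r"(\d+)-(\d+)"),                                    # 7
    re.compile(r"(\d+)"),                                          # 8
]


def tokenise(text: str) -> list:
    """Walk left to right; the first pattern matching at pos consumes its span."""
    tokens, pos = [], 0
    while pos < len(text):
        for pattern in PATTERNS:
            match = pattern.match(text, pos)
            if match:
                for group in match.groups():
                    if group is not None:
                        # Fraction glyphs map to their table digits.
                        tokens.append(FRACTION_DIGITS.get(group, group))
                pos = match.end()
                break
        else:
            pos += 1  # no pattern starts here; advance one character
    return tokens
```

Under this sketch, the table's first example falls through to patterns #6 and #8 and still yields 1, 6, 3, 8, because neither combined-unit pattern accepts a decimal metric quantity followed by lb.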

6.4 Yield handling

Source's Serves N, Serves N-M, Makes N patterns are in the header segment and therefore filtered in stage C. Yield numbers do NOT enter the active-number sequence.

6.5 Source typo handling for numbers

Per stage C rule 3: source numeric typos are preserved verbatim. If the source has 100g/3½oz in the ingredient block AND 100g pancetta in the method (carbonara pattern), both 100s are tokenised: the fingerprint reflects every occurrence of a number in the source, not the set of distinct quantities.


7. STAGE E — sequence rendering

The ordered list of tokens from stage D is rendered as a dash-separated string:

sequence = "<token1>-<token2>-…-<tokenN>"

Tokens are decimal integer strings, no leading zeros (except for literal zero, which is not expected to appear in the corpus — fractions handle the leading-zero cases).

The sequence string MUST match the regex ^[0-9]+(-[0-9]+)*$ (rule K4 / V-FINGERPRINT-A schema check).
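Rendering and the shape check fit in a few lines. The sketch below assumes tokens arrive as a list of digit strings; render_sequence is a hypothetical name, and fullmatch is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

SEQUENCE_RE = re.compile(r"^[0-9]+(-[0-9]+)*$")


def render_sequence(tokens: list) -> str:
    """Join tokens with dashes and enforce the K4 / V-FINGERPRINT-A shape."""
    sequence = "-".join(tokens)
    if not SEQUENCE_RE.fullmatch(sequence):
        raise ValueError(f"sequence does not match required shape: {sequence!r}")
    return sequence
```

An empty token list renders to the empty string, which the regex rejects, so a source with no active numbers surfaces as an error rather than as an empty fingerprint.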


8. STAGE F — hashing

The sequence string is encoded as UTF-8 (no BOM, no trailing newline) and hashed with SHA-256 (FIPS 180-4). The 64-character lowercase hex digest is the value of cookpit.quantitativeFingerprint.hash.value.
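With Python's standard hashlib module the hashing step is a one-liner. The function name fingerprint_hash is hypothetical; the only behaviour assumed is what the paragraph above states (UTF-8 bytes, no BOM, no trailing newline, lowercase hex).

```python
import hashlib


def fingerprint_hash(sequence: str) -> str:
    """SHA-256 of the sequence string's UTF-8 bytes, as 64 lowercase hex chars."""
    # str.encode("utf-8") never emits a BOM, and nothing is appended,
    # so the bytes hashed are exactly the characters of the sequence.
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(fingerprint_hash("100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5"))
```

A common source of mismatches between implementations is hashing a line read from a file without removing its trailing newline; the digest then differs even though the sequences look identical.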


9. Worked example: recipes/spaghetti_carbonara_recipe.pdf

Source: Angela Nilsen, BBC Good Food, "Ultimate spaghetti carbonara".

Stage A — extraction

After ligature/quote/whitespace normalisation, the in-scope source text contains the ingredient block:

100g pancetta
50g pecorino cheese
50g parmesan
3 large eggs
350g spaghetti
2 plump garlic cloves, peeled and left whole
50g unsalted butter
sea salt and freshly ground black pepper

and the method block (Steps 1-12, abridged here for the parts that contribute numbers):

Step 2: Finely chop the 100g pancetta ... Finely grate 50g pecorino
cheese and 50g parmesan ...
Step 3: Beat the 3 large eggs ...
Step 4: Add 1 tsp salt to the boiling water, add 350g spaghetti and
when the water comes back to the boil, cook at a constant
simmer, covered, for 10 minutes or until al dente ...
Step 5: Squash 2 peeled plump garlic cloves ...
Step 6: ... Drop 50g unsalted butter into a large frying pan or wok ...
Step 7: Leave to cook on a medium heat for about 5 minutes ...

Stage B — segmentation

| Segment kind | Lines |
| --- | --- |
| header | "Ultimate spaghetti carbonara recipe / Angela Nilsen / Serves 4 Easy / Prep: 15 mins - 20 mins Cook: 15 mins / Discover how to make traditional…" |
| ingredient-block | the 8 ingredient lines above |
| method-block | Steps 1-12 |
| tips-block | (none in this source) |

Stage C — filtering

Header and tips removed. In-scope = ingredient-block + method-block.

Stage D — tokenisation

| Source span | Pattern | Tokens |
| --- | --- | --- |
| 100g pancetta (ingredient line) | #8 | 100 |
| 50g pecorino cheese (ingredient line) | #8 | 50 |
| 50g parmesan (ingredient line) | #8 | 50 |
| 3 large eggs (ingredient line) | #8 | 3 |
| 350g spaghetti (ingredient line) | #8 | 350 |
| 2 plump garlic cloves (ingredient line) | #8 | 2 |
| 50g unsalted butter (ingredient line) | #8 | 50 |
| (sea salt and pepper: no numbers) | - | - |
| 100g pancetta (Step 2 method) | #8 | 100 |
| 50g pecorino (Step 2 method) | #8 | 50 |
| 50g parmesan (Step 2 method) | #8 | 50 |
| 3 large eggs (Step 3 method) | #8 | 3 |
| 1 tsp salt (Step 4 method) | #8 | 1 |
| 350g spaghetti (Step 4 method) | #8 | 350 |
| 10 minutes (Step 4 method) | #8 | 10 |
| 2 peeled plump garlic cloves (Step 5) | #8 | 2 |
| 50g unsalted butter (Step 6 method) | #8 | 50 |
| 5 minutes (Step 7 method) | #8 | 5 |

Stage E — sequence rendering

100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5

17 tokens, matching the regex ^[0-9]+(-[0-9]+)*$.

Stage F — hashing

sha256("100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5") =
5d1ce74bfc00489e677ab7e321a818eade01ea64fe46d0aaf67d506f7d1ceda8

This matches the value at cookpit.quantitativeFingerprint.hash.value in the published spaghetti_carbonara.v3.2.cpt.A.jsonld example at https://cookchow.com/recipes/3.2/.


10. Worked clarifications surfaced by productisation

The following clarifications were surfaced when the executable implementation in scripts/lib/source_tokeniser.py was first run against the corpus. Each is a spec tightening that resolves an ambiguity in §3–§6 without changing the published intent. The §9 worked example and the §11 self-test corpus remain canonical; this section documents the edge cases the productised tokeniser must handle to satisfy them.

10.1 Stage B — heading-vs-line-shape priority for the ingredient-block boundary

The two segmentation strategies in §4 — explicit Ingredients heading vs. line-shape detection of the first ingredient line — can disagree when a source's PDF carries an Ingredients heading that appears AFTER the first method marker (Method, Step 1, etc.) due to layout-driven extraction order. BBC Good Food's carbonara PDF is the canonical example: pypdf.extract_text reproduces the visual block order, in which Ingredients prints at the bottom of page 1 AFTER the method heading and Step 1 banner have appeared.

Rule: the Ingredients heading marks the ingredient-block boundary IF AND ONLY IF it appears earlier in the extracted text than the first method marker. When the heading appears after the first method marker, treat the heading as a layout artefact (NOT a boundary) and fall back to line-shape detection of the first ingredient line.

This clarification preserves both interpretations of §4: the heading is preferred when present and unambiguously placed; the line-shape detector is the fallback. The corpus's §9 carbonara worked example resolves correctly under this rule.
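The priority rule could be sketched as below. All names and regexes are hypothetical stand-ins, and two details are assumptions the rule does not settle: when no method marker exists at all, a heading is treated as valid, and the block is taken to begin on the line after the heading.

```python
import re

# Hypothetical patterns for illustration only.
INGREDIENTS_HEADING = re.compile(r"^ingredients\b", re.I)
METHOD_MARKER = re.compile(r"^(method\b|step\s+\d+)", re.I)
UNIT_LINE = re.compile(
    r"^\d+(\.\d+)?\s?(kg|g|ml|l|cl|tbsp|tsp|oz|lb|pint|cm|in|mm)\b", re.I
)


def first_index(lines, pattern):
    """Index of the first line matching pattern, or None."""
    return next((i for i, line in enumerate(lines) if pattern.match(line)), None)


def ingredient_block_start(lines):
    """Prefer the heading when it precedes the first method marker."""
    heading = first_index(lines, INGREDIENTS_HEADING)
    method = first_index(lines, METHOD_MARKER)
    if heading is not None and (method is None or heading < method):
        return heading + 1  # assumed: block starts after the heading line
    # Heading missing or misplaced: fall back to line-shape detection.
    return first_index(lines, UNIT_LINE)
```

In the layout described for the carbonara PDF, the heading index is greater than the method index, so the function ignores the heading and returns the first unit-bearing line instead.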

10.2 Stage B — line-shape detector requires an explicit unit

Header content can sometimes match a permissive ingredient-line regex. Jamie Oliver's spatchcock chicken header reads 1 HR NOT TOO TRICKY SERVES 4: two integers in non-ingredient context (a stylised time-and-difficulty banner). A line-shape detector that accepts <digit>+ <noun> would misclassify this as an ingredient line and pull 1 and 4 into the active-number sequence.

Rule: the line-shape ingredient detector REQUIRES an explicit metric or imperial unit on the line (kg, g, ml, l, cl, tbsp, tsp, oz, lb, pint, cm, in, mm). A line matching ^\s*<quantity>\s+<noun> without a unit does NOT count as an ingredient line under the line-shape fallback. Sources whose ingredient lines carry no unit (rare but possible — e.g. 3 large eggs if printed without a leading Ingredients heading) must publish the heading; line-shape alone cannot reliably distinguish these from header content.

The corpus's §9 carbonara and §11 self-test entries all satisfy this rule.

10.3 Stage C — bare-digit step trailers in the method-block

Some sources print step numbers as headings preceding each paragraph (Step 1, Step 2, …); others print them as trailers following each paragraph (a single digit on its own line, used as a paragraph-end marker). Jamie Oliver's spatchcock chicken uses the trailer form. Both forms are method-block structural scaffolding and contribute no active numbers; the §9 worked example confirms this implicitly (carbonara has 12 method steps, and none of the step numbers appears in the 17-token sequence).

Rule: within the method-block ONLY, lines containing exactly one or two digits with no other content are step-trailer markers and are stripped before Stage D tokenisation. The width restriction (1–2 digits) is deliberate: a quantity like 350 on its own line is part of the BBC Good Food stacked ingredient layout and must NOT be stripped.

The width restriction is sufficient for the corpus's range of recipe sizes (no recipe has 100 or more steps). Future sources with more steps would need either heading-form numbering (Step 100, caught by the existing _STEP_MARKER regex) or numbered-list form (100. <text>, caught by _NUMBERED_STEP_MARKER). Bare three-digit trailers are not currently anticipated; the spec can extend the rule if a real source emerges.

10.4 Step-marker stripping is method-block-scoped

The strip-step-markers operation runs on the method-block ONLY, not on the joined ingredient + method text. The distinction matters because the BBC Good Food stacked ingredient layout places quantities (100, 50, 350, etc.) on their own lines, and a global strip-bare-digit-trailers pass would consume those quantities.

Rule: the step-marker stripping (heading form, numbered-list form, and bare-digit trailer form) applies to the method-block segment as part of Stage C filtering. The ingredient-block segment is NOT subject to step-marker stripping; its bare-digit lines are preserved for Stage D tokenisation.
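The three marker forms from §10.3 and §10.4 might be stripped as in the sketch below. The regexes here are hypothetical stand-ins, not the _STEP_MARKER and _NUMBERED_STEP_MARKER expressions of the productised tokeniser, whose definitions this document does not reproduce.

```python
import re

# Hypothetical stand-ins for the three marker forms.
STEP_HEADING = re.compile(r"^step\s+\d+\s*:?\s*", re.I)   # "Step 2: ..."
NUMBERED_STEP = re.compile(r"^\d+\.\s+")                   # "12. ..."
BARE_TRAILER = re.compile(r"\d{1,2}")                      # "3" alone on a line


def strip_step_markers(method_lines):
    """Remove step scaffolding from method-block lines only.

    Callers must not pass ingredient-block lines: stacked quantities such as
    a lone 50 would be mistaken for a two-digit trailer.
    """
    cleaned = []
    for line in method_lines:
        if BARE_TRAILER.fullmatch(line.strip()):
            continue  # one- or two-digit trailer; three digits are kept
        line = STEP_HEADING.sub("", line)
        line = NUMBERED_STEP.sub("", line)
        if line:
            cleaned.append(line)
    return cleaned
```

Keeping the function's input restricted to the method-block is what enforces the scoping rule; the function itself cannot tell a trailer from a stacked quantity of the same width.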


11. Tokeniser self-test corpus — source PDFs

This table records the tokeniser's expected output across nine source-recipe PDFs in /recipes/. It is a self-test for implementations of cookpit-active-number-sequence-v3.2.0, not a list of authored cooking files. An implementation that produces matching sequences for all nine MUST also produce matching hashes.

Five source recipes have authenticated cooking files published at https://cookchow.com/recipes/3.2/: spaghetti_carbonara, perfect_boeuf_bourguignon, pork_three_ways (authored as pork-fillet-braised-cheeks-and-pork-belly), authentic_hungarian_goulash and roast-chicken-with-cider-and-sage. Three of them (spaghetti_carbonara, perfect_boeuf_bourguignon, pork_three_ways) appear among the nine source PDFs below. The remaining six entries (rhubarb_apple_crumble, fish_pie_cheese_mash, smoked_salmon_watercress_pate, lemon_pavlova, ham_hock_yellow_bean_sauce_chow_mein, salmon_pasta_bake) exist as source PDFs only and are retained here as tokeniser test coverage.

| Source PDF | Sequence length | First 8 tokens |
| --- | --- | --- |
| spaghetti_carbonara | 17 | 100-50-50-3-350-2-50-100 |
| rhubarb_apple_crumble | 16 | 450-3-350-3-1-120-200-1 |
| fish_pie_cheese_mash | 53 | 400-14-1-2-500-1-2-40 |
| smoked_salmon_watercress_pate | 29 | 100-3-5-4-150-5-5-350 |
| lemon_pavlova | 26 | 6-375-13-2-5-2-300-10 |
| perfect_boeuf_bourguignon | 54 | 1-6-3-8-4-5-200-7 |
| ham_hock_yellow_bean_sauce_chow_mein | 48 | 800-1-12-5-5-2-1-2 |
| salmon_pasta_bake | 35 | 750-1-25-120-4-5-50-1 |
| pork_three_ways | 54 | 4-9-45-1-4-1000-4-2 |

For source PDFs with an active cooking file, the full sequence and hash strings are at cookpit.quantitativeFingerprint.{sequence, hash.value} in the file.


12. Conformance

A v3.2 file conforms to cookpit-active-number-sequence-v3.2.0 if and only if:

  1. Its cookpit.quantitativeFingerprint.normalization field equals cookpit-active-number-sequence-v3.2.0.
  2. Its sequence value matches the tokens produced by stages A-E applied to the source recipe.
  3. Its hash.value is the lowercase SHA-256 hex digest of the sequence string per stage F.

The validator's V-FINGERPRINT-A checks shape (and self-consistency of hash against sequence). V-FINGERPRINT-B re-extracts the sequence from the source recipe per this profile and compares.
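The three conformance conditions can be combined into one check, sketched below. This is not the validator's code: check_fingerprint and its problem labels are hypothetical, the fingerprint block is assumed to be a dict shaped like cookpit.quantitativeFingerprint (normalization, sequence, hash.value), and the recomputed sequence is assumed to come from a separate run of stages A-E on the source.

```python
import hashlib
import re

PROFILE = "cookpit-active-number-sequence-v3.2.0"
SEQUENCE_RE = re.compile(r"[0-9]+(-[0-9]+)*")


def check_fingerprint(fp: dict, recomputed_sequence: str) -> list:
    """Return a list of problem labels; an empty list means conformant."""
    problems = []
    # Condition 1: the normalization field names this profile.
    if fp.get("normalization") != PROFILE:
        problems.append("normalization")
    sequence = fp.get("sequence", "")
    # Shape check, in the spirit of V-FINGERPRINT-A.
    if not SEQUENCE_RE.fullmatch(sequence):
        problems.append("sequence-shape")
    # Condition 3: hash.value is the SHA-256 of the stored sequence.
    expected = hashlib.sha256(sequence.encode("utf-8")).hexdigest()
    if fp.get("hash", {}).get("value") != expected:
        problems.append("hash")
    # Condition 2: stored sequence equals the one recomputed from the source,
    # in the spirit of V-FINGERPRINT-B.
    if sequence != recomputed_sequence:
        problems.append("sequence-mismatch")
    return problems
```

Returning labels rather than a boolean lets a caller distinguish a self-inconsistent file (hash) from a file that is internally consistent but unfaithful to its source (sequence-mismatch).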