Cookpit v3.2 — Canonical Active-Number-Sequence Normalisation
This is the canonical normalisation referenced by `rules.md` K3 and K4 as `cookpit-active-number-sequence-v3.2.0`. This document publishes the exact tokenisation rules that turn a source recipe's text into the dash-separated active-number sequence stored at `cookpit.quantitativeFingerprint.sequence`.

Implementations MUST follow this profile exactly. Two independent implementations using this profile against the same source recipe MUST produce the byte-identical sequence string and therefore the byte-identical SHA-256 hash. The validator's `V-FINGERPRINT-B` hard criterion checks the sequence by recomputing it from the source text and comparing.

Stage scoping. This is the stage-1 source fingerprint. The AI Chef populates it at generation; the validator re-checks it at stage 2 as the source-faithfulness gate. It is computed from the source recipe, not from the cooking file. It is distinct from the file fingerprint at `cookpit.attestation.fileFingerprint`, which is the stage-3 file-content hash issued by the validator at attestation time and verified by stage-4 consumers. See `rules.md` A0.6 for the lifecycle separation between the two fingerprints.
1. What the fingerprint is
The strict quantitative fingerprint records every cooking-relevant number stated by the source recipe, in the order the source states it. The fingerprint exists so that:
- Two cooking files claiming to be derived from the same source produce the same fingerprint (a determinism check).
- A cooking file's source can be audited from its fingerprint (a content check).
- Library indexing can group recipes by source numeric skeleton regardless of tone, format, or chef-detective rewrite.
The fingerprint is NOT a hash of the file's planned tasks; it is a hash of the SOURCE recipe's stated numbers. The chef-detective's deductive expansion does not change the fingerprint.
2. The pipeline
source PDF / text
│
│ STAGE A — extraction
▼
plain UTF-8 text
│
│ STAGE B — segmentation
▼
ordered list of source segments (header / ingredient block / method / tips)
│
│ STAGE C — filtering
▼
in-scope segments only (header excluded; tips excluded; chrome filtered)
│
│ STAGE D — number tokenisation
▼
ordered list of active numbers
│
│ STAGE E — sequence rendering
▼
dash-joined string `<n1>-<n2>-…-<nk>`
│
│ STAGE F — hashing
▼
SHA-256 hex digest
Each stage is mechanically defined below.
3. STAGE A — text extraction
Implementations MAY extract source text from PDF, web page, plain text, or other carriers. Whatever the carrier, the extraction MUST produce a single UTF-8 string with the following normalisations:
- Ligatures expanded. PDF ligatures `ﬁ`, `ﬂ`, `ﬀ`, `ﬃ`, `ﬄ`, `ﬅ`, `ﬆ` (U+FB00..U+FB06) are expanded to their component letters (`fi`, `fl`, `ff`, `ffi`, `ffl`, `st`, `st`).
- Curly quotes folded. `‘` (U+2018), `’` (U+2019) → `'` (U+0027); `“` (U+201C), `”` (U+201D) → `"` (U+0022). Source apostrophes become straight quotes for tokenisation; the original may be preserved in `recipeInstructions[]` schema.org pass-through.
- Soft hyphens removed. U+00AD soft hyphens are stripped; the text on either side is concatenated.
- Whitespace normalised. All Unicode whitespace classes collapse to single ASCII space (U+0020); leading/trailing whitespace per segment trimmed.
- Diacritics preserved. Characters with diacritics (`pâté`, `Gruyère`, `purée`) are NOT stripped. Only the active-number tokeniser ignores letters; diacritics in identifying text matter for stage B segmentation.
The extraction is a one-shot operation per source. Re-extraction must produce byte-identical results given the same input.
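
The normalisations above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the name `normalise_segment` is hypothetical, and the sketch assumes the text is processed one segment (or line) at a time so that line structure survives for stage B.

```python
import re

# Character-level normalisations from the list above.
_CHAR_MAP = {
    # Ligatures U+FB00..U+FB06 -> component letters
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi",
    "\ufb04": "ffl", "\ufb05": "st", "\ufb06": "st",
    # Curly quotes -> straight quotes
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    # Soft hyphen: deleted, neighbours are concatenated
    "\u00ad": None,
}
_TABLE = str.maketrans(_CHAR_MAP)


def normalise_segment(text: str) -> str:
    """Apply the stage A normalisations to one segment of extracted text."""
    text = text.translate(_TABLE)
    # On str patterns, \s matches Unicode whitespace (including U+00A0).
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Diacritics pass through untouched because the translation table only names the characters listed above.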
4. STAGE B — segmentation
The extracted text is partitioned into ordered segments of four kinds:
| Segment kind | Definition |
|---|---|
| `header` | Title block plus author, prep/cook/total/serve/dietary metadata |
| `ingredient-block` | The recipe's bulleted or otherwise enumerated ingredient list |
| `method-block` | The numbered or otherwise sequenced cooking instructions |
| `tips-block` | Any "Recipe tips", "Notes", "Variations" section after the method |
Heuristics for segmentation:
- The `header` is everything from the start of the source to (but not including) the first ingredient line. Ingredient lines are identified by patterns like `<quantity><unit> of <noun>` or `<quantity> <noun>` matching the source's bulleted form.
- The `ingredient-block` ends at the first method-numbering pattern (`Method`, `1.`, `Step 1`, etc.) or at a bold/heading divider.
- The `method-block` ends at the first tips-style heading (`Recipe tips`, `Notes`, `Variations`, `Tips`) or at the end of the text.
- The `tips-block` includes everything after that heading until the end of the text.
Source-content categorisation (filtered in stage C) operates on segments individually, so segmentation accuracy matters.
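
A minimal sketch of these heuristics, assuming the extracted text has already been split into lines. The function name `segment` and the three regexes are illustrative assumptions, not the published implementation; the ingredient-line pattern uses the unit-required form described in §10.2, and bold/heading dividers are not modelled.

```python
import re

# Assumed line shapes; real sources need more permissive patterns.
_INGREDIENT_LINE = re.compile(
    r"^\s*[0-9]+\s*(?:kg|g|ml|l|cl|tbsp|tsp|oz|lb|pint|cm|in|mm)\b"
)
_METHOD_MARKER = re.compile(r"^\s*(?:Method\b|Step\s+1\b|1\.\s)")
_TIPS_HEADING = re.compile(r"^\s*(?:Recipe tips|Notes|Variations|Tips)\s*$")


def segment(lines):
    """Partition lines into the four ordered segment kinds."""
    n = len(lines)
    first_ing = next(
        (i for i, line in enumerate(lines) if _INGREDIENT_LINE.match(line)), n
    )
    first_method = next(
        (i for i in range(first_ing, n) if _METHOD_MARKER.match(lines[i])), n
    )
    first_tips = next(
        (i for i in range(first_method, n) if _TIPS_HEADING.match(lines[i])), n
    )
    return {
        "header": lines[:first_ing],
        "ingredient-block": lines[first_ing:first_method],
        "method-block": lines[first_method:first_tips],
        "tips-block": lines[first_tips:],
    }
```

Each boundary search starts where the previous segment began, so the four slices always cover the input in order without overlap.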
5. STAGE C — filtering
Only in-scope segments contribute to the active-number sequence:
| Segment kind | In scope? |
|---|---|
| `header` | NO: yield, prep time and cook time are metadata, not active cooking |
| `ingredient-block` | YES |
| `method-block` | YES |
| `tips-block` | NO: tips are out-of-band guidance, not active cooking |
Within in-scope segments, additional filters:
- Sponsored content is filtered. Lines that read as advertising (e.g. `BECOMEAMEMBER`, `Try our app`, `Finish Ultimate Plus`, product placement) are removed before tokenisation. Implementations SHOULD maintain a configurable filter list of brand-name patterns; the default list at v3.2.0 publication is:
  - `BECOMEAMEMBER`
  - `Try our app`
  - `You have <N> remaining read(s) today`
  - `Finish Ultimate Plus`

  Future revisions may add patterns; the version `cookpit-active-number-sequence-v3.2.0` freezes the v3.2.0 list above.
- Paywall chrome is filtered. Lines like `You have two remaining reads today` (and similar paywall markers from web-extracted PDFs) are removed.
- Source typos in numeric content are NOT silently corrected. If the source says `1000ml of pork stock` twice in the ingredient block, both `1000` instances enter the sequence. The chef-detective corrects typos in `tasks[].action` text but the fingerprint sees the source's numbers as written.
- Source typos in non-numeric content (carbonara's "plump garlic", chow mein's "in the work") have no effect on the sequence: they are not numbers.
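
The scope and chrome filters can be sketched as below. The name `filter_segments`, the dictionary shape of `segments`, and the regex renderings of the four default patterns are assumptions made for illustration.

```python
import re

IN_SCOPE = ("ingredient-block", "method-block")

# Default v3.2.0 chrome patterns; the regex forms are an assumption.
CHROME_PATTERNS = [
    re.compile(r"BECOMEAMEMBER"),
    re.compile(r"Try our app"),
    re.compile(r"You have \w+ remaining reads? today"),
    re.compile(r"Finish Ultimate Plus"),
]


def filter_segments(segments):
    """Return in-scope lines in source order, with chrome lines removed."""
    kept = []
    for kind in IN_SCOPE:
        for line in segments.get(kind, []):
            if not any(p.search(line) for p in CHROME_PATTERNS):
                kept.append(line)
    return kept
```

Nothing here inspects or rewrites numbers, so repeated or mistyped quantities in the source reach stage D as written.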
6. STAGE D — number tokenisation
Within filtered in-scope text, the tokeniser walks left-to-right and emits an active-number token for each match of one of the following patterns. Patterns are matched in priority order (top to bottom); a match consumes its source span and the walk resumes immediately after it.
6.1 Patterns (priority order)
| # | Pattern | Tokenisation | Example |
|---|---|---|---|
| 1 | `<int>g/<int>lb <int>oz` | three tokens: int, lb-int, oz-int | |
| 2 | `<int>kg/<int>lb <int>oz` | three tokens: int, lb-int, oz-int (a decimal metric part contributes its fractional digits as a further token) | `1.6kg/3lb 8oz` → `1, 6, 3, 8` |
| 3 | `<int>(\.<int>)?(kg\|g\|ml\|l\|cl)/<int><frac>?(oz\|fl oz\|pint\|in\|cm)` | metric whole + fractional digits, then imperial whole + fractional digits | `40g/1½oz` → `40, 1, 5` |
| 4 | `<int><frac>` (whole + Unicode fraction) | two tokens: whole, fraction-digit | `1½` → `1, 5` |
| 5 | `<int>/<int>` (literal fraction) | two tokens: numerator, denominator | `1/4` → `1, 4` |
| 6 | `<int>(\.<int>)` | two tokens: whole, decimal-fraction-digits | `1.5` → `1, 5` |
| 7 | `<int>-<int>` (range) | two tokens: range-low, range-high | `2-3 minutes` → `2, 3` |
| 8 | `<int>` (bare integer in cooking context) | one token: int | `5 minutes` → `5` |
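
The left-to-right walk can be sketched with a single regex alternation, since Python tries alternatives in order at each position. This is a simplified illustration under stated assumptions: the name `tokenise` is hypothetical, only patterns 4 to 8 are implemented, the fraction digits are copied from §6.2, and standalone Unicode fractions and cooking-context checks are not modelled. For the two dual-unit examples in the table, the simpler patterns happen to yield the same tokens.

```python
import re

# Fraction digits copied from the §6.2 table.
FRACTION_DIGITS = {
    "¼": "25", "½": "5", "¾": "75", "⅓": "33", "⅔": "66",
    "⅕": "2", "⅖": "4", "⅗": "6", "⅘": "8", "⅙": "16",
    "⅚": "83", "⅛": "125", "⅜": "375", "⅝": "625", "⅞": "875",
}
_FRAC = "[" + "".join(FRACTION_DIGITS) + "]"

# Alternatives are listed in priority order (patterns 4 to 8).
_NUMBER = re.compile(
    r"(?P<mixed>[0-9]+)(?P<frac>" + _FRAC + r")"   # 4: whole + Unicode fraction
    r"|(?P<num>[0-9]+)/(?P<den>[0-9]+)"            # 5: literal fraction
    r"|(?P<whole>[0-9]+)\.(?P<dec>[0-9]+)"         # 6: decimal
    r"|(?P<low>[0-9]+)-(?P<high>[0-9]+)"           # 7: range
    r"|(?P<int>[0-9]+)"                            # 8: bare integer
)


def tokenise(text):
    """Emit active-number tokens for filtered, in-scope text."""
    tokens = []
    for m in _NUMBER.finditer(text):
        if m["frac"]:
            tokens += [m["mixed"], FRACTION_DIGITS[m["frac"]]]
        elif m["den"]:
            tokens += [m["num"], m["den"]]
        elif m["dec"]:
            tokens += [m["whole"], m["dec"]]
        elif m["high"]:
            tokens += [m["low"], m["high"]]
        else:
            tokens.append(m["int"])
    return tokens
```

Calling `tokenise("40g/1½oz")` yields `["40", "1", "5"]`. Structural numbers such as step markers must already have been removed, because this sketch treats every digit run it sees as an active number.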
6.2 Unicode fraction normalisation
| Symbol | Tokenises as |
|---|---|
| ¼ | 25 |
| ½ | 5 |
| ¾ | 75 |
| ⅓ | 33 |
| ⅔ | 66 |
| ⅕ | 2 |
| ⅖ | 4 |
| ⅗ | 6 |
| ⅘ | 8 |
| ⅙ | 16 |
| ⅚ | 83 |
| ⅛ | 125 |
| ⅜ | 375 |
| ⅝ | 625 |
| ⅞ | 875 |
Rule: take the decimal expansion of the fraction, drop the leading
`0.`, and emit the remaining digits without trailing zeros.
Non-terminating expansions are truncated, not rounded, to two digits,
as the table shows (⅔ → 66, ⅙ → 16). The above table is exhaustive
for fractions appearing in the source corpus; fractions outside this
table are tokenised as `<numerator>`, `<denominator>` per pattern #5.
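
The table entries can be re-derived by long division, as in the sketch below. The function name `fraction_digits` is hypothetical, and the two-digit cut-off for non-terminating expansions is an assumption inferred from the table entries for ⅓, ⅔, ⅙ and ⅚; the table itself remains normative.

```python
from fractions import Fraction


def fraction_digits(numerator: int, denominator: int) -> str:
    """Digits after '0.' for a proper fraction, per the rule above."""
    frac = Fraction(numerator, denominator)
    # A decimal expansion terminates iff the reduced denominator
    # has no prime factors other than 2 and 5.
    d = frac.denominator
    for p in (2, 5):
        while d % p == 0:
            d //= p
    terminating = d == 1

    digits = []
    rem = frac.numerator % frac.denominator
    # Long division: stop at remainder 0, or after two digits
    # when the expansion never terminates (assumed cut-off).
    while rem and (terminating or len(digits) < 2):
        rem *= 10
        digits.append(str(rem // frac.denominator))
        rem %= frac.denominator
    return "".join(digits)
```

Because a terminating division stops when the remainder reaches zero, the result never carries trailing zeros.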
6.3 Cooking context
A bare integer (pattern #8) is tokenised as an active number IF and ONLY IF it sits within a cooking context. A cooking context is any of:
- An ingredient line (in `ingredient-block`).
- A method sentence in `method-block`.
- A duration phrase (`<n> minutes`, `<n> hours`, `<n> seconds`).
- A temperature phrase (`<n>°C`, `<n>C`, `<n>°F`, `<n>F`, `gas mark <n>`).
- A multiplicity phrase (`<n> batches`, `<n> times`, `<n>×`, `<n> sides`).
- A fraction-of phrase (`a third`, `half the`, `quarter of`, `three-quarters`).
- A size phrase (`<n>cm`, `<n>in`).
Numbers in non-cooking context (e.g. 4.9 ratings, 27 ratings,
Page 1 of 3, ISBN/SKU references) are not tokenised. Most page
metadata appears in header or after the method, both already
filtered in stage C.
6.4 Yield handling
The source's `Serves N`, `Serves N-M` and `Makes N` patterns sit in
the `header` segment and are therefore filtered in stage C. Yield
numbers do NOT enter the active-number sequence.
6.5 Source typo handling for numbers
Per stage C rule 3: source numeric typos are preserved verbatim. If
the source has `100g/3½oz` in the ingredient block AND `100g pancetta`
in the method (carbonara pattern), both `100`s are tokenised: the
fingerprint records every occurrence of a number the source states,
not each distinct quantity once.
7. STAGE E — sequence rendering
The ordered list of tokens from stage D is rendered as a dash-separated string:
sequence = "<token1>-<token2>-…-<tokenN>"
Tokens are decimal integer strings, no leading zeros (except for literal zero, which is not expected to appear in the corpus — fractions handle the leading-zero cases).
The sequence string MUST match the regex ^[0-9]+(-[0-9]+)*$
(rule K4 / V-FINGERPRINT-A schema check).
8. STAGE F — hashing
The sequence string is encoded as UTF-8 (no BOM, no trailing newline)
and hashed with SHA-256 (FIPS 180-4). The 64-character lowercase
hex digest is the value of cookpit.quantitativeFingerprint.hash.value.
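
Stages E and F are small enough to sketch together. The function names `render_sequence` and `hash_sequence` are illustrative; `re.fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import hashlib
import re

_SEQUENCE_RE = re.compile(r"[0-9]+(-[0-9]+)*")


def render_sequence(tokens):
    """Stage E: join tokens with dashes and check the K4 shape."""
    sequence = "-".join(tokens)
    if not _SEQUENCE_RE.fullmatch(sequence):
        raise ValueError(f"not a valid active-number sequence: {sequence!r}")
    return sequence


def hash_sequence(sequence: str) -> str:
    """Stage F: SHA-256 of the UTF-8 bytes, as lowercase hex."""
    # encode() adds no BOM and no trailing newline.
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()
```

An empty token list renders as the empty string, which fails the shape check and raises rather than producing a hash.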
9. Worked example: recipes/spaghetti_carbonara_recipe.pdf
Source: Angela Nilsen, BBC Good Food, "Ultimate spaghetti carbonara".
Stage A — extraction
After ligature/quote/whitespace normalisation, the in-scope source text contains the ingredient block:
100g pancetta
50g pecorino cheese
50g parmesan
3 large eggs
350g spaghetti
2 plump garlic cloves, peeled and left whole
50g unsalted butter
sea salt and freshly ground black pepper
and the method block (Steps 1-12, abridged here for the parts that contribute numbers):
Step 2: Finely chop the 100g pancetta ... Finely grate 50g pecorino
cheese and 50g parmesan ...
Step 3: Beat the 3 large eggs ...
Step 4: Add 1 tsp salt to the boiling water, add 350g spaghetti and
when the water comes back to the boil, cook at a constant
simmer, covered, for 10 minutes or until al dente ...
Step 5: Squash 2 peeled plump garlic cloves ...
Step 6: ... Drop 50g unsalted butter into a large frying pan or wok ...
Step 7: Leave to cook on a medium heat for about 5 minutes ...
Stage B — segmentation
| Segment kind | Lines |
|---|---|
| header | "Ultimate spaghetti carbonara recipe / Angela Nilsen / Serves 4 Easy / Prep: 15 mins - 20 mins Cook: 15 mins / Discover how to make traditional…" |
| ingredient-block | the 8 ingredient lines above |
| method-block | Steps 1-12 |
| tips-block | (none in this source) |
Stage C — filtering
Header and tips removed. In-scope = ingredient-block + method-block.
Stage D — tokenisation
| Source span | Pattern | Tokens |
|---|---|---|
| `100g pancetta` (ingredient line) | #8 | 100 |
| `50g pecorino cheese` (ingredient line) | #8 | 50 |
| `50g parmesan` (ingredient line) | #8 | 50 |
| `3 large eggs` (ingredient line) | #8 | 3 |
| `350g spaghetti` (ingredient line) | #8 | 350 |
| `2 plump garlic cloves` (ingredient line) | #8 | 2 |
| `50g unsalted butter` (ingredient line) | #8 | 50 |
| (sea salt and pepper: no numbers) | - | - |
| `100g pancetta` (Step 2 method) | #8 | 100 |
| `50g pecorino` (Step 2 method) | #8 | 50 |
| `50g parmesan` (Step 2 method) | #8 | 50 |
| `3 large eggs` (Step 3 method) | #8 | 3 |
| `1 tsp salt` (Step 4 method) | #8 | 1 |
| `350g spaghetti` (Step 4 method) | #8 | 350 |
| `10 minutes` (Step 4 method) | #8 | 10 |
| `2 peeled plump garlic cloves` (Step 5) | #8 | 2 |
| `50g unsalted butter` (Step 6 method) | #8 | 50 |
| `5 minutes` (Step 7 method) | #8 | 5 |
Stage E — sequence rendering
100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5
17 tokens, matching the regex ^[0-9]+(-[0-9]+)*$.
Stage F — hashing
sha256("100-50-50-3-350-2-50-100-50-50-3-1-350-10-2-50-5") =
5d1ce74bfc00489e677ab7e321a818eade01ea64fe46d0aaf67d506f7d1ceda8
This matches the value at
cookpit.quantitativeFingerprint.hash.value in the published
spaghetti_carbonara.v3.2.cpt.A.jsonld example at
https://cookchow.com/recipes/3.2/.
10. Worked clarifications surfaced by productisation
The following clarifications were surfaced when the executable
implementation in scripts/lib/source_tokeniser.py was first run
against the corpus. Each is a spec tightening that resolves an
ambiguity in §3–§6 without changing the published intent. The §9
worked example and the §11 self-test corpus remain canonical; this
section documents the edge cases the productised tokeniser must
handle to satisfy them.
10.1 Stage B — heading-vs-line-shape priority for the ingredient-block boundary
The two segmentation strategies in §4 — explicit Ingredients
heading vs. line-shape detection of the first ingredient line — can
disagree when a source's PDF carries an Ingredients heading that
appears AFTER the first method marker (Method, Step 1, etc.) due
to layout-driven extraction order. BBC Good Food's carbonara PDF is
the canonical example: pypdf.extract_text reproduces the visual
block order, in which Ingredients prints at the bottom of page 1
AFTER the method heading and Step 1 banner have appeared.
Rule: the Ingredients heading marks the ingredient-block
boundary IF AND ONLY IF it appears earlier in the extracted text
than the first method marker. When the heading appears after the
first method marker, treat the heading as a layout artefact (NOT a
boundary) and fall back to line-shape detection of the first
ingredient line.
This clarification preserves both interpretations of §4: the heading is preferred when present and unambiguously placed; the line-shape detector is the fallback. The corpus's §9 carbonara worked example resolves correctly under this rule.
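
The placement rule reduces to an index comparison, sketched below. The function name `heading_marks_boundary` and both regexes are assumptions for illustration; the actual markers used by `scripts/lib/source_tokeniser.py` are not reproduced here.

```python
import re

_INGREDIENTS_HEADING = re.compile(r"^\s*Ingredients\s*$", re.IGNORECASE)
_METHOD_MARKER = re.compile(r"^\s*(?:Method\b|Step\s+1\b)", re.IGNORECASE)


def heading_marks_boundary(lines):
    """True iff an Ingredients heading precedes the first method marker."""
    heading = next(
        (i for i, line in enumerate(lines) if _INGREDIENTS_HEADING.match(line)),
        None,
    )
    if heading is None:
        return False  # no heading: caller falls back to line-shape detection
    method = next(
        (i for i, line in enumerate(lines) if _METHOD_MARKER.match(line)),
        None,
    )
    return method is None or heading < method
```

When this returns `False`, the segmenter would ignore the heading and locate the first ingredient line by shape instead.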
10.2 Stage B — line-shape detector requires an explicit unit
Header content can sometimes match a permissive ingredient-line
regex. Jamie Oliver's spatchcock chicken header reads
`1 HR NOT TOO TRICKY SERVES 4`: two integers in non-ingredient
context (a stylised time-and-difficulty banner). A line-shape
detector that accepts `<digit>+ <noun>` would misclassify this as
an ingredient line and pull `1` and `4` into the active-number
sequence.
Rule: the line-shape ingredient detector REQUIRES an explicit
metric or imperial unit on the line (kg, g, ml, l, cl,
tbsp, tsp, oz, lb, pint, cm, in, mm). A line
matching ^\s*<quantity>\s+<noun> without a unit does NOT count
as an ingredient line under the line-shape fallback. Sources whose
ingredient lines carry no unit (rare but possible — e.g. 3 large eggs if printed without a leading Ingredients heading) must
publish the heading; line-shape alone cannot reliably distinguish
these from header content.
The corpus's §9 carbonara and §11 self-test entries all satisfy this rule.
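
A unit-requiring detector might look like the sketch below. The name `looks_like_ingredient_line` is hypothetical, and the regex makes a stricter assumption than the rule text: the unit must directly follow the quantity (optionally after whitespace) rather than appear anywhere on the line.

```python
import re

# Unit list taken from the rule above.
_UNITS = ("kg", "g", "ml", "l", "cl", "tbsp", "tsp",
          "oz", "lb", "pint", "cm", "in", "mm")

_INGREDIENT_LINE = re.compile(
    r"^\s*[0-9]+(?:[./][0-9]+)?\s*(?:" + "|".join(_UNITS) + r")\b"
)


def looks_like_ingredient_line(line: str) -> bool:
    """Line-shape fallback: a quantity followed by an explicit unit."""
    return _INGREDIENT_LINE.match(line) is not None
```

The trailing word boundary keeps a unit such as `l` from matching the first letter of `large`, so `3 large eggs` is rejected while `100g pancetta` is accepted.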
10.3 Stage C — bare-digit step trailers in the method-block
Some sources print step numbers as headings preceding each
paragraph (`Step 1`, `Step 2`, …); others print them as trailers
following each paragraph (a single digit on its own line, used as
a paragraph-end marker). Jamie Oliver's spatchcock chicken uses
the trailer form. Both forms are method-block structural
scaffolding and contribute no active numbers; the §9 worked
example confirms this implicitly (carbonara has 12 method steps,
and none of the step numbers appears in the 17-token sequence).
Rule: within the method-block ONLY, lines containing exactly
one or two digits with no other content are step-trailer markers
and are stripped before Stage D tokenisation. The width
restriction (1–2 digits) is deliberate: a quantity like 350 on
its own line is part of the BBC Good Food stacked ingredient
layout and must NOT be stripped.
The width restriction is sufficient for the corpus's range of
recipe sizes (no recipe stores 100+ steps). Future sources with
more steps would need either heading-form numbering (Step 100,
caught by the existing _STEP_MARKER regex) or numbered-list form
(100. <text>, caught by _NUMBERED_STEP_MARKER). Bare three-
digit trailers are not currently anticipated; the spec can extend
the rule if a real source emerges.
10.4 Step-marker stripping is method-block-scoped
The strip-step-markers operation runs on the method-block ONLY,
not on the joined ingredient + method text. The distinction
matters because the BBC Good Food stacked ingredient layout
places quantities (100, 50, 350, etc.) on their own lines,
and a global strip-bare-digit-trailers pass would consume those
quantities.
Rule: the step-marker stripping (heading form, numbered-list form, and bare-digit trailer form) applies to the method-block segment as part of Stage C filtering. The ingredient-block segment is NOT subject to step-marker stripping; its bare-digit lines are preserved for Stage D tokenisation.
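
The three marker forms can be stripped as sketched below. The regexes are illustrative guesses and are deliberately given different names from `_STEP_MARKER` and `_NUMBERED_STEP_MARKER`, whose real definitions are not shown in this document; the function `strip_step_markers` is likewise hypothetical. It is meant to receive method-block lines only.

```python
import re

_STEP_HEADING = re.compile(r"^\s*Step\s+[0-9]+\s*:?\s*", re.IGNORECASE)
_NUMBERED_STEP = re.compile(r"^\s*[0-9]+\.\s+")
_BARE_TRAILER = re.compile(r"^\s*[0-9]{1,2}\s*$")  # one or two digits only


def strip_step_markers(method_lines):
    """Remove step scaffolding from method-block lines before stage D."""
    cleaned = []
    for line in method_lines:
        if _BARE_TRAILER.match(line):
            continue  # trailer form: the whole line is a step number
        line = _STEP_HEADING.sub("", line, count=1)
        line = _NUMBERED_STEP.sub("", line, count=1)
        if line.strip():
            cleaned.append(line)
    return cleaned
```

A three-digit line such as `350` survives because the trailer pattern is capped at two digits, and ingredient-block lines are never passed to this function at all.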
11. Tokeniser self-test corpus — source PDFs
This table records the tokeniser's expected output across nine
source-recipe PDFs in /recipes/. It is a self-test for
implementations of cookpit-active-number-sequence-v3.2.0, not a
list of authored cooking files. An implementation that produces
matching sequences for all nine MUST also produce matching hashes.
Five authenticated cooking files are published at
https://cookchow.com/recipes/3.2/: spaghetti_carbonara,
perfect_boeuf_bourguignon, pork_three_ways (authored as
pork-fillet-braised-cheeks-and-pork-belly),
authentic_hungarian_goulash and roast-chicken-with-cider-and-sage.
Only the first three have source PDFs in the table below. The
remaining six table entries (rhubarb_apple_crumble,
fish_pie_cheese_mash, smoked_salmon_watercress_pate, lemon_pavlova,
ham_hock_yellow_bean_sauce_chow_mein, salmon_pasta_bake) exist
as source PDFs only and are retained here as tokeniser test
coverage.
| Source PDF | Sequence length | First 8 tokens |
|---|---|---|
| spaghetti_carbonara | 17 | 100-50-50-3-350-2-50-100 |
| rhubarb_apple_crumble | 16 | 450-3-350-3-1-120-200-1 |
| fish_pie_cheese_mash | 53 | 400-14-1-2-500-1-2-40 |
| smoked_salmon_watercress_pate | 29 | 100-3-5-4-150-5-5-350 |
| lemon_pavlova | 26 | 6-375-13-2-5-2-300-10 |
| perfect_boeuf_bourguignon | 54 | 1-6-3-8-4-5-200-7 |
| ham_hock_yellow_bean_sauce_chow_mein | 48 | 800-1-12-5-5-2-1-2 |
| salmon_pasta_bake | 35 | 750-1-25-120-4-5-50-1 |
| pork_three_ways | 54 | 4-9-45-1-4-1000-4-2 |
For source PDFs with an active cooking file, the full sequence and
hash strings are at cookpit.quantitativeFingerprint.{sequence, hash.value} in the file.
12. Conformance
A v3.2 file conforms to cookpit-active-number-sequence-v3.2.0 if
and only if:
- Its `cookpit.quantitativeFingerprint.normalization` field equals `cookpit-active-number-sequence-v3.2.0`.
- Its `sequence` value matches the tokens produced by stages A-E applied to the source recipe.
- Its `hash.value` is the lowercase SHA-256 hex digest of the sequence string per stage F.
The validator's V-FINGERPRINT-A checks shape (and self-consistency
of hash against sequence). V-FINGERPRINT-B re-extracts the
sequence from the source recipe per this profile and compares.
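
The three conformance conditions can be combined into one check, sketched below. The nested-dictionary layout of `doc` is an assumption derived from the dotted field paths in this document, `check_fingerprint` is a hypothetical name, and `source_tokens` stands for the output of a conforming stage A-D pipeline run on the source recipe.

```python
import hashlib
import re

PROFILE = "cookpit-active-number-sequence-v3.2.0"
_SEQUENCE_RE = re.compile(r"[0-9]+(-[0-9]+)*")


def check_fingerprint(doc, source_tokens):
    """Return True iff the file's fingerprint satisfies all three conditions."""
    fp = doc["cookpit"]["quantitativeFingerprint"]
    sequence = fp["sequence"]
    expected_hash = hashlib.sha256(sequence.encode("utf-8")).hexdigest()
    return (
        fp["normalization"] == PROFILE
        and _SEQUENCE_RE.fullmatch(sequence) is not None  # shape check
        and sequence == "-".join(source_tokens)           # recomputed from source
        and fp["hash"]["value"] == expected_hash          # hash self-consistency
    )
```

The shape and hash comparisons correspond to what `V-FINGERPRINT-A` covers, while the comparison against `source_tokens` is the part that `V-FINGERPRINT-B` adds.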