Determinism for agent teams

Replay every answer. Prove every source.

FreshGeo logs the exact upstream fetch, parse and scoring chain behind every response. Re-run any call by cache_id and get byte-identical output — or a diff you can audit.

Every response ships with a cache_id

Every call to a FreshGeo API returns a cache_id in the response envelope. That ID is a content hash of the inputs, the upstream fetch manifest, the parser version and the scoring weights. Pass it back to /replay and you get the same typed JSON object your agent saw the first time — even if the live web has since changed. This is what makes evals actually stable across CI runs, not just on your laptop.

```bash
curl https://api.freshgeo.com/v1/replay/ck_8f2a9b1e...
```
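FreshGeo does not publish the exact hash construction behind a cache_id, but the idea described above — a content hash over the inputs, the fetch manifest, the parser version and the scoring weights — can be sketched in a few lines. All field names here are assumptions for illustration:

```python
import hashlib
import json

def cache_id(inputs, fetch_manifest, parser_version, scoring_weights):
    """Sketch of a content-addressed ID: hash the canonical JSON of
    everything that determines the output. Identical ingredients
    always yield the identical ID."""
    canonical = json.dumps(
        {"inputs": inputs, "manifest": fetch_manifest,
         "parser": parser_version, "weights": scoring_weights},
        sort_keys=True, separators=(",", ":"))
    return "ck_" + hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Because the parser version is part of the hash, a parser upgrade produces a new ID rather than silently changing what an old ID returns.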

sources[] is not a citation, it is a receipt

Each typed field carries a sources array with the upstream URL, fetched_at timestamp, HTTP status, content hash and the CSS/XPath selector that produced the value. When legal asks where a price came from, you do not guess — you point at the exact DOM node on the exact archived snapshot. We keep the raw HTML for 30 days on Scale, 90 days on Enterprise, so regulators can see what the agent saw.
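A minimal sketch of what auditing against that receipt could look like, using the fields named above in a hypothetical envelope shape (the key names and example values are assumptions, not the documented schema):

```python
# Hypothetical field envelope; key names follow the description above.
price = {
    "value": 129.99,
    "sources": [{
        "url": "https://example.com/product/42",
        "fetched_at": "2024-05-01T09:30:00Z",
        "status": 200,
        "content_hash": "sha256:placeholder",
        "selector": "span.price > bdi",
    }],
}

def receipt(field):
    """One audit line per successfully fetched source: the exact URL,
    selector and timestamp that produced the value."""
    return [
        f'{s["url"]} @ {s["selector"]} (fetched {s["fetched_at"]})'
        for s in field["sources"] if s["status"] == 200
    ]
```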

Deterministic parsers, versioned explicitly

Parsers have semantic versions. A new selector for price.currency bumps pricing@1.4.2 to pricing@1.5.0. Old cache_ids keep resolving against their original parser — pinned forever. You can opt into auto-upgrade per endpoint, or freeze a parser version across an entire agent key.

```python
client = FreshGeo(api_key=key, parser_pins={'pricing': '1.4.2', 'intent': '2.0.1'})
```
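How the service resolves a parser version internally is not published; this is a sketch of the precedence the paragraph above implies — an explicit pin wins, an endpoint opted into auto-upgrade tracks the live parser, and everything else stays on the version recorded with the cache_id:

```python
def resolve_parser(endpoint, recorded, live, pins, auto_upgrade):
    """Pins win; opted-in endpoints track the live parser; everything
    else replays against the version recorded with the cache_id."""
    if endpoint in pins:
        return f"{endpoint}@{pins[endpoint]}"
    if endpoint in auto_upgrade:
        return f"{endpoint}@{live}"
    return f"{endpoint}@{recorded}"
```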

Golden sets without the scraping tax

Point our CLI at a CSV of queries and it will snapshot the full response plus cache_ids into a fixtures directory. Commit the fixtures; your CI replays them on every PR. No flaky network, no rate limits, no 3am alerts because someone's site rearranged their DOM.

```bash
freshgeo eval snapshot --input queries.csv --out tests/fixtures/
freshgeo eval replay tests/fixtures/ --assert-schema
```
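On the CI side, a sanity check over committed fixtures can be as small as this sketch. The envelope keys here are assumed, not the documented schema:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"cache_id", "data", "sources"}  # assumed envelope shape

def validate_envelope(fixture):
    """True when a fixture still carries the fields CI needs to replay it."""
    return REQUIRED_KEYS <= set(fixture)

def check_dir(fixtures_dir):
    """Map each *.json fixture to pass/fail, for a one-look CI report."""
    return {p.name: validate_envelope(json.loads(p.read_text()))
            for p in Path(fixtures_dir).glob("*.json")}
```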

Diffs are first-class

Replay a cache_id against the current live parser and you get a structured diff — which fields changed, which sources moved, which selectors broke. CI can fail a build when a diff crosses a threshold you set (say, more than 5% of queries changed price.amount). One customer caught a competitor silently A/B testing prices to bots four hours before their pricing agent would have mispriced 18,000 SKUs.
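A structured diff plus a threshold gate of the kind described above can be sketched generically — flatten both responses to dotted paths, then fail when too many queries touched a watched field (this is illustrative logic, not FreshGeo's diff format):

```python
def changed_fields(old, new, prefix=""):
    """Flatten two responses and return dotted paths whose values differ."""
    paths = []
    for key in sorted(old.keys() | new.keys()):
        path = prefix + key
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            paths += changed_fields(a, b, path + ".")
        elif a != b:
            paths.append(path)
    return paths

def gate(diffs_per_query, field, threshold):
    """Pass only while at most `threshold` of queries changed `field`."""
    hit = sum(1 for diff in diffs_per_query if field in diff)
    return hit / len(diffs_per_query) <= threshold
```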

Per-agent keys with hard caps

Every agent gets its own key with a hard daily spend cap, a per-endpoint RPS limit and an allowlist of which of the 7 APIs it can call. A runaway loop hits the cap and 429s; it does not drain your monthly budget at 2am.
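The cap semantics above reduce to a small amount of state per key. A minimal sketch (the class and status-code mapping are assumptions, not the service's implementation):

```python
class AgentKey:
    """One key per agent: hard daily spend cap plus an endpoint allowlist."""

    def __init__(self, daily_cap_usd, allowlist):
        self.daily_cap_usd = daily_cap_usd
        self.allowlist = allowlist
        self.spent = 0.0

    def charge(self, endpoint, cost_usd):
        """HTTP-style status: 403 off-allowlist, 429 over cap, 200 OK."""
        if endpoint not in self.allowlist:
            return 403
        if self.spent + cost_usd > self.daily_cap_usd:
            return 429
        self.spent += cost_usd
        return 200
```

The important property is that a runaway loop starts collecting 429s the moment the cap is hit; spend never passes the ceiling.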

What we do not do

We do not return free-text blobs and call them structured. We do not paper over upstream failures with a plausible-sounding hallucination. If a source is down, the field comes back null with a sources[].status of 503, plus a stale_from_cache flag if we served a previous value. Your agent decides whether stale is acceptable — we just tell you the truth about where the data came from.
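The agent-side half of that contract — decide, don't guess — can be sketched as a policy over the flags named above (field names per the paragraph; the decision logic is an assumed example, not library code):

```python
def accept(field, allow_stale=False):
    """Agent-side policy: take a fresh value; take a stale cached value
    only when the caller opts in; never invent one when the source was
    down and the field came back null."""
    if field.get("stale_from_cache"):
        return field["value"] if allow_stale else None
    return field["value"]
```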

FAQ

Questions eval teams ask

How long are cache_ids valid?

Forever on Enterprise, 12 months on Scale, 30 days on Starter. Replay always works within the window, even after parser upgrades.

Do replays count against my quota?

Replays are billed at 10% of a live call. Most eval-heavy customers spend under 4% of total usage on replays.

Can I export the raw HTML snapshot?

Yes, on Scale and above. GET /v1/snapshots/{cache_id} returns a signed URL to the gzipped HTML plus the response headers we received.

What happens when a parser version is deprecated?

We email 90 days ahead, keep the parser serving replays for another 12 months, then freeze to a static artefact you can run locally via our open-source replay binary.

Is this SOC 2 compliant?

SOC 2 Type II is in progress, with the report expected in Q3. Type I is complete; we are happy to share it under NDA.

How is this different from LangSmith or Langfuse?

Those trace your agent. We make the underlying data calls themselves replayable. Use both — the trace tells you what your agent did, our cache_ids tell you whether the world actually looked like that.