Tutorial

February 25, 2026

Beyond the Scoreboard: Building a Live Olympics 2026 Incident and Medal Dashboard with CatchAll

NewsCatcher

We were brainstorming what to build with our LangChain + CatchAll integration when the Winter Olympics came up. We were already following the Games and bantering about sports during meetings and chats, but something felt off.

The results were easy to find. Everything else wasn't. Incidents, updates, and context were scattered across the web. You could see what happened briefly or was reported, but not why it mattered or how events connected.

That was our eureka. We decided to build a dashboard that brings everything together into something we can actually query, explore, and understand, making it much easier to follow and analyse the Games beyond just the scoreboard.

The Backend Pipeline: How the System Works

The backend converts scattered web coverage into structured incident intelligence and medal tracking. Instead of browsing articles, we can answer operational questions in seconds: what incidents are happening right now, which ones are high impact, who is affected, and what are the effect medal standings.

The pipeline has five stages.

Collect: ingests Olympics coverage via Langchain-CatchAll on-demand or on a schedule.
Extract: uses schema-based LLM extraction to pull structured incidents and medal facts from raw text.
Reconcile: merges duplicate reports into canonical records, keeping only the highest-confidence version with all supporting sources attached.
Aggregate: compiles cleaned data into optimised tables for leaderboards and trend charts.
Serve and alert: exposes API endpoints and generates notifications for rank changes, first gold medals, and high-impact incident spikes.

Each stage maps to a single service file, so it can be run, tested, and scaled independently.

Step 1: Set Up Environment Variables

You need access to two services: CatchAll for data discovery and OpenRouter for structured extraction. Create a .env file in your project root with your API keys and the following config:

CATCHALL_API_KEY="your_key_here"

OPENROUTER_API_KEY="your_key_here"
OPENROUTER_MODEL=openai/gpt-4o-mini
OPENROUTER_ROUTER_MODEL=openai/gpt-4o-mini
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_TIMEOUT_SECONDS=90
OPENROUTER_MAX_RETRIES=2
OPENROUTER_REQUIRE_PARAMETERS=true
SITE_URL=http://localhost:3000
SITE_NAME=Olympics2026Dashboard
POLL_INTERVAL_SECONDS=30
MAX_WAIT_SECONDS=2400
EXTRACTION_MAX_EXCERPT_CHARS=5000
EXTRACTION_MIN_CONFIDENCE=0.45
ENABLE_INTERNAL_SCHEDULER=true
SCHEDULER_FULL_CYCLE_TIME_1=00:00
SCHEDULER_FULL_CYCLE_TIME_2=18:00
API_HOST=0.0.0.0
API_PORT=8000
FRONTEND_ORIGIN=http://localhost:3000
DATA_DIR=data

CatchAll limits calls per second and total quota depending on your plan, so bake in a configurable pause between page calls. EXTRACTION_MIN_CONFIDENCE controls how strict the LLM extraction is. 0.45 is a reasonable starting point that balances recall against noise.

Step 2: Storage and Data Layout

The pipeline is storage-first. Every stage writes to disk in a predictable structure across four layers: raw/ for append-only collection logs, clean/ for canonical deduplicated records, agg/ for precomputed dashboard tables, and index/ for incremental processing state.

At startup, ensure_storage() creates all required folders and CSV files with the correct headers. This matters because your API routes read these files as their source of truth; treating headers as part of the contract keeps the whole pipeline predictable and easy to debug.

Step 3: Collect with CatchAll

Instead of building custom scrapers for each site, we use langchain-catchall to query the web and receive structured records directly. This lets us scale coverage across thousands of sources without individual site integrations.

The query design matters here. We run two sets of targeted searches — one for medal results, one for incidents, written specifically to pull different types of coverage:

Medal queries target official results tables, podium finishes by sport, and daily results by date. Incident queries target disqualifications, appeals, doping violations, eligibility rulings, injuries, and records broken.

MEDAL_QUERIES = [
    "Milano Cortina 2026 Winter Olympics official medal table gold silver bronze by country",
    "Milano Cortina 2026 Winter Olympics medal winners by event and discipline",
    ...
]

INCIDENT_QUERIES = [
    "Milano Cortina 2026 Winter Olympics disqualification disqualified athlete coach team",
    "Milano Cortina 2026 Winter Olympics doping ban suspension anti-doping violation",
    ...
]

Every record is normalized into a consistent structure, title, URL, excerpt, and raw before storage. URLs are normalized so the same article with different formatting doesn't slip through as a new record later.

Before extraction: deduplicate by URL

Before passing anything to the LLM, filter out URLs you've already processed. Extraction costs money; there's no reason to pay twice for the same article.

We maintain data/index/seen_urls.csv as a persistent index. New URLs get logged with a timestamp and move forward to extraction. Repeated URLs increment a hit counter and get skipped. This runs before every extraction cycle, so the API quota only goes toward genuinely new coverage.

Step 4: Extract and Structure with Schema-Based Extraction

Raw records contain useful information, but they're unstructured. A record might mention "athlete disqualified" without telling us the sport, country, or severity. This step fixes that.

Each record goes through two LLM passes. The balanced pass extracts all explicit facts and optimises for recall. If that fails, the strict pass kicks in — it drops anything uncertain and optimises for precision. If both fail, the record goes to quarantine.

Every accepted incident ends up with: incident_type, sport, affected_entity_name, country_code, impact_level, confidence, and source_url. Medal rows require event, sport, winner_name, and a valid IOC country code. Anything below the confidence threshold or missing required fields is rejected.

The JSON schemas in app/schemas/ are doing more work than they appear to. Every extraction call is validated against them before anything touches disk. If the model returns a row that doesn't match the schema, it's rejected at the schema level, not by application logic. This keeps the downstream pipeline clean by design.

Records that fail both passes don't get deleted. They go to data/raw/quarantine_incidents.jsonl and data/raw/quarantine_medals.jsonl for review. This gives you a full audit trail and a way to improve your prompts over time based on what the model struggled with.

Step 5: Reconcile and Deduplicate

Multiple publishers often cover the same incident or medal. Without reconciliation, the same event appears dozens of times with slightly different wording.

Each incident gets a deterministic ID hashed from its core fields: games, incident type, sport, event, affected entity, and date. The same fields always produce the same ID regardless of which publisher reported it. Medal records hash on games, sport, event, medal type, winner name, and country code.

When a new record arrives, the reconciler checks if a canonical version already exists. It only replaces it if the new record has equal or higher confidence. Either way, the source is always logged. The result is one row per canonical incident in clean/incidents_2026.csv, with every reporting outlet linked in a separate sources table. Every incident card on the dashboard can show exactly which outlets reported it and when.

Step 6: Aggregate

Once medals are reconciled, we precompute three tables so the UI never processes raw data on the fly. agg/tally_country.csv is the main leaderboard, ranked by gold then silver then bronze, following the official Olympic tiebreaker. agg/tally_country_sport.csv breaks medals down by country and sport. agg/tally_daily.csv tracks medals per country per day for trend charts.

rebuild_aggregates runs at the end of every full cycle and can also be called independently via run_refresh_all — so you can recalculate leaderboards from existing clean data without hitting the API again.

Step 7: Serve, Alert, and Chat

Notifications fire automatically after every cycle. Three things trigger an alert: a country changing rank, a country winning its first gold, or a country gaining two or more medals since the last refresh. High-impact incidents and sensitive types — disqualifications, doping, appeals — generate their own alerts regardless of medal movement. Every notification gets a deterministic ID so duplicates never accumulate.

The chat interface is the most interesting part. Instead of rigid query filters, chat_router.py routes plain English questions to the right data handler using a two-step LLM pipeline. The first call classifies intent — tally, event lookup, incident lookup, research, or small talk. The second call rewrites the structured response into clean markdown with follow-up questions. If the model router fails, a lightweight fallback reads intent from context and keyword patterns so the chat never goes dark.

All LLM calls go through openrouter_structured.py, which validates every response against a JSON schema before it touches the application. Temperature is set to 0 across all extraction and routing calls — you want deterministic outputs, not creative ones.

Step 8: Putting it all Together

Each service works independently. pipeline.py is the conductor.

run_medals_cycle runs the full medals flow: collect → dedupe → extract → reconcile → aggregate → notify. run_incidents_cycle does the same for incidents. run_refresh_all skips collection entirely and just rebuilds aggregates and notifications from existing clean data. run_full_cycle calls all three in sequence — this is what the scheduler triggers at midnight and 18:00 daily.

Every cycle returns a stats dict: records collected, new, skipped, extracted, quarantined, and merged. The whole pipeline is observable, not a black box.

The full code is available on GitHub. Clone it, add your .env, and run python jobs/run_full_cycle.py to kick off your first cycle. The frontend connects to the same API and is in the same repository.

Also interesting

all articles

Black thin grid lines forming diamond-shaped pattern on a white background.

Product

February 3, 2026

Google found 69 results. We found 3,261. Here's how

NewsCatcher

Company

January 28, 2026

Why Recall Beats Precision for Real-World AI Research

NewsCatcher

Tutorial

January 14, 2026

Building a Deep Research Agent with CatchAll and CrewAI

NewsCatcher

Product

January 13, 2026

Evaluating Recall in Web Search APIs: OpenAI vs Exa vs Parallel AI vs CatchAll

NewsCatcher

Tutorial

December 29, 2025

Building a Supply Chain Risk Monitor Using CatchAll and CrewAI

NewsCatcher

Product

November 21, 2025

Introducing CatchAll: A SOTA Web Search API for Real-World Events

NewsCatcher

Company

June 10, 2025

How Transparency International Uses NewsCatcher Data to Fight Health Corruption

Jonathan Cushing Programme Director

Company

March 14, 2025

Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

Aditya Singh Head Of Product

Product

March 6, 2025

Measuring Product Launch Impact with News Data

Mariia Platonova Head of Marketing

Company

January 24, 2025

NewsCatcher Partners With Reworkd To Streamline Access To Actionable Web Data

Artem Bugara CEO & co-founder

Tutorial

January 22, 2025

Fake News Detection Using Python

Karthik Devan Tech Copywriter

Company

December 16, 2024

Top Media Outlets: 50 Essential News Sites to Consider for Your News Analysis in 2025

Mariia Platonova Head of Marketing

Product

December 9, 2024

How Does Our Local News API Work?

Aditya Singh Head Of Product

Tutorial

November 25, 2024

Detecting Events in News Using NewsCatcher’s Events Intelligence API

Aditya Singh Head Of Product

Product

November 5, 2024

Introducing NewsCatcher's Local News API

Aditya Singh Head Of Product

Company

October 15, 2024

How to Choose a News API

Artem Bugara CEO & co-founder

Product

September 17, 2024

Using Sentiment Analysis for Market Research

Artem Bugara CEO & co-founder

Company

August 8, 2024

60,000 AI-generated news articles are published every day

Bradley Emi CTO Pangram Labs

Product

May 7, 2024

Top 4 Free & Open-Source News API Alternatives

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Ultimate Guide To Text Similarity With Python

Aditya Singh Head Of Product

Product

May 7, 2024

Using News API For Share Of Voice (SOV) Measurement & Competitor Tracking

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh Head Of Product

Company

May 7, 2024

Top 15 Takeaways From Running A Bootstrapped Startup For 1 Year

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Named Entity Recognition (NER) with SpaCy [with code example]

Aditya Singh Head Of Product

Product

May 7, 2024

How We Built A News API Beta In 60 Days

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

How To Annotate Entities With Spacy PhraseMatcher

Aditya Singh Head Of Product

Tutorial

May 7, 2024

How To Present/Show Open-Source Projects [Practical Guide]

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Google Kubernetes Engine as an alternative to Cloud Run

Maksym Sugonyaka

Tutorial

May 7, 2024

Google News RSS Search Parameters: The Missing Docs

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Building A PR/Communication Media Monitoring Tool With News API

Artem Bugara CEO & co-founder

Product

May 7, 2024

100k+ Rows Topic Labeled News Dataset

Artem Bugara CEO & co-founder

Product

May 7, 2024

Announcing Free COVID-19 News API

Artem Bugara CEO & co-founder

Tutorial

March 14, 2024

SpaCy vs NLTK. Text Normalization Comparison [with code]

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Top 6 Text Annotation Tools

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Sentiment Analysis Using Python

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Mining Financial Stock News Using SpaCy Matcher

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Learning Natural Language Processing (NLP) Made Easy

Aditya Singh Head Of Product

Tutorial

March 14, 2024

How To Classify Text With Python, Transformers & scikit-learn

Aditya Singh Head Of Product

Tutorial

March 14, 2024

How To Build Your Own Crypto News Aggregator

Aditya Singh Head Of Product

Tutorial

March 14, 2024

4 Python Web Scraping Libraries To Mining News Data

Aditya Singh Head Of Product

Also interesting

all articles

Product

February 3, 2026

Google found 69 results. We found 3,261. Here's how

NewsCatcher

Company

January 28, 2026

Why Recall Beats Precision for Real-World AI Research

NewsCatcher

Tutorial

January 14, 2026

Building a Deep Research Agent with CatchAll and CrewAI

NewsCatcher

Product

January 13, 2026

Evaluating Recall in Web Search APIs: OpenAI vs Exa vs Parallel AI vs CatchAll

NewsCatcher

Tutorial

December 29, 2025

Building a Supply Chain Risk Monitor Using CatchAll and CrewAI

NewsCatcher

Product

November 21, 2025

Introducing CatchAll: A SOTA Web Search API for Real-World Events

NewsCatcher

Company

June 10, 2025

How Transparency International Uses NewsCatcher Data to Fight Health Corruption

Jonathan Cushing

Programme Director

Company

March 14, 2025

Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

Aditya Singh

Head Of Product

Product

March 6, 2025

Measuring Product Launch Impact with News Data

Mariia Platonova

Head of Marketing

Company

January 24, 2025

NewsCatcher Partners With Reworkd To Streamline Access To Actionable Web Data

Artem Bugara

CEO & co-founder

Tutorial

January 22, 2025

Fake News Detection Using Python

Karthik Devan

Tech Copywriter

Company

December 16, 2024

Top Media Outlets: 50 Essential News Sites to Consider for Your News Analysis in 2025

Mariia Platonova

Head of Marketing

Product

December 9, 2024

How Does Our Local News API Work?

Aditya Singh

Head Of Product

Tutorial

November 25, 2024

Detecting Events in News Using NewsCatcher’s Events Intelligence API

Aditya Singh

Head Of Product

Product

November 5, 2024

Introducing NewsCatcher's Local News API

Aditya Singh

Head Of Product

Company

October 15, 2024

How to Choose a News API

Artem Bugara

CEO & co-founder

Product

September 17, 2024

Using Sentiment Analysis for Market Research

Artem Bugara

CEO & co-founder

Company

August 8, 2024

60,000 AI-generated news articles are published every day

Bradley Emi

CTO Pangram Labs

Product

May 7, 2024

Top 4 Free & Open-Source News API Alternatives

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Ultimate Guide To Text Similarity With Python

Aditya Singh

Head Of Product

Product

May 7, 2024

Using News API For Share Of Voice (SOV) Measurement & Competitor Tracking

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh

Head Of Product

Company

May 7, 2024

Top 15 Takeaways From Running A Bootstrapped Startup For 1 Year

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Named Entity Recognition (NER) with SpaCy [with code example]

Aditya Singh

Head Of Product

Product

May 7, 2024

How We Built A News API Beta In 60 Days

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

How To Annotate Entities With Spacy PhraseMatcher

Aditya Singh

Head Of Product

Tutorial

May 7, 2024

How To Present/Show Open-Source Projects [Practical Guide]

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Google Kubernetes Engine as an alternative to Cloud Run

Maksym Sugonyaka

Tutorial

May 7, 2024

Google News RSS Search Parameters: The Missing Docs

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Building A PR/Communication Media Monitoring Tool With News API

Artem Bugara

CEO & co-founder

Product

May 7, 2024

100k+ Rows Topic Labeled News Dataset

Artem Bugara

CEO & co-founder

Product

May 7, 2024

Announcing Free COVID-19 News API

Artem Bugara

CEO & co-founder

Tutorial

March 14, 2024

SpaCy vs NLTK. Text Normalization Comparison [with code]

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Top 6 Text Annotation Tools

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Sentiment Analysis Using Python

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Mining Financial Stock News Using SpaCy Matcher

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Learning Natural Language Processing (NLP) Made Easy

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

How To Classify Text With Python, Transformers & scikit-learn

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

How To Build Your Own Crypto News Aggregator

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

4 Python Web Scraping Libraries To Mining News Data

Aditya Singh

Head Of Product