Company

July 6, 2026

Structured Data Extraction from Web Search Results: JSON Schemas, Validation Prompts, and What Goes Wrong

Artem Bugara

CEO & co-founder

Introduction

Turning raw web search results into clean, machine-readable data is one of the hardest challenges in building AI pipelines today. Every time AI agents, LLM applications, and automation workflows pull results from the web, they get a mix of HTML fragments, inconsistent formatting, missing fields, and noise.

These systems struggle with raw text because downstream software relies on deterministic rules. When an LLM processes raw HTML, it can easily mistake structural tags, hidden CSS elements, or ad injections for actual page content. Furthermore, retrieving the wrong documents leads to missing fields and fabricated values that no amount of prompt tuning can fix.

Getting reliable structured data extraction requires automated pipelines that convert unstructured web content into standardized, machine-readable JSON. Such a structured web data extraction pipeline depends on accurate JSON schemas, validation prompts, and retrieval quality.

How Does Structured Data Extraction from Web Search Results Work?

Structured web data extraction is the process of searching the internet for specific information and organizing it into a clear format, such as JSON. The pipeline follows these steps:

Document Retrieval: The process begins with a search system that uses a web search API, such as CatchAll by NewsCatcher, to retrieve unstructured documents from the open web.
Schema Injection and Parsing: The system then feeds this text into the context window of a language model, along with a predefined structure, like a JSON schema. The model acts as a context-aware parser and identifies key entities, numerical figures, and semantic relationships.
Constrained Generation: As the model generates its answer, structured-output constraints (grammar constraints such as token masking) are enforced at the token generation stage, so the extracted facts come out in a form that complies with the schema.

This schema-guided workflow transforms unstructured web content into usable data that can be easily integrated into downstream applications.
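
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of CatchAll or of any model provider: `search_fn` and `llm_fn` are hypothetical callables standing in for a web search client and a schema-constrained model call, and the stub functions at the bottom exist only to make the sketch runnable.

```python
import json
from typing import Callable


def build_extraction_prompt(schema: dict, documents: list[str]) -> str:
    """Step 2: place the schema and the retrieved text in one prompt."""
    joined = "\n\n---\n\n".join(documents)
    return (
        "Extract the fields defined by this JSON schema from the documents.\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Documents:\n{joined}"
    )


def run_pipeline(
    query: str,
    schema: dict,
    search_fn: Callable[[str], list[str]],
    llm_fn: Callable[[str, dict], str],
) -> dict:
    # Step 1: document retrieval (search_fn is a hypothetical search client).
    documents = search_fn(query)
    if not documents:
        raise ValueError("search returned no documents; nothing to extract")
    # Step 2: schema injection.
    prompt = build_extraction_prompt(schema, documents)
    # Step 3: generation. Schema enforcement is assumed to happen inside llm_fn.
    raw_output = llm_fn(prompt, schema)
    return json.loads(raw_output)


# Stand-ins so the sketch runs without network access (purely illustrative).
def fake_search(query: str) -> list[str]:
    return ["Acme Corp announced it will acquire Widget Inc to expand in Europe."]


def fake_llm(prompt: str, schema: dict) -> str:
    return json.dumps(
        {
            "acquirer_name": "Acme Corp",
            "target_name": "Widget Inc",
            "rationales": ["expand in Europe"],
        }
    )


demo_schema = {
    "type": "object",
    "properties": {
        "acquirer_name": {"type": "string"},
        "target_name": {"type": "string"},
        "rationales": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["acquirer_name", "target_name", "rationales"],
}

result = run_pipeline("Acme acquisition", demo_schema, fake_search, fake_llm)
print(result["acquirer_name"])  # prints: Acme Corp
```

Passing the search and model clients in as arguments keeps retrieval and extraction separate, so each stage can be swapped or tested on its own.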


Figure: A structured data extraction pipeline showing web search API retrieval, raw text injection into an LLM context window, schema-guided token filtering, and the final structured output.

How Do JSON Schemas Improve Structured Data Extraction?

JSON schemas improve structured data extraction by acting as formatting templates. They prevent the language model from guessing how to organize its output. Instead, a schema tells the model what information to extract and what data type, like a string, integer, or boolean, to use.

A schema also explicitly defines required vs. optional fields. A required field guarantees that the key appears in the output, which prevents your downstream applications from crashing due to a missing key. Pair it with an explicit way to report a missing value, such as null, so the model is not pushed to invent one. Marking a field as optional gives the model room to capture extra details if they exist in the source text, without breaking the entire extraction if they are missing.

In modern development work, teams handle semi-structured data extraction with rigid JSON blueprints. Among the many JSON schema examples used in practice, the one below shows an extraction schema for corporate acquisitions:

{
  "type": "object",
  "properties": {
    "acquirer_name": {
      "type": "string",
      "description": "Canonical name of the acquiring company"
    },
    "target_name": {
      "type": "string",
      "description": "Canonical name of the acquired company"
    },
    "deal_size_usd": {
      "type": "number",
      "description": "Announced deal value in USD. Omit if not reported."
    },
    "rationales": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "List of strategic rationales for the transaction"
    }
  },
  "required": [
    "acquirer_name",
    "target_name",
    "rationales"
  ]
}
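
To show how downstream code might use this schema, here is a minimal hand-rolled check written with only the Python standard library. It is a sketch, not a full JSON Schema validator: it covers only the `type`, `items`, and `required` keywords used above, and the function name `check_record` is an illustrative choice. In a real pipeline, a complete validator such as the `jsonschema` package would likely be a better fit.

```python
ACQUISITION_SCHEMA = {
    "type": "object",
    "properties": {
        "acquirer_name": {"type": "string"},
        "target_name": {"type": "string"},
        "deal_size_usd": {"type": "number"},
        "rationales": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["acquirer_name", "target_name", "rationales"],
}

# Maps the JSON Schema type names used above to Python checks.
# bool is excluded from "number" because bool is a subclass of int in Python.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for key in schema.get("required", []):
        if key not in record:
            errors.append(f"missing required field: {key}")
    for key, value in record.items():
        spec = schema["properties"].get(key)
        if spec is None:
            continue  # unknown keys are ignored in this sketch
        if not _TYPE_CHECKS[spec["type"]](value):
            errors.append(f"{key}: expected {spec['type']}")
        elif spec["type"] == "array":
            item_type = spec.get("items", {}).get("type")
            if item_type and not all(_TYPE_CHECKS[item_type](v) for v in value):
                errors.append(f"{key}: every item must be {item_type}")
    return errors


good = {
    "acquirer_name": "Acme Corp",
    "target_name": "Widget Inc",
    "rationales": ["market expansion"],
}
bad = {
    "acquirer_name": "Acme Corp",
    "target_name": "Widget Inc",
    "deal_size_usd": "50M",  # a string where the schema expects a number
}

print(check_record(good, ACQUISITION_SCHEMA))  # prints: []
print(check_record(bad, ACQUISITION_SCHEMA))
```

The second call reports two problems: the missing `rationales` field and the string value in `deal_size_usd`.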


However, a precise schema won’t help if the system fails to retrieve the right web pages in the first place. To build reliable pipelines, use a web search API that retrieves a more complete set of relevant web documents, so that the source text can actually populate the schema's fields.

What Are Validation Prompts and Why Do They Matter?

JSON schema validation prompts are natural language instructions. They guide a secondary independent model to audit extracted JSON fields for factual accuracy and logical alignment with the source text.

While JSON schemas control output syntax, they can't verify if the model fabricated a number or ignored facts. A validation workflow addresses this by passing extracted JSON and raw documents to a validation model. This secondary layer checks for logic errors, verifies field completeness, and greatly reduces hallucinations by requiring verbatim textual evidence for every attribute.

To keep validation workflows robust, teams use multi-step pipelines that separate the research phase (data collection) from the analysis phase (extraction and validation). Because the retrieved documents are collected once and reused, a failed extraction or validation step can be retried without re-running the search, which avoids expensive API retry loops and makes silent generation failures easier to catch.

Figure: An independent LLM parsing extracted JSON fields against the original source web documents to check for factual accuracy and hallucinations.

Here is a practical example of a validation prompt for auditing the corporate acquisition schema.

System: You are a strict data validation agent. Your task is to verify that the Extracted JSON perfectly aligns with the Source Text.


Rules:
1. You must not use outside knowledge.
2. Every value in the Extracted JSON must have verbatim textual evidence in the Source Text.
3. If a value is inferred or hallucinated, mark it as an error.


Source Text: {raw_search_snippet}
Extracted JSON: {extracted_json_payload}


Task: Evaluate the extraction and return a JSON object using the following schema:
{
  "is_valid": boolean,
  "hallucinated_fields": ["list", "of", "keys"],
  "explanation": "Brief explanation of any failures."
}
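
The prompt above can be wired into code roughly as follows. This sketch assumes a shortened template with the same two placeholders, and adds a cheap deterministic pre-check that runs before any model call: it flags string values that do not appear verbatim in the source. The function names are illustrative, and numeric fields are skipped because numbers are often formatted differently in prose (for example "$50 million" vs. 50000000), so they still need the model-based audit.

```python
import json

# Shortened template; the placeholder names match the prompt above.
VALIDATION_TEMPLATE = (
    "Source Text: {raw_search_snippet}\n"
    "Extracted JSON: {extracted_json_payload}\n"
)


def fill_prompt(template: str, source_text: str, extracted: dict) -> str:
    # str.replace is used instead of str.format because the JSON payload
    # (and the full prompt) contains literal braces.
    return template.replace("{raw_search_snippet}", source_text).replace(
        "{extracted_json_payload}", json.dumps(extracted)
    )


def find_ungrounded_fields(extracted: dict, source_text: str) -> list[str]:
    """Return keys whose string values are not found verbatim in the source."""
    haystack = source_text.casefold()
    ungrounded = []
    for key, value in extracted.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            # Only strings are checked; numbers need the model-based audit.
            if isinstance(item, str) and item.casefold() not in haystack:
                ungrounded.append(key)
                break
    return ungrounded


def parse_validation_report(raw: str) -> dict:
    """Parse the validator's reply and check the fields the prompt asks for."""
    report = json.loads(raw)
    if not isinstance(report.get("is_valid"), bool):
        raise ValueError("is_valid must be a boolean")
    if not isinstance(report.get("hallucinated_fields"), list):
        raise ValueError("hallucinated_fields must be a list")
    return report


source = "Acme Corp agreed to acquire Widget Inc for $50 million."
payload = {
    "acquirer_name": "Acme Corp",
    "target_name": "Gadget LLC",  # not present in the source text
    "deal_size_usd": 50000000,
    "rationales": [],
}

print(find_ungrounded_fields(payload, source))  # prints: ['target_name']
```

Running the string check first is a design choice: it catches obvious fabrications for free, so the secondary model is only needed for the subtler cases.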

What Can Go Wrong During Structured Extraction?

Structured extraction pipelines fail due to unexpected API token truncation, incomplete document retrieval, and static validation logic that misses semantic errors.

Why Do LLMs Return Invalid JSON?

Models return invalid JSON most often because of output length limits, schema complexity, or conflicts between the extraction task and the model's default generation behavior. Some common failure modes include:

Failure Mode	Description	Mitigation Strategy
Truncated output	The model hits a token limit mid-JSON and leaves the structure unclosed. Common when processing long documents with complex nested schemas.	Increase the max_tokens parameter in the API request to a value high enough to complete the entire schema.
Schema violations	The model uses a string where the schema expects a number, or returns an enum value that is not in the allowed set.	Enforce strict structured outputs at the API level (like token masking) to restrict the model from generating invalid types.
Markdown wrapping	Models default to wrapping JSON in code fences (e.g., ```json ... ```), which breaks direct JSON parsers.	Programmatically strip markdown fences and whitespace from the output string before passing it to your JSON parser.
Extra commentary	The model adds conversational explanation text before or after the JSON object, invalidating the parse.	Add explicit system instructions demanding the model output ONLY valid JSON and absolutely no other text.
Missing required fields	The model omits fields it cannot find in the source text rather than explicitly handling the missing data.	Prompt the model to explicitly use null for any required fields that cannot be found in the retrieved documents.
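
Several of the mitigations in the table can be combined in one defensive parsing helper. The sketch below uses only the Python standard library; the function name `parse_model_json` is an illustrative choice. It strips Markdown fences, skips commentary around the JSON object, and raises a clear error when the JSON is incomplete, which is the typical symptom of truncation. It assumes any commentary before the object contains no `{` character.

```python
import json
import re

# A Markdown code fence, built from single backticks so this example
# stays safe to embed inside a fenced block.
FENCE = "`" * 3
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_model_json(raw_output: str) -> dict:
    """Extract the first JSON object from a model reply."""
    text = raw_output.strip()

    # Markdown wrapping: keep only the content inside the first fenced block.
    fenced = _FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()

    # Extra commentary: skip any text before the first opening brace.
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")

    try:
        # raw_decode parses one JSON value and ignores whatever follows it,
        # so trailing commentary does not break the parse.
        parsed, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError as exc:
        # Truncated output: an unclosed structure fails here.
        raise ValueError("invalid JSON; the output may have been truncated") from exc
    return parsed


wrapped = FENCE + 'json\n{"acquirer_name": "Acme Corp"}\n' + FENCE
chatty = 'Here is the result: {"acquirer_name": "Acme Corp"} Hope this helps!'

print(parse_model_json(wrapped))  # prints: {'acquirer_name': 'Acme Corp'}
print(parse_model_json(chatty))   # prints: {'acquirer_name': 'Acme Corp'}
```

When the helper raises on a truncated reply, the caller can retry with a higher output token limit rather than passing a broken payload downstream.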

Why Does Structured Data Contain Hallucinations?

Structured data can contain hallucinations because LLMs prioritize meeting schema requirements over strictly adhering to the source text. If a schema marks "CEO_Name" as a required field, and the source text does not mention the CEO, the model will often fabricate a name to satisfy the schema constraint.

Models also tend to be overconfident when faced with incomplete source material. If the search layer fails to retrieve the document containing the actual answer, the LLM rarely admits that it cannot find the data. Instead, it confidently fills the field using outdated information from its pre-training data, presenting a fabricated guess as a verified fact.

There are two ways to prevent this. First, improve retrieval with a recall-first web search solution like CatchAll, so the source material actually contains the needed information. Second, add validation prompts that explicitly check source grounding for each field.
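
A related schema-level safeguard is to let required fields accept null, so the model has a legitimate way to say "not found" instead of inventing a value. The helper below is a minimal sketch; `allow_null_for_required` is an illustrative name. JSON Schema permits `type` to be a list of types, but whether a given provider's structured-output mode accepts such unions is an assumption you should verify.

```python
import copy


def allow_null_for_required(schema: dict) -> dict:
    """Return a copy of the schema in which every required field may be null."""
    patched = copy.deepcopy(schema)  # leave the caller's schema untouched
    for key in patched.get("required", []):
        spec = patched["properties"][key]
        declared = spec["type"]
        types = declared if isinstance(declared, list) else [declared]
        if "null" not in types:
            spec["type"] = [*types, "null"]
    return patched


company_schema = {
    "type": "object",
    "properties": {
        "CEO_Name": {"type": "string"},
        "founded_year": {"type": "integer"},
    },
    "required": ["CEO_Name"],
}

patched = allow_null_for_required(company_schema)
print(patched["properties"]["CEO_Name"]["type"])  # prints: ['string', 'null']
```

With this change, `{"CEO_Name": null}` is a valid output when the source text never mentions the CEO, and downstream code still finds the key it expects.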

How Do Missing Search Results Affect Extraction Accuracy?

Missing search results degrade extraction accuracy because the structured extraction pipeline cannot process information that was never retrieved from the web. When search engines miss critical web documents, LLMs generate empty fields or hallucinate to meet schema requirements.

In high-stakes fields like regulatory monitoring or risk assessment, missing a single source creates severe operational blind spots. Therefore, the structural completeness of any semi-structured data extraction pipeline depends on the coverage and recall of the search layer.

Note: JSON schemas and validation prompts can improve output quality, but they cannot fix missing data. To prevent hallucinations and extraction errors caused by insufficient source material, you must use a recall-first web search solution like CatchAll. It ensures AI agents and extraction systems retrieve a more complete set of relevant documents before structured extraction begins.

How Does Search Recall Affect Structured Extraction?

Search recall is a major bottleneck in structured data extraction pipelines because low document coverage limits the factual scope of the LLM and renders even the best schemas and prompts ineffective.

Most search APIs fall short here because they prioritize returning a handful of top-ranked results, which caps the amount of data your system receives. Machine workflows and AI research agents, however, require high recall to find all instances of distributed facts, such as compiling every corporate acquisition or cybersecurity event across the open web.

When search recall is low, AI systems miss important target data, the extraction engine is left blind to critical signals, and the model may hallucinate to satisfy the schema. To prevent data loss, implement recall-first retrieval strategies. They provide a retrieval layer that maximizes source coverage and gives your LLM access to a more complete dataset before structured extraction.
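
Recall itself is simple to measure once you have a reference set: it is the share of known relevant documents that the search layer actually returned. The sketch below assumes you maintain a hand-checked list of relevant URLs for a test query; the URLs shown are placeholders.

```python
def recall(retrieved_urls: list[str], relevant_urls: list[str]) -> float:
    """Fraction of the known relevant documents that were retrieved."""
    relevant = set(relevant_urls)
    if not relevant:
        raise ValueError("relevant_urls must not be empty")
    return len(relevant & set(retrieved_urls)) / len(relevant)


# Placeholder data: four known acquisition announcements, and a search
# that surfaced only one of them among its results.
known_relevant = [
    "https://example.com/deal-1",
    "https://example.com/deal-2",
    "https://example.com/deal-3",
    "https://example.com/deal-4",
]
search_results = [
    "https://example.com/deal-2",
    "https://example.com/unrelated-a",
    "https://example.com/unrelated-b",
]

print(recall(search_results, known_relevant))  # prints: 0.25
```

At a recall of 0.25, three of the four acquisitions never reach the extraction step, and no schema or validation prompt can recover them.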

Summary

Custom structured data extraction depends on high-quality retrieval, rigid schema enforcement, and aggressive validation logic working in unison.

You must eliminate variables at every step of the pipeline. Define exact output parameters using strict JSON blueprints. Route extracted data through secondary validation prompts to verify source grounding and syntax accuracy.

Importantly, ensure your system processes a more complete set of source documents. The most advanced extraction prompt will fail if the underlying search API fails to retrieve the necessary data.

Documentation: newscatcherapi.com/docs/web-search-api/get-started/introduction
Questions? Email us at support@newscatcherapi.com