Rigorous benchmark of CatchAll vs Exa, Parallel AI, and OpenAI Deep Search across 35 real-world queries. See why CatchAll achieves 3x better recall with 60% higher F1 scores for comprehensive event discovery.

When we tell people that CatchAll finds 3x more relevant events than competitors, the natural question is: “Can you prove it?”

This is the complete story of how we benchmarked CatchAll against Exa Websets, Parallel AI FindAll, and OpenAI Deep Search using 35 real-world queries, rigorous evaluation methodology, and transparent reporting of limitations.

Why We Did This

The architectural bet: CatchAll’s design optimizes for recall (finding more relevant events) over precision (showing only perfect results). The more common approach in the market prioritizes precision—returning a smaller, highly filtered set of results to maximize accuracy. This works brilliantly for traditional web search (“show me the top 10 articles about AI”) but fails for comprehensive event detection (“find all AI acquisitions in December”).

The skepticism: Early customers questioned our approach: “So you’re trading precision for recall? How do I know the extra results are worth sifting through?”

The commitment: We built a rigorous, reproducible evaluation framework to answer this with data, not claims:

  1. Test against real competitors using real queries
  2. Use objective metrics, not cherry-picked anecdotes
  3. Be transparent about limitations and failures
  4. Be re-runnable as we improve

Evaluation Methodology

Query Design: 35 Real-World Questions

We crafted queries reflecting actual business use cases: market intelligence (“Catch all AI acquisitions Dec 1–7, 2025”), regulatory monitoring (“Catch all product recalls Dec 1–7, 2025”), risk detection (“Catch all data breaches Dec 1–7, 2025”), and labor tracking (“Catch all labor strikes Dec 1–7, 2025”).

Each query specified exact time periods (7-14 days, adjusted to 1-3 days for high-volume topics), creating a reproducible snapshot that we could periodically retest.

Providers and Configuration

We tested CatchAll, Exa Websets, Parallel AI, and OpenAI Deep Search. Each was configured to return the maximum number of results, since this is a comparison of event discovery, not search ranking.

Metrics That Matter

Core definitions:

  • TP (True Positives): Relevant events correctly identified
  • FP (False Positives): Irrelevant results returned
  • FN (False Negatives): Relevant events missed by the evaluated provider but found by at least one competitor

Precision: Of the returned results, what percentage were relevant?

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Observable Recall: Of all relevant events found by ANY provider, what percentage did each provider find?

\[ \text{Observable Recall} = \frac{TP}{TP + FN} \]

where FN counts the unique events found by competitors that the evaluated provider missed.

F1 Score (Primary): Harmonic mean of precision and recall.

\[ F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
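
To make the definitions concrete, here is a minimal Python sketch of the three metrics. The example numbers are taken from the AI funding query in the results section below, with FN inferred from its reported 88.1% observable recall; the helper class and function names are ours for illustration, not part of the evaluation pipeline.

```python
from dataclasses import dataclass

@dataclass
class ProviderScore:
    tp: int  # relevant events the provider returned
    fp: int  # irrelevant results it returned
    fn: int  # relevant events it missed that another provider found

def precision(s: ProviderScore) -> float:
    return s.tp / (s.tp + s.fp) if (s.tp + s.fp) else 0.0

def observable_recall(s: ProviderScore) -> float:
    return s.tp / (s.tp + s.fn) if (s.tp + s.fn) else 0.0

def f1(s: ProviderScore) -> float:
    p, r = precision(s), observable_recall(s)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# AI funding query, CatchAll row: TP=290, FP=228, FN≈39 (inferred from the 88.1% recall)
catchall = ProviderScore(tp=290, fp=228, fn=39)
print(f"P={precision(catchall):.3f} R={observable_recall(catchall):.3f} F1={f1(catchall):.3f}")
# P=0.560 R=0.881 F1=0.685
```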

Critical limitation: We measure "recall within observable universe" (events found by at least one provider), not absolute recall (all events that occurred). If 500 AI acquisitions happened, but tools found 329 combined, our recall is based on 329. True recall might be lower for all providers.

This limitation affects all providers equally, so the relative comparison (CatchAll’s 77.5% vs Exa’s 24.6%) remains meaningful: CatchAll finds 3x more events within the same observable universe.

LLM-as-a-Judge Evaluation

We fine-tuned Gemini 2.5 Pro to score relevance and validated it against 1,000 manually tagged examples, reaching 92% agreement with human judgments. CatchAll had a 0.1% evaluation failure rate; competitors had 40-70%, due to URL enrichment issues (explained below).
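
As a rough sketch of the validation step, the agreement check boils down to comparing the judge's verdicts with human tags over the same sample. The call_judge callable below is a hypothetical stand-in for the fine-tuned Gemini 2.5 Pro scorer, not the actual client code we ran.

```python
from typing import Callable

def agreement_rate(
    samples: list[dict],  # each: {"query": str, "article": str, "human_label": bool}
    call_judge: Callable[[str, str], bool],  # hypothetical judge: (query, article) -> relevant?
) -> float:
    """Fraction of manually tagged samples where the LLM judge agrees with the human tag."""
    matches = sum(
        call_judge(s["query"], s["article"]) == s["human_label"] for s in samples
    )
    return matches / len(samples)

# In our run, this check over 1,000 manually tagged samples came out at 92% agreement.
```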

Deduplication

Without effective deduplication, high recall is meaningless—the same event repeated 50 times provides no value. We tested three approaches:

  • Embeddings + clustering: ~60-70% accuracy—too many duplicates slip through
  • Pure LLM comparison: Accurate but O(n²) prohibitive for large result sets
  • Iterative LLM with keyword grouping: 95%+ accuracy, manageable cost

Winner: Group records by keywords, compare within groups using Claude Opus 4.1, then iteratively refine. Our 94.5% uniqueness rate proves this approach works.
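
A simplified sketch of that winning approach is below. It assumes each candidate event carries an entity and an event type to build the grouping key, and llm_same_event is a placeholder for the Claude Opus 4.1 comparison call; the iterative refinement pass is omitted here.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable

def keyword_groups(events: list[dict]) -> dict[str, list[int]]:
    """Bucket event indices by a coarse key so LLM comparisons stay within small groups."""
    groups: dict[str, list[int]] = defaultdict(list)
    for i, ev in enumerate(events):
        key = f'{ev["entity"].lower()}|{ev["event_type"]}'  # illustrative grouping key
        groups[key].append(i)
    return groups

def deduplicate(
    events: list[dict],
    llm_same_event: Callable[[dict, dict], bool],  # placeholder for the LLM comparison call
) -> list[dict]:
    """Keep the first record of each event; drop later records the LLM marks as duplicates."""
    duplicate_of: dict[int, int] = {}
    for indices in keyword_groups(events).values():
        for i, j in combinations(indices, 2):  # pairwise, but only within a small group
            if j not in duplicate_of and llm_same_event(events[i], events[j]):
                duplicate_of[j] = i
    return [ev for idx, ev in enumerate(events) if idx not in duplicate_of]
```

Grouping first is what keeps the cost manageable: the pairwise LLM comparisons remain O(k²) within each small keyword group rather than O(n²) across the full result set.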

Cross-provider deduplication: After deduplicating within each provider, we ran a second pass across all providers to identify when different tools found the same event. This gives us unique_tp_total (3,755)—the denominator for fair recall calculation.

Our results: Across 35 queries, CatchAll achieved 94.5% uniqueness—higher than Exa (71.5%) and Parallel AI (67.6%), though slightly behind OpenAI Deep Search's 97.4%. This proves that high recall doesn't require tolerating duplicates.

Fairness: URL Enrichment

CatchAll returns complete articles from our 2B article index. Competitors return URLs. To evaluate fairly, we enriched competitor URLs via NewsCatcher APIs (search by URL → parser fallback). This failed 40-70% of the time. When enrichment failed, we “forced” results to true positives to avoid unfair penalization.

Key point: Returning complete articles is an advantage within our evaluation pipeline, not a claim about product quality. In real-world usage, competitors return URLs that users can open directly.
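
For context, the enrichment fallback chain looks roughly like the sketch below. Both callables are hypothetical wrappers standing in for the NewsCatcher search-by-URL and parser steps; they are not the actual API client.

```python
from typing import Callable, Optional

def enrich_url(
    url: str,
    search_by_url: Callable[[str], Optional[str]],  # hypothetical: look the URL up in the index
    parse_url: Callable[[str], Optional[str]],      # hypothetical: fetch and parse the page
) -> tuple[Optional[str], bool]:
    """Return (article_text, forced_tp).

    forced_tp=True means both steps failed and the result is counted as a
    true positive so the provider is not penalized for our enrichment gap.
    """
    article = search_by_url(url) or parse_url(url)
    if article is None:
        return None, True
    return article, False
```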

Reproducibility

Frozen period: November 20 - December 7, 2025. We’ve built an automated pipeline to re-run quarterly with updated results.

Results

Overall Performance

| Metric | CatchAll | Exa Websets | Parallel AI | OpenAI Deep Search |
| --- | --- | --- | --- | --- |
| F1 Score | 0.527 ⭐ | 0.328 | 0.067 | 0.081 |
| Precision | 0.437 | 0.828 ⭐ | 0.597 | 0.816 |
| Recall | 0.775 ⭐ | 0.246 | 0.039 | 0.044 |
| Uniqueness % | 94.5% | 71.5% | 67.6% | 97.4% ⭐ |
| Query Wins | 25/35 ⭐ | 10/35 | 0/35 | 0/35 |

⭐ = Best in category

CatchAll wins 71% of queries with 60% better F1 than Exa. Observable recall of 77.5% means we find 3 out of 4 relevant events, while Exa finds 1 out of 4.

Coverage Analysis

Across 35 queries, providers found 3,755 unique events total:

  • CatchAll: 86.8% (3,261 events)
  • Exa: 16.9% (635 events)
  • OpenAI: 2.6% (98 events)
  • Parallel AI: 2.3% (85 events)

CatchAll found more events than all competitors combined. In 66% of queries, we achieved more than twice the recall of our best competitor.
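
As a sketch of how these coverage percentages are computed: after cross-provider deduplication, every true positive maps to a canonical event ID, and coverage is each provider's share of the combined universe (the 3,755 unique events above). The event IDs below are made up for illustration.

```python
def coverage(per_provider: dict[str, set[str]]) -> dict[str, float]:
    """Share of the observable universe (union of all providers) each provider found."""
    universe = set().union(*per_provider.values())  # corresponds to unique_tp_total
    return {name: len(ids) / len(universe) for name, ids in per_provider.items()}

example = {
    "CatchAll": {"evt-001", "evt-002", "evt-003"},
    "Exa":      {"evt-002", "evt-004"},
}
print(coverage(example))  # {'CatchAll': 0.75, 'Exa': 0.5}
```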

Query Examples

1. AI Funding (High Volume)

“Find all funding of AI products or AI-oriented companies between Dec 1-7, 2025.”

| Provider | F1 | TP | FP | Observable Recall |
| --- | --- | --- | --- | --- |
| CatchAll | 0.685 | 290 | 228 | 88.1% |
| Exa | 0.326 | 64 | 0 | 19.5% |

CatchAll found 4.5x more funding events.

2. Labor Strikes (Global)

“Find all labor strikes announced between Dec 1-7, 2025.”

| Provider | F1 | TP | FP | Observable Recall |
| --- | --- | --- | --- | --- |
| CatchAll | 0.696 | 351 | 303 | 99.2% |
| Exa | 0.054 | 10 | 5 | 2.8% |

Near-complete coverage (99.2%)—found 35x more events.

3. Product Recalls (Regulatory)

“Find all product recalls announced between Dec 1-7, 2025.”

| Provider | F1 | TP | FP | Observable Recall |
| --- | --- | --- | --- | --- |
| CatchAll | 0.821 | 236 | 88 | 94.0% |
| Parallel AI | 0.126 | 17 | 1 | 6.8% |

High recall (94%) AND high F1 (0.821)—balanced performance.

4. Taiwan Workplace Accidents (Where Exa Wins)

“Find all workplace accidents in Taiwan between Nov 24 - Dec 7, 2025.”

| Provider | F1 | TP | FP | Observable Recall |
| --- | --- | --- | --- | --- |
| Exa | 0.839 | 39 | 12 | 92.9% |
| CatchAll | 0.327 | 9 | 4 | 21.4% |

Exa wins on small, geographically specific queries where precision matters more than coverage.

5. Data Breaches (Security)

“Find all data breaches disclosed between Dec 1-7, 2025.”

| Provider | F1 | TP | FP | Observable Recall |
| --- | --- | --- | --- | --- |
| CatchAll | 0.625 | 316 | 374 | 98.1% |
| Exa | 0.094 | 16 | 4 | 5.0% |

For security monitoring, missing 95% of breaches is worse than filtering false positives.

What We Learned

1. Architectural Trade-offs Are Design Choices

The numbers reflect our design philosophy: CatchAll reaches 77.5% observable recall at the cost of lower precision (43.7%). Competitors achieve 60-80% precision with 2-25% recall. Neither is wrong; they solve different problems.

When building databases of M&A, funding, or regulatory events, comprehensive coverage with noise beats perfect top-10 results with massive gaps.

Exception: For small result sets, geographically specific queries, or automated trading where false positives are costly, Exa’s precision-first architecture excels.

2. Query Design Reveals Product Fit

Our queries were structured event detection with clear boundaries: “Find all layoffs involving 50+ employees in the US between Dec 1-7, 2025.” This is CatchAll’s target use case. For narrative synthesis or exploratory research, different tools win.

3. Evaluation Methodology Matters

CatchAll had 0.1% forced TPs; competitors had 40-70%. Why? CatchAll returns complete articles from our 2B index, while competitors return URLs we had to enrich via NewsCatcher APIs, and that enrichment failed 40-70% of the time. We forced failed enrichments to true positives to avoid unfairly penalizing competitors.

The Bottom Line

CatchAll isn't universally better; it's better at comprehensive event discovery, trading precision for recall. In beta, this reflects our belief that a missed event is harder to recover than a false positive is to filter out.

The numbers:

  • 60% better F1 score (0.527 vs 0.328)
  • 3.1x better observable recall (77.5% vs 24.6%)
  • 71% query win rate (25 out of 35)
  • 86.8% coverage of all events found
  • 94.5% uniqueness rate (clean, deduplicated results)

When to use CatchAll:

  • Building databases of M&A, funding, regulatory filings, incidents
  • Monitoring broad-scope topics (global events, multiple industries)
  • Missing events is worse than filtering false positives

When to use competitors:

  • Small result sets with high precision requirements (Exa)
  • Narrative synthesis over structured data (OpenAI Deep Search)
  • Entity-specific research with validated matches (Parallel AI)
  • Top-10 quality results are sufficient

Ongoing Evaluation & Transparency

This evaluation represents November-December 2025, using CatchAll v0.5.1. As all tools improve, results will change.

Our commitment:

  • Quarterly re-evaluation: Re-run these 35 queries and publish updated results
  • Expanded coverage: Add new queries and competitors as the space evolves
  • Open methodology: Full query list and raw results available upon request
  • Honest reporting: Report when competitors improve and when CatchAll regresses

Where we're improving:

Precision improvements are more straightforward than recall gains—better validation models and refined extraction prompts reduce false positives without touching our coverage architecture. We're working on both. Next quarter's numbers will show whether we can maintain 77.5% recall while closing the precision gap.

Try It Yourself

Run your own queries and see if CatchAll’s recall-first approach works for your use case.

Start free trial →

  • 2000 free credits to start
  • Full access to monitoring and webhooks
  • Compare results with your current tools

Questions about the methodology? Email us at support@newscatcherapi.com

Found an error or have suggestions? We’re committed to rigorous, honest evaluation. If you spot issues with our methodology or want to propose additional tests, please reach out.

Last updated: January 13, 2026 | Evaluation period: November 20 - December 7, 2025 | CatchAll v0.5.1