Behind The Scenes:
The Tech Driving Our API
Discover how our API processes data to deliver unmatched insights.
Our process begins with a proprietary scheduling algorithm that monitors the publication frequencies of different sources over a week. This data informs our crawlers, allowing us to efficiently gather new article links without overwhelming system resources. This method ensures an optimal balance between timeliness and resource utilization.
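The scheduler itself is proprietary, but the core idea can be sketched in a few lines. Everything below is an illustrative assumption, not our actual implementation: the `crawl_interval` function, the interval bounds, and the poll-roughly-twice-per-publication heuristic are all stand-ins.

```python
from datetime import timedelta

# Illustrative bounds: never poll a source more than every 15 minutes,
# and check even dormant sources once a day.
MIN_INTERVAL = timedelta(minutes=15)
MAX_INTERVAL = timedelta(hours=24)

def crawl_interval(publishes_last_week: int) -> timedelta:
    """Derive a per-source crawl interval from last week's publish count."""
    if publishes_last_week <= 0:
        return MAX_INTERVAL  # dormant source: one daily check
    # Heuristic: poll roughly twice per expected publication.
    interval = timedelta(weeks=1) / (2 * publishes_last_week)
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

A source publishing seven articles a week would be polled about every twelve hours under this sketch, while a wire service publishing thousands would hit the 15-minute floor.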
We fetch and store the raw webpage for each article link. This archival strategy provides the flexibility to enhance data extraction methods retrospectively as new techniques become available, ensuring continuous improvement in data quality.
We utilize five distinct extraction methods to retrieve article data, including two advanced adaptations of open-source technologies and three proprietary techniques developed in-house. This diverse toolkit enables us to handle a wide range of article formats and data types effectively.
After extraction, data from different sources is integrated into a unified article format. Our system applies advanced deduplication, identifying each article by its URL combined with an internally generated ID derived from multiple data points, so every article is unique and consistently formatted. The extraction process focuses in particular on the accuracy of the full article text, publish dates, and author details.
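The dedup idea can be sketched as a stable ID built from the URL plus a fingerprint of other data points. The exact fields hashed here (normalized title and publish date) and the 16-character truncation are illustrative assumptions:

```python
import hashlib

def article_id(url: str, title: str, published: str) -> str:
    """Stable ID: the URL combined with a fingerprint of other data points."""
    fingerprint = "|".join([title.strip().lower(), published])
    return hashlib.sha256((url + "|" + fingerprint).encode("utf-8")).hexdigest()[:16]

def deduplicate(articles: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct article ID."""
    seen, unique = set(), []
    for article in articles:
        aid = article_id(article["url"], article["title"], article["published"])
        if aid not in seen:
            seen.add(aid)
            unique.append(article)
    return unique
```

Normalizing the title before hashing means trivial variants (casing, stray whitespace) collapse into one article.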
The next phase involves a comprehensive data cleaning process. We use a detailed directory of patterns to identify and remove irrelevant information. This meticulous approach significantly enhances the quality of the information.
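Pattern-directory cleaning looks roughly like this in miniature. The three regexes below are illustrative examples of boilerplate rules, not entries from our actual directory:

```python
import re

# Illustrative boilerplate rules; a real directory holds many more.
CLEANUP_PATTERNS = [
    re.compile(r"(?i)subscribe to our newsletter.*$", re.MULTILINE),
    re.compile(r"(?i)advertisement\s*"),
    re.compile(r"(?i)share this article.*$", re.MULTILINE),
]

def clean_text(text: str) -> str:
    """Strip boilerplate matches, then tidy the whitespace left behind."""
    for pattern in CLEANUP_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```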
Cleaned articles are processed through an advanced Natural Language Processing (NLP) pipeline. This stage includes summarizing the content, classifying articles into broad news topics, detecting named entities, and assessing sentiment. This enriches the articles, making them more actionable and insightful for users.
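The shape of the enrichment stage can be sketched with toy stand-ins: each step annotates the article dict in place. The keyword lists, the truncation-based summary, and the word-count sentiment score below are deliberately naive placeholders for the real models.

```python
# Toy lexicons standing in for real sentiment models.
POSITIVE = {"growth", "record", "wins"}
NEGATIVE = {"loss", "lawsuit", "decline"}

def enrich(article: dict) -> dict:
    """Annotate an article with a summary and a sentiment label."""
    words = article["text"].lower().split()
    # Placeholder summary: first 25 words (a real pipeline uses a model).
    article["summary"] = " ".join(article["text"].split()[:25])
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    article["sentiment"] = (
        "positive" if score > 0 else "negative" if score < 0 else "neutral"
    )
    return article
```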
Processed articles are indexed in our main production Elasticsearch (ES) clusters for querying. We also distribute specific datasets to dedicated client clusters and shared cloud storage to ensure high availability and performance.
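Indexing into ES is typically done in bulk. The sketch below builds actions in the shape expected by the official Python client's `helpers.bulk()`; the index name `articles-prod` and the document fields are illustrative assumptions.

```python
def bulk_actions(articles: list[dict], index: str = "articles-prod"):
    """Yield bulk-index actions consumable by elasticsearch.helpers.bulk()."""
    for article in articles:
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": article["id"],  # the dedup ID doubles as the ES doc ID
            "_source": {k: v for k, v in article.items() if k != "id"},
        }
```

Reusing the deduplication ID as the ES document ID makes re-indexing idempotent: indexing the same article twice overwrites rather than duplicates.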
Our system dynamically filters and groups articles based on user queries, employing sophisticated algorithms to cluster similar articles and deliver highly relevant results swiftly and efficiently.
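A minimal way to see clustering in action is greedy grouping by title similarity. The Jaccard measure, the 0.5 threshold, and the first-match-wins assignment are all illustrative simplifications of the production algorithms:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap ratio between two token sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_titles(titles: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedily group titles whose token overlap meets the threshold."""
    clusters: list[tuple[set, list[str]]] = []
    for title in titles:
        tokens = set(title.lower().split())
        for rep_tokens, members in clusters:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(title)
                break
        else:
            clusters.append((tokens, [title]))
    return [members for _, members in clusters]
```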
We continuously develop custom solutions tailored to the unique needs of our clients. This bespoke service is part of our commitment to delivering exceptional value and adapting to the challenges our users face. Here are some solutions we have already built.
Custom Solutions for Every Need
NewsCatcher extends beyond standard offerings to provide customized solutions for diverse enterprise requirements.
Entity Disambiguation
Cut through the clutter with precision - ensure every article pinpoints the exact company or individual you’re tracking.
Events Intelligence
Leverage our global event data stream to stay ahead in the market and turn insights into actionable business strategies.
Insights Engine
Unearth hidden gems and nurture their growth - our market intelligence shines a spotlight on emerging opportunities awaiting your touch.
Localized News
Keep your finger on the pulse of any town or region - our localized news coverage brings you the latest happenings right where they matter.
Not worrying about the data input so that we can do things like that is essential. It’s almost like we were a farm-to-table restaurant, and we were growing our own vegetables. And then the NewsCatcher guys came in and said, ‘You don’t have to worry about that. Just focus on the kitchen.’
Read about some of our recent projects
Ready for Custom News Solutions?
Drop your email and find out how our API delivers precisely what your business needs.