How to retrieve more than 10,000 articles
Learn how to use time-chunking methods in the Python SDK to retrieve large volumes of articles
The Newscatcher API limits results to 10,000 articles per search query. The Python SDK provides special methods that automatically split your search across multiple time periods to bypass the limit and retrieve all articles relevant to your query.
These advanced retrieval methods are available only in the Python SDK.
Understanding the article limit
When your query matches more than 10,000 articles, the API returns "total_hits": 10000 as a hard limit, and you cannot retrieve more through standard pagination.
Using time-chunking methods
The SDK provides two special methods to retrieve large volumes of articles:
get_all_articles
get_all_headlines
Both methods are available for synchronous and asynchronous clients.
How time-chunking works
Time-chunking divides your date range into smaller intervals, making separate API calls for each period and combining the results. Each interval can return up to 10,000 articles.
For example, with time_chunk_size="1d" over 5 days, the method queries 5 one-day intervals, paginating automatically within each, and can retrieve up to 50,000 articles.
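The interval arithmetic described above can be sketched in plain Python. This is only an illustration of the idea, not the SDK's implementation: parse_chunk_size and split_into_chunks are hypothetical helpers, and only hour ("h") and day ("d") units are handled here.

```python
from datetime import datetime, timedelta

# Hypothetical unit table for this sketch; the SDK may accept other units.
_UNITS = {"h": "hours", "d": "days"}


def parse_chunk_size(size: str) -> timedelta:
    """Convert a string such as "1d" or "6h" into a timedelta."""
    value, unit = int(size[:-1]), size[-1]
    if unit not in _UNITS or value <= 0:
        raise ValueError(f"unsupported chunk size: {size!r}")
    return timedelta(**{_UNITS[unit]: value})


def split_into_chunks(start: datetime, end: datetime, chunk_size: str):
    """Return consecutive (chunk_start, chunk_end) pairs covering [start, end)."""
    step = parse_chunk_size(chunk_size)
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + step, end)  # the last chunk may be shorter
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


# Five days split into one-day chunks gives five intervals.
chunks = split_into_chunks(datetime(2024, 1, 1), datetime(2024, 1, 6), "1d")
# Each interval is capped at 10,000 articles, so this is the upper bound.
max_possible = len(chunks) * 10_000
```

Each interval is then searched separately, so the 10,000-article cap applies per interval rather than to the whole date range.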
Choosing the right chunk size
The optimal chunk size depends on how many articles your query returns:
Query type | Article volume | Recommended chunk size |
---|---|---|
Extremely broad | 10,000+ per hour | "1h" |
Very broad | 10,000+ per day | "6h" |
Broad | 3,000-10,000 per day | "1d" |
Moderate | 1,000-3,000 per day | "3d" |
Specific | 100-1,000 per day | "7d" |
Very specific | < 100 per day | "30d" |
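If you already have a rough estimate of daily volume, the table can be expressed as a small lookup. The function below is a hypothetical helper, not part of the SDK. It assumes that 10,000 articles per hour corresponds to roughly 240,000 per day, and it picks the smaller chunk size where two ranges meet.

```python
def recommend_chunk_size(articles_per_day: float) -> str:
    """Map an estimated daily article volume to a chunk size from the table."""
    if articles_per_day >= 240_000:  # about 10,000+ per hour (assumed conversion)
        return "1h"
    if articles_per_day >= 10_000:   # very broad
        return "6h"
    if articles_per_day >= 3_000:    # broad
        return "1d"
    if articles_per_day >= 1_000:    # moderate
        return "3d"
    if articles_per_day >= 100:      # specific
        return "7d"
    return "30d"                     # very specific


suggested = recommend_chunk_size(5_000)
```

Choosing the smaller size at a boundary trades a few extra requests for a lower risk of a chunk reaching the 10,000-article cap.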
Method parameters
Your search query. Supports AND, OR, NOT operators and advanced syntax.
Starting date for get_all_articles (e.g., "10d" or "2023-03-15").
Ending date for get_all_articles. Defaults to the current time.
Time range for get_all_headlines (e.g., "1d" or "2023-03-15").
Chunk size: "1h", "6h", "1d", "7d", "1m".
Maximum number of articles to retrieve.
Whether to display a progress bar.
Whether to remove duplicate articles.
For async methods only: number of concurrent requests.
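A call that combines several of these parameters might look like the sketch below. The method name get_all_articles and the keywords time_chunk_size and max_articles come from this page; the keyword names q and from_ are assumptions, so check them against the SDK reference before relying on them. The wrapper fetch_articles is hypothetical and accepts any client object that exposes get_all_articles.

```python
def fetch_articles(client, query, start="10d", chunk="1d", limit=50_000):
    """Request articles through a client's get_all_articles method.

    `client` is expected to be an SDK client instance; how it is
    constructed is outside the scope of this sketch.
    """
    return client.get_all_articles(
        q=query,                # assumed keyword name for the search query
        from_=start,            # assumed keyword name for the starting date
        time_chunk_size=chunk,  # interval length, e.g. "1d" or "6h"
        max_articles=limit,     # stop once this many articles are collected
    )
```

Keeping the call behind a small wrapper makes it easy to adjust the chunk size or article cap in one place while you tune a query.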
Common issues and solutions
Rate limiting errors
If you hit rate limits:
- Reduce concurrency (for async methods).
- Add longer delays between requests.
- Break large requests into smaller batches.
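One way to add longer delays between requests is to retry with exponential backoff. The helper below is a generic sketch and not an SDK feature: call_with_backoff and its is_rate_limit predicate are hypothetical, and you would need to adapt the predicate to whatever exception your client raises when it is rate limited.

```python
import time


def call_with_backoff(func, *, retries=5, base_delay=1.0,
                      is_rate_limit=lambda exc: True, sleep=time.sleep):
    """Call func(), waiting base_delay * 2**attempt between failed attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not rate limits.
            if attempt == retries - 1 or not is_rate_limit(exc):
                raise
            sleep(base_delay * 2 ** attempt)
```

Wrap each batch request in a zero-argument callable (for example a lambda) and pass it as func; the sleep argument exists so the delay can be replaced in tests.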
Memory errors
If you run out of memory:
- Reduce the max_articles parameter.
- Process data in smaller batches.
- Save results incrementally as shown in the advanced example.
- Release memory with del and gc.collect().
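As one possible way to process data in batches and save it as you go, the sketch below appends articles to a JSON Lines file. It assumes each article is a JSON-serializable dict; batched and save_incrementally are hypothetical helpers, not SDK functions.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def save_incrementally(articles, path, batch_size=1000):
    """Append articles to a JSON Lines file one batch at a time.

    Returns the number of articles written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        for batch in batched(articles, batch_size):
            for article in batch:
                fh.write(json.dumps(article, ensure_ascii=False) + "\n")
            fh.flush()  # push each batch to disk before reading the next
            written += len(batch)
    return written
```

This helps most when the articles come from a generator or from one interval at a time, so that only a single batch is held in memory.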
Missing results
If results are incomplete:
- Check if your chunk size is appropriate.
- Ensure your date range is correct.
- Verify your query syntax is valid.
- Make sure you’re not hitting the 10,000 limit per chunk.
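To check the last point, you can compare per-interval article counts against the cap. The sketch below assumes you have collected those counts yourself, for example by grouping retrieved articles by date; chunk_counts and find_saturated_chunks are hypothetical names.

```python
API_LIMIT = 10_000  # per-query cap described at the top of this page


def find_saturated_chunks(chunk_counts, limit=API_LIMIT):
    """Return the labels of intervals whose article count reached the cap."""
    return [label for label, count in chunk_counts.items() if count >= limit]


# Example counts per one-day interval (illustrative values only).
saturated = find_saturated_chunks({"2024-01-01": 10_000, "2024-01-02": 4_200})
```

Any interval returned here has probably been truncated, so rerun it with a smaller chunk size.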