import aiohttp
import asyncio
import os
import logging
import json
from typing import Dict, List, Optional, Any
from tqdm.asyncio import tqdm
# Constants
API_KEY: Optional[str] = os.getenv("NEWSCATCHER_API_KEY")
if not API_KEY:
    raise EnvironmentError("NEWSCATCHER_API_KEY environment variable is not set")
URL: str = "https://v3-api.newscatcherapi.com/api/search"
HEADERS: Dict[str, str] = {"x-api-token": API_KEY}
PAYLOAD: Dict[str, Any] = {
"q": "artificial intelligence",
"lang": "en",
"page_size": 100,
}
MAX_CONCURRENT_REQUESTS: int = 5 # Set the desired number of concurrent requests
MAX_RETRIES: int = 3 # Number of retries for failed requests
TIMEOUT: int = 30 # Timeout for each request in seconds
OUTPUT_FILE: str = "fetched_articles.json"
# Configure logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[
logging.FileHandler("fetch_articles.log"), # Log to a file
logging.StreamHandler(), # Also log to console
],
)
async def fetch_page(
session: aiohttp.ClientSession,
url: str,
headers: Dict[str, str],
payload: Dict[str, Any],
timeout: int = TIMEOUT,
max_retries: int = MAX_RETRIES,
) -> Optional[Dict]:
"""Fetch a single page with retry logic."""
retries: int = 0
while retries < max_retries:
try:
async with session.post(
url, headers=headers, json=payload, timeout=timeout
) as response:
response.raise_for_status()
return await response.json()
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            # aiohttp.ClientError covers both response and connection errors;
            # asyncio.TimeoutError covers the per-request timeout.
            retries += 1
            logging.warning(
                f"Request for page {payload.get('page')} failed "
                f"(attempt {retries}/{max_retries}): {e}"
            )
            if retries < max_retries:
                await asyncio.sleep(2**retries)  # Exponential backoff before retrying
    return None  # Explicitly return None if all retries fail
async def fetch_page_with_semaphore(
semaphore: asyncio.Semaphore,
session: aiohttp.ClientSession,
url: str,
headers: Dict[str, str],
payload: Dict[str, Any],
) -> Optional[Dict]:
"""Fetch a page using a semaphore to limit concurrent requests."""
async with semaphore:
return await fetch_page(session, url, headers, payload)
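
# The pagination logic below assumes the search response exposes "total_pages"
# and "articles" at the top level. A rough sketch of the shape this script
# relies on (only the keys it actually reads, not the full v3 schema):
#
#     {
#         "total_pages": 42,
#         "articles": [{"title": "...", "link": "...", ...}, ...]
#     }
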
async def fetch_all_pages_concurrently(
url: str,
headers: Dict[str, str],
initial_payload: Dict[str, Any],
max_concurrent_requests: int = MAX_CONCURRENT_REQUESTS,
) -> List[Dict[str, Any]]:
"""Fetch all pages concurrently using a semaphore to limit the number of requests."""
semaphore: asyncio.Semaphore = asyncio.Semaphore(max_concurrent_requests)
all_articles: List[Dict[str, Any]] = []
async with aiohttp.ClientSession(
connector=aiohttp.TCPConnector(limit=10)
) as session:
# Fetch the first page to determine the total number of pages
async with semaphore:
initial_payload["page"] = 1
data: Optional[Dict[str, Any]] = await fetch_page(
session, url, headers, initial_payload
)
        if not data or "total_pages" not in data or "articles" not in data:
            logging.error("Failed to fetch the first page; aborting pagination.")
            return []
total_pages: int = data["total_pages"]
all_articles.extend(data["articles"])
# Prepare tasks for fetching all remaining pages
        tasks: List[asyncio.Task] = []
        for page in range(2, total_pages + 1):
            payload: Dict[str, Any] = initial_payload.copy()
            payload["page"] = page
            # Retries and per-request timeouts are already handled inside
            # fetch_page, so the tasks are gathered without an extra
            # asyncio.wait_for wrapper (which could cancel a still-retrying
            # request and abort the whole gather).
            tasks.append(
                asyncio.create_task(
                    fetch_page_with_semaphore(semaphore, session, url, headers, payload)
                )
            )
# Execute all tasks concurrently and gather results with tqdm progress bar
results: List[Optional[Dict[str, Any]]] = await tqdm.gather(
*tasks, desc="Fetching pages", unit="page", initial=1, total=total_pages
)
# Process results
for result in results:
if result and "articles" in result:
all_articles.extend(result["articles"])
return all_articles
def main() -> None:
"""Main function to execute pagination and fetch all articles."""
# Fetch all articles
articles: List[Dict[str, Any]] = asyncio.run(
fetch_all_pages_concurrently(URL, HEADERS, PAYLOAD, MAX_CONCURRENT_REQUESTS)
)
# Save fetched data to a JSON file
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
json.dump(articles, f, ensure_ascii=False, indent=2)
logging.info(f"Total articles fetched: {len(articles)}")
logging.info(f"Data saved to {OUTPUT_FILE}")
if __name__ == "__main__":
main()
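
# Example usage (the script filename below is illustrative):
#
#   export NEWSCATCHER_API_KEY="your-api-key"
#   python fetch_articles.py
#
# Progress and errors are logged to fetch_articles.log and to the console;
# the collected articles are written to fetched_articles.json.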