

Overview

Clustering groups articles by semantic similarity, not just keyword overlap. When you enable clustering on a Search or Latest Headlines request, the API returns articles organized into clusters — each cluster containing articles that cover the same story or topic. Use clustering to:
  • Identify how different sources cover the same story.
  • Spot emerging trends across large volumes of articles.
  • Track how a story develops over time as cluster composition changes.
  • Reduce manual organization work when processing high-volume news data.
Clustering is available for all languages supported by News API.

How clustering works

Clustering involves two distinct stages that happen at different points in time: embedding generation, which happens as part of the data processing pipeline before any API request, and cluster formation, which happens at query time when you make a request with clustering enabled.

Embedding generation (offline)

As part of the data processing pipeline, each article is converted into a dense vector, called an embedding, that represents its semantic meaning. These embeddings are computed and stored when the article is indexed, not when you make a request. Articles about the same topic produce embeddings that are close together in vector space, even when they use different words.

Diagram: the offline data pipeline

The embedding model and the fields used to generate embeddings depend on the article’s publication date:
| Date range | Embedding model | Fields used |
| --- | --- | --- |
| Before 2026-01-01 | multilingual-e5-large | Configurable: content, title, or summary |
| From 2026-01-01 onward | Qwen3-Embedding-0.6B | title + content (fixed) |
| Spans both periods | Returns an error | |
Clustering behavior changed on January 1, 2026. For articles published from that date onward, the system uses a new embedding model and clustering algorithm. See the table above and the parameter reference for details.
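The routing rule in the table above can be sketched as a small helper. This function is illustrative and hypothetical, not part of the SDK; only the cutoff date, model names, and the spanning-range error come from the documentation:

```python
from datetime import datetime, timezone

# 2026-01-01 00:00:00 UTC: the embedding-model cutoff from the table above.
CUTOFF = datetime(2026, 1, 1, tzinfo=timezone.utc)

def embedding_model_for_range(from_: datetime, to_: datetime) -> str:
    """Return the embedding model that applies to a publication date range."""
    if to_ < CUTOFF:
        # Pre-2026 articles: fields configurable via clustering_variable
        return "multilingual-e5-large"
    if from_ >= CUTOFF:
        # Post-2026 articles: fields fixed to title + content
        return "Qwen3-Embedding-0.6B"
    # A range spanning the cutoff mixes incompatible embedding spaces
    raise ValueError(
        "Date range spans 2026-01-01; send separate requests for each period."
    )

print(embedding_model_for_range(
    datetime(2026, 4, 1, tzinfo=timezone.utc),
    datetime(2026, 4, 30, tzinfo=timezone.utc),
))  # Qwen3-Embedding-0.6B
```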

Cluster formation (query time)

When you make a request with clustering_enabled=true, the backend service retrieves the pre-computed embeddings for the articles that match your query, then runs the Leiden graph community detection algorithm to group them into clusters:
  1. The cosine similarity between each pair of article embeddings is calculated.
  2. Article pairs whose similarity score exceeds the clustering_threshold are connected as edges in a similarity graph.
  3. The Leiden algorithm detects communities within that graph.
  4. Each detected community becomes a cluster with a unique cluster_id.
Diagram: the Leiden graph community detection process

The Leiden algorithm produces more stable and accurate clusters than the previous density-based method because it optimizes community structure globally rather than locally.
Clustering does not support a date range that spans January 1, 2026. If your from_ date is before 2026-01-01 00:00:00 and your to_ date is after, the API returns an error because the two periods use incompatible embedding spaces. Send separate requests if you need data from both periods.
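The four query-time steps can be sketched offline with toy vectors. For simplicity, this sketch groups articles by connected components of the similarity graph; the real backend runs Leiden community detection on that graph, which additionally optimizes community structure, so treat this as an approximation rather than the production algorithm:

```python
import math
from collections import defaultdict

# Toy pre-computed embeddings: two articles on the same story, one unrelated.
embeddings = {
    "a": [1.0, 0.0, 0.1],
    "b": [0.9, 0.1, 0.2],
    "c": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def form_clusters(embeddings, threshold=0.7):
    ids = list(embeddings)
    # Steps 1-2: connect pairs whose cosine similarity exceeds the threshold
    graph = defaultdict(set)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) > threshold:
                graph[a].add(b)
                graph[b].add(a)
    # Steps 3-4: each connected component becomes one cluster
    clusters, seen = [], set()
    for a in ids:
        if a in seen:
            continue
        stack, component = [a], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(graph[n])
        seen |= component
        clusters.append(sorted(component))
    return clusters

print(form_clusters(embeddings))  # [['a', 'b'], ['c']]
```

Raising the threshold prunes edges from the graph, so clusters can only shrink or split, never merge.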

Parameters

To enable clustering and control its behavior, include the following parameters in your request:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| clustering_enabled | boolean | false | Set to true to enable clustering. |
| clustering_threshold | float | 0.7 | Minimum cosine similarity required for two articles to be placed in the same cluster. Accepts values from 0 to 1. Higher values produce smaller, more tightly related clusters. |
| clustering_variable | string | content | Deprecated from January 1, 2026 onward. For pre-2026 data, specifies which field is used for embeddings: content, title, or summary. For post-2026 data, this parameter is ignored; clustering always uses title + content. |

Choosing a threshold

The clustering_threshold value controls the trade-off between cluster size and topical precision:
| Value | Effect |
| --- | --- |
| 0.6 | Larger clusters; more topically diverse |
| 0.7 | Balanced cluster size and similarity (default) |
| 0.8 | Smaller clusters; more tightly related articles |
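The trade-off can be seen concretely with a toy sweep over the three values. The pairwise similarity scores below are made up for illustration, and a simple union-find over above-threshold pairs stands in for the actual clustering algorithm:

```python
# Hypothetical pairwise cosine similarities between four articles.
similarity = {
    ("a", "b"): 0.85,  # near-duplicate coverage of one story
    ("b", "c"): 0.75,  # related follow-up article
    ("c", "d"): 0.65,  # loosely related background piece
}
articles = ["a", "b", "c", "d"]

def count_clusters(threshold):
    # Union-find: merge every pair whose similarity exceeds the threshold
    parent = {a: a for a in articles}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for (a, b), score in similarity.items():
        if score > threshold:
            parent[find(a)] = find(b)
    return len({find(a) for a in articles})

for threshold in (0.6, 0.7, 0.8):
    print(threshold, count_clusters(threshold))
# 0.6 -> 1 cluster, 0.7 -> 2 clusters, 0.8 -> 3 clusters
```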

Set page size for effective clustering

Clustering operates on one page of results at a time. If related articles are split across multiple pages, they are clustered independently and may end up in separate clusters. To ensure that all related articles are considered together, set page_size to a value greater than or equal to the expected number of results for your query. For example, if your query is likely to return 150 articles, set page_size to at least 150.

Response structure

When clustering is enabled, the API response includes the following fields at the top level:
  • clusters_count: The total number of clusters found.
  • clusters: An array of cluster objects.
Each cluster object contains:
  • cluster_id: A unique identifier for the cluster.
  • cluster_size: The number of articles in the cluster.
  • articles: An array of article objects belonging to this cluster.

Code example

The following example searches for articles about renewable energy with clustering enabled, then prints a summary of each cluster.
clustering.py
import os
import datetime
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key=os.environ["NEWSCATCHER_API_KEY"])

response = client.search.post(
    q="renewable energy",
    lang=["en"],
    from_=datetime.datetime.fromisoformat("2026-04-01 00:00:00+00:00"),
    clustering_enabled=True,
    clustering_threshold=0.7,
    page_size=200,
)

print(f"Found {response.clusters_count} clusters")
for cluster in (response.clusters or []):
    print(f"  Cluster {cluster.cluster_id}: {cluster.cluster_size} articles")
    print(f"    First article: {cluster.articles[0].title}")
The response groups articles into cluster objects, each carrying its own cluster_id, cluster_size, and articles array.

Work with embeddings directly

The same Qwen3 embeddings used for clustering are also available in the API response for you to use in your own pipelines. This is useful when the built-in clustering does not match your use case, or when you need to work with more articles than fit in a single request. Common use cases include:
  • Semantic search over your own article corpus.
  • Recommendation systems.
  • Deduplication with custom similarity thresholds.
  • Topic visualization.
  • Domain-specific clustering algorithms.
Embeddings output requires the v3_nlp_embeddings subscription plan. Qwen3 embeddings are only available for articles indexed from January 1, 2026 onward.
To request embeddings in the response, set include_nlp_data=True. Each article’s embedding is returned in article.nlp.qwen_embedding as an array of 1024 floats.
fetch_embeddings.py
import os
import datetime
import numpy as np
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key=os.environ["NEWSCATCHER_API_KEY"])

response = client.search.post(
    q="artificial intelligence",
    lang=["en"],
    from_=datetime.datetime.fromisoformat("2026-01-01 00:00:00+00:00"),
    page_size=100,
    include_nlp_data=True,
    has_nlp=True,
    embeddings_output="qwen_embedding",
)

articles = response.articles or []

# Build a matrix of embeddings — one row per article
embeddings = np.array(
    [
        getattr(art.nlp, "qwen_embedding")
        for art in articles
        if art.nlp and getattr(art.nlp, "qwen_embedding", None) is not None
    ],
    dtype=np.float32,
)

print(f"Retrieved {len(articles)} articles")
print(f"Embedding matrix shape: {embeddings.shape}")  # (n_articles, 1024)
Once you have the embedding matrix, you can use it in your own pipelines. The following examples use scikit-learn, which is not included in newscatcher-sdk. Install it separately:
pip install scikit-learn
Find articles most similar to a reference article without making additional API calls:
semantic_similarity.py
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of every article against the first one
scores = cosine_similarity(embeddings[0:1], embeddings)[0]
ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

print("Most similar articles:")
for idx, score in ranked[1:6]:
    print(f"  {score:.3f}  {articles[idx].title}")
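Deduplication with a custom threshold, another of the use cases listed above, can be sketched the same way. The 3-dimensional vectors below are stand-ins for the 1024-float qwen_embedding arrays, and the helper is illustrative rather than an SDK feature; it keeps the first article of each near-duplicate group:

```python
import math

# Stricter than the clustering default of 0.7: only near-exact matches drop.
DEDUP_THRESHOLD = 0.95

# Toy (title, embedding) pairs; the first two are near-duplicates.
articles = [
    ("Solar output hits record", [0.9, 0.1, 0.0]),
    ("Record solar output reported", [0.88, 0.12, 0.01]),
    ("Wind farm approved offshore", [0.1, 0.9, 0.2]),
]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def deduplicate(articles, threshold=DEDUP_THRESHOLD):
    kept = []
    for title, emb in articles:
        # Keep the article only if it is not too similar to anything kept so far
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((title, emb))
    return [title for title, _ in kept]

print(deduplicate(articles))
# ['Solar output hits record', 'Wind farm approved offshore']
```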

Clustering vs. deduplication

Clustering and deduplication both help organize large sets of articles, but they serve different purposes:
| | Clustering | Deduplication |
| --- | --- | --- |
| Purpose | Groups related articles | Removes near-identical articles |
| Output | Groups of related articles | A set of unique articles |
| Similarity threshold | Lower; allows broader groupings | Higher; identifies near-exact matches |
| Effect on article count | Retains all articles | Removes duplicates |
| Best for | Trend analysis, multi-source coverage | Removing redundancy, ensuring uniqueness |
For more information, see Articles deduplication.