

Overview

Clustering groups articles by semantic similarity, not just keyword overlap. When you enable clustering on a Search or Latest Headlines request, the API returns articles organized into clusters — each cluster containing articles that cover the same story or topic. Use clustering to:
  • Identify how different sources cover the same story.
  • Spot emerging trends across large volumes of articles.
  • Track how a story develops over time as cluster composition changes.
  • Reduce manual organization work when processing high-volume news data.
Clustering is available for all languages supported by News API.

How clustering works

Clustering involves two distinct stages that happen at different points in time: embedding generation, which happens as part of the data processing pipeline before any API request, and cluster formation, which happens at query time when you make a request with clustering enabled.

Embedding generation (offline)

As part of the data processing pipeline, each article is converted into a dense vector, called an embedding, that represents its semantic meaning. These embeddings are computed and stored when the article is indexed, not when you make a request. Articles about the same topic produce embeddings that are close together in vector space, even when they use different words.

Diagram: the offline data pipeline

The embedding model and the fields used to generate embeddings depend on the article’s publication date:
| Date range | Embedding model | Fields used |
| --- | --- | --- |
| Before 2026-01-01 | multilingual-e5-large | Configurable: content, title, or summary |
| From 2026-01-01 onward | Qwen3-Embedding-0.6B | title + content (fixed) |
| Spans both periods | Returns an error | |
Clustering behavior changed on January 1, 2026. For articles published from that date onward, the system uses a new embedding model and clustering algorithm. See the table above and the parameter reference for details.
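The routing rule in the table above can be sketched as a small helper. This function is illustrative and hypothetical, not part of the SDK; only the cutoff date, model names, and the spanning-range error come from the documentation:

```python
from datetime import datetime, timezone

# 2026-01-01 00:00:00 UTC: the embedding-model cutoff from the table above.
CUTOFF = datetime(2026, 1, 1, tzinfo=timezone.utc)

def embedding_model_for_range(from_: datetime, to_: datetime) -> str:
    """Return the embedding model that applies to a publication date range."""
    if to_ < CUTOFF:
        # Pre-2026 articles: fields configurable via clustering_variable
        return "multilingual-e5-large"
    if from_ >= CUTOFF:
        # Post-2026 articles: fields fixed to title + content
        return "Qwen3-Embedding-0.6B"
    # A range spanning the cutoff mixes incompatible embedding spaces
    raise ValueError(
        "Date range spans 2026-01-01; send separate requests for each period."
    )

print(embedding_model_for_range(
    datetime(2026, 4, 1, tzinfo=timezone.utc),
    datetime(2026, 4, 30, tzinfo=timezone.utc),
))  # Qwen3-Embedding-0.6B
```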

Cluster formation (query time)

When you make a request with clustering_enabled=true, the backend service retrieves the pre-computed embeddings for the articles that match your query, then runs the Leiden graph community detection algorithm to group them into clusters:
  1. The cosine similarity between each pair of article embeddings is calculated.
  2. Article pairs whose similarity score exceeds the clustering_threshold are connected as edges in a similarity graph.
  3. The Leiden algorithm detects communities within that graph.
  4. Each detected community becomes a cluster with a unique cluster_id.
Diagram: the Leiden graph community detection process

The Leiden algorithm produces more stable and accurate clusters than the previous density-based method because it optimizes community structure globally rather than locally.
Clustering does not support a date range that spans January 1, 2026. If your from_ date is before 2026-01-01 00:00:00 and your to_ date is after, the API returns an error because the two periods use incompatible embedding spaces. Send separate requests if you need data from both periods.
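The four query-time steps can be sketched offline with toy vectors. For simplicity, this sketch groups articles by connected components of the similarity graph; the real backend runs Leiden community detection on that graph, which additionally optimizes community structure, so treat this as an approximation rather than the production algorithm:

```python
import math
from collections import defaultdict

# Toy pre-computed embeddings: two articles on the same story, one unrelated.
embeddings = {
    "a": [1.0, 0.0, 0.1],
    "b": [0.9, 0.1, 0.2],
    "c": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def form_clusters(embeddings, threshold=0.7):
    ids = list(embeddings)
    # Steps 1-2: connect pairs whose cosine similarity exceeds the threshold
    graph = defaultdict(set)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) > threshold:
                graph[a].add(b)
                graph[b].add(a)
    # Steps 3-4: each connected component becomes one cluster
    clusters, seen = [], set()
    for a in ids:
        if a in seen:
            continue
        stack, component = [a], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(graph[n])
        seen |= component
        clusters.append(sorted(component))
    return clusters

print(form_clusters(embeddings))  # [['a', 'b'], ['c']]
```

Raising the threshold prunes edges from the graph, so clusters can only shrink or split, never merge.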

Parameters

To enable clustering and control its behavior, include the following parameters in your request:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| clustering_enabled | boolean | false | Set to true to enable clustering. |
| clustering_threshold | float | 0.7 | Minimum cosine similarity required for two articles to be placed in the same cluster. Accepts values from 0 to 1. Higher values produce smaller, more tightly related clusters. |
| clustering_variable | string | content | Deprecated from January 1, 2026 onward. For pre-2026 data, specifies which field is used for embeddings: content, title, or summary. For post-2026 data, this parameter is ignored; clustering always uses title + content. |

Choosing a threshold

The clustering_threshold value controls the trade-off between cluster size and topical precision:
| Value | Effect |
| --- | --- |
| 0.6 | Larger clusters; more topically diverse |
| 0.7 | Balanced cluster size and similarity (default) |
| 0.8 | Smaller clusters; more tightly related articles |
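The trade-off can be seen concretely with a toy sweep over the three values. The pairwise similarity scores below are made up for illustration, and a simple union-find over above-threshold pairs stands in for the actual clustering algorithm:

```python
# Hypothetical pairwise cosine similarities between four articles.
similarity = {
    ("a", "b"): 0.85,  # near-duplicate coverage of one story
    ("b", "c"): 0.75,  # related follow-up article
    ("c", "d"): 0.65,  # loosely related background piece
}
articles = ["a", "b", "c", "d"]

def count_clusters(threshold):
    # Union-find: merge every pair whose similarity exceeds the threshold
    parent = {a: a for a in articles}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for (a, b), score in similarity.items():
        if score > threshold:
            parent[find(a)] = find(b)
    return len({find(a) for a in articles})

for threshold in (0.6, 0.7, 0.8):
    print(threshold, count_clusters(threshold))
# 0.6 -> 1 cluster, 0.7 -> 2 clusters, 0.8 -> 3 clusters
```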

Set page size for effective clustering

Clustering operates on one page of results at a time. If related articles are split across multiple pages, they are clustered independently and may end up in separate clusters. To ensure that all related articles are considered together, set page_size to a value greater than or equal to the expected number of results for your query. For example, if your query is likely to return 150 articles, set page_size to at least 150.

Response structure

When clustering is enabled, the API response includes the following fields at the top level:
  • clusters_count: The total number of clusters found.
  • clusters: An array of cluster objects.
Each cluster object contains:
  • cluster_id: A unique identifier for the cluster.
  • cluster_size: The number of articles in the cluster.
  • articles: An array of article objects belonging to this cluster.

Code example

The following example searches for articles about renewable energy with clustering enabled, then prints a summary of each cluster.
clustering.py
import os
import datetime
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key=os.environ["NEWSCATCHER_API_KEY"])

response = client.search.post(
    q="renewable energy",
    lang=["en"],
    from_=datetime.datetime.fromisoformat("2026-04-01 00:00:00+00:00"),
    clustering_enabled=True,
    clustering_threshold=0.7,
    page_size=200,
)

print(f"Found {response.clusters_count} clusters")
for cluster in (response.clusters or []):
    print(f"  Cluster {cluster.cluster_id}: {cluster.cluster_size} articles")
    print(f"    First article: {cluster.articles[0].title}")
The response groups articles into cluster objects, each carrying its own cluster_id, cluster_size, and articles array.

Work with embeddings directly

The same Qwen3 embeddings used for clustering are also available in the API response for you to use in your own pipelines. This is useful when the built-in clustering does not match your use case, or when you need to work with more articles than fit in a single request. Common use cases include:
  • Semantic search over your own article corpus.
  • Recommendation systems.
  • Deduplication with custom similarity thresholds.
  • Topic visualization.
  • Domain-specific clustering algorithms.
Embeddings output requires the v3_nlp_embeddings subscription plan. Qwen3 embeddings are only available for articles indexed from January 1, 2026 onward.
To request embeddings in the response, set include_nlp_data=True. Each article’s embedding is returned in article.nlp.qwen_embedding as an array of 1024 floats.
fetch_embeddings.py
import os
import datetime
import numpy as np
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key=os.environ["NEWSCATCHER_API_KEY"])

response = client.search.post(
    q="artificial intelligence",
    lang=["en"],
    from_=datetime.datetime.fromisoformat("2026-01-01 00:00:00+00:00"),
    page_size=100,
    include_nlp_data=True,
    has_nlp=True,
    embeddings_output="qwen_embedding",
)

articles = response.articles or []

# Build a matrix of embeddings — one row per article
embeddings = np.array(
    [
        getattr(art.nlp, "qwen_embedding")
        for art in articles
        if art.nlp and getattr(art.nlp, "qwen_embedding", None) is not None
    ],
    dtype=np.float32,
)

print(f"Retrieved {len(articles)} articles")
print(f"Embedding matrix shape: {embeddings.shape}")  # (n_articles, 1024)
Once you have the embedding matrix, you can use it in your own pipelines. The following examples use scikit-learn, which is not included in newscatcher-sdk. Install it separately:
pip install scikit-learn
Find articles most similar to a reference article without making additional API calls:
semantic_similarity.py
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of every article against the first one
scores = cosine_similarity(embeddings[0:1], embeddings)[0]
ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

print("Most similar articles:")
for idx, score in ranked[1:6]:
    print(f"  {score:.3f}  {articles[idx].title}")
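Deduplication with a custom threshold, another of the use cases listed above, can be sketched the same way. The 3-dimensional vectors below are stand-ins for the 1024-float qwen_embedding arrays, and the helper is illustrative rather than an SDK feature; it keeps the first article of each near-duplicate group:

```python
import math

# Stricter than the clustering default of 0.7: only near-exact matches drop.
DEDUP_THRESHOLD = 0.95

# Toy (title, embedding) pairs; the first two are near-duplicates.
articles = [
    ("Solar output hits record", [0.9, 0.1, 0.0]),
    ("Record solar output reported", [0.88, 0.12, 0.01]),
    ("Wind farm approved offshore", [0.1, 0.9, 0.2]),
]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def deduplicate(articles, threshold=DEDUP_THRESHOLD):
    kept = []
    for title, emb in articles:
        # Keep the article only if it is not too similar to anything kept so far
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((title, emb))
    return [title for title, _ in kept]

print(deduplicate(articles))
# ['Solar output hits record', 'Wind farm approved offshore']
```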

Clustering vs. deduplication

Clustering and deduplication both help organize large sets of articles, but they serve different purposes:
| | Clustering | Deduplication |
| --- | --- | --- |
| Purpose | Groups related articles | Removes near-identical articles |
| Output | Groups of related articles | A set of unique articles |
| Similarity threshold | Lower; allows broader groupings | Higher; identifies near-exact matches |
| Effect on article count | Retains all articles | Removes duplicates |
| Best for | Trend analysis, multi-source coverage | Removing redundancy, ensuring uniqueness |
For more information, see Articles deduplication.