Overview
Clustering groups articles by semantic similarity, not just keyword overlap. When you enable clustering on a Search or Latest Headlines request, the API returns articles organized into clusters, each containing articles that cover the same story or topic. Use clustering to:
- Identify how different sources cover the same story.
- Spot emerging trends across large volumes of articles.
- Track how a story develops over time as cluster composition changes.
- Reduce manual organization work when processing high-volume news data.
Clustering is available for all languages supported by News API.
How clustering works
Clustering involves two distinct stages that happen at different points in time: embedding generation, which happens as part of the data processing pipeline before any API request, and cluster formation, which happens at query time when you make a request with clustering enabled.
Embedding generation (offline)
As part of the data processing pipeline, each article is converted into a dense vector — called an embedding — that represents its semantic meaning. These embeddings are computed and stored when the article is indexed, not when you make a request. Articles about the same topic produce embeddings that are close together in vector space, even when they use different words.
| Date range | Embedding model | Fields used |
|---|---|---|
| Before 2026-01-01 | multilingual-e5-large | Configurable: content, title, or summary |
| From 2026-01-01 onward | Qwen3-Embedding-0.6B | title + content (fixed) |
| Spans both periods | — | Returns an error |
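The table can be read as a simple date check. The sketch below is a hypothetical client-side helper (`embedding_model_for` is not part of the API or SDK) that mirrors the table so a query can be split before it reaches the API. It assumes the date range in the table refers to the range your query covers.

```python
from datetime import date

# Cutover date taken from the table above.
CUTOVER = date(2026, 1, 1)


def embedding_model_for(start: date, end: date) -> str:
    """Return the embedding model that applies to a query date range.

    Hypothetical helper: it mirrors the table and makes no API call.
    """
    if start > end:
        raise ValueError("start must not be after end")
    if end < CUTOVER:
        return "multilingual-e5-large"
    if start >= CUTOVER:
        return "Qwen3-Embedding-0.6B"
    # The range straddles the cutover, which the API rejects with an error.
    raise ValueError("date range spans both embedding models; split it at 2026-01-01")
```

A range that raises here can be split into two requests, one ending on 2025-12-31 and one starting on 2026-01-01.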
Cluster formation (query time)
When you make a request with clustering_enabled=true, the backend service retrieves the pre-computed embeddings for the articles that match your query, then runs the Leiden graph community detection algorithm to group them into clusters:
- The cosine similarity between each pair of article embeddings is calculated.
- Article pairs whose similarity score exceeds the clustering_threshold are connected as edges in a similarity graph.
- The Leiden algorithm detects communities within that graph.
- Each detected community becomes a cluster with a unique cluster_id.
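The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the backend implementation: the toy 3-dimensional vectors stand in for real embeddings, and connected components stand in for the Leiden algorithm (Leiden can further split a connected component into several communities). The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_by_threshold(embeddings, threshold=0.7):
    """Group article ids linked by similarity above the threshold.

    Connected components are used here as a simplified stand-in for
    Leiden community detection.
    """
    ids = list(embeddings)
    neighbors = {i: set() for i in ids}
    # Steps 1 and 2: score every pair and keep edges above the threshold.
    for a, b in combinations(ids, 2):
        if cosine_similarity(embeddings[a], embeddings[b]) > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Steps 3 and 4: walk the graph; each component becomes one cluster.
    clusters, seen = [], set()
    for start in ids:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Toy vectors: two articles per story.
toy = {
    "a1": [1.0, 0.0, 0.1],
    "a2": [0.9, 0.1, 0.0],
    "b1": [0.0, 1.0, 0.1],
    "b2": [0.1, 0.9, 0.0],
}
print(cluster_by_threshold(toy))  # [['a1', 'a2'], ['b1', 'b2']]
```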

Parameters
To enable clustering and control its behavior, include the following parameters in your request:
| Parameter | Type | Default | Description |
|---|---|---|---|
| clustering_enabled | boolean | false | Set to true to enable clustering. |
| clustering_threshold | float | 0.7 | Minimum cosine similarity required for two articles to be placed in the same cluster. Accepts values from 0 to 1. Higher values produce smaller, more tightly related clusters. |
| clustering_variable | string | content | Deprecated from January 1, 2026 onward. For pre-2026 data, specifies which field is used for embeddings: content, title, or summary. For data from January 1, 2026 onward, this parameter is ignored; clustering always uses title + content. |
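As a rough illustration, the parameters in the table could be assembled and checked on the client side before sending a request. `clustering_params` is a hypothetical helper, and the `q` key for the search query is an assumption; only the three clustering parameters come from the table.

```python
ALLOWED_VARIABLES = ("content", "title", "summary")


def clustering_params(query, threshold=0.7, variable=None):
    """Build request parameters with clustering enabled.

    Hypothetical helper; "q" as the query key is an assumption.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("clustering_threshold must be between 0 and 1")
    params = {
        "q": query,
        "clustering_enabled": True,
        "clustering_threshold": threshold,
    }
    # clustering_variable only matters for pre-2026 data, so it is optional.
    if variable is not None:
        if variable not in ALLOWED_VARIABLES:
            raise ValueError(f"clustering_variable must be one of {ALLOWED_VARIABLES}")
        params["clustering_variable"] = variable
    return params
```

Leaving `variable` unset keeps the request valid for both date ranges, since the parameter is ignored for newer data.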
Choosing a threshold
The clustering_threshold value controls the trade-off between cluster size and topical precision:
| Value | Effect |
|---|---|
| 0.6 | Larger clusters; more topically diverse |
| 0.7 | Balanced cluster size and similarity (default) |
| 0.8 | Smaller clusters; more tightly related articles |
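To see why a higher threshold yields smaller clusters, consider which article pairs remain connected as the threshold rises. The similarity scores below are made up for illustration; `edges_at` is a hypothetical helper.

```python
# Invented pairwise cosine similarities between four articles.
pair_similarity = {
    ("a", "b"): 0.85,
    ("a", "c"): 0.72,
    ("b", "c"): 0.65,
    ("c", "d"): 0.61,
}


def edges_at(threshold):
    """Return the pairs that stay connected at a given threshold."""
    return sorted(pair for pair, score in pair_similarity.items() if score > threshold)


for threshold in (0.6, 0.7, 0.8):
    print(threshold, edges_at(threshold))
```

At 0.6 all four articles are linked into one group; at 0.7 article d drops out; at 0.8 only a and b remain connected.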
Set page size for effective clustering
Clustering operates on one page of results at a time. If related articles are split across multiple pages, they are clustered independently and may end up in separate clusters. To ensure that all related articles are considered together, set page_size to a value greater than or equal to the expected number of results for your query. For example, if your query is likely to return 150 articles, set page_size to at least 150.
Response structure
When clustering is enabled, the API response includes the following fields at the top level:
- clusters_count: The total number of clusters found.
- clusters: An array of cluster objects, each with the following fields:
  - cluster_id: A unique identifier for the cluster.
  - cluster_size: The number of articles in the cluster.
  - articles: An array of article objects belonging to this cluster.
Code example
The following example searches for articles about renewable energy with clustering enabled, then prints a summary of each cluster.
clustering.py
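A minimal sketch of such a script, using only the standard library rather than newscatcher-sdk. The endpoint URL, the `x-api-token` header, the `q` parameter, and the article `title` field are assumptions to verify against the API reference; the cluster fields follow the response structure described above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the API reference before use.
SEARCH_URL = "https://v3-api.newscatcherapi.com/api/search"


def fetch_clusters(api_key, query, page_size=100, threshold=0.7):
    """Run a search with clustering enabled and return the parsed JSON."""
    params = urllib.parse.urlencode({
        "q": query,  # assumed name of the query parameter
        "clustering_enabled": "true",
        "clustering_threshold": threshold,
        "page_size": page_size,
    })
    request = urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={"x-api-token": api_key},  # assumed auth header
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_clusters(payload):
    """Return one summary line per cluster in a clustered response."""
    lines = [f"clusters: {payload.get('clusters_count', 0)}"]
    for cluster in payload.get("clusters", []):
        titles = [a.get("title", "<untitled>") for a in cluster.get("articles", [])]
        first = titles[0] if titles else "<no articles>"
        lines.append(
            f"{cluster['cluster_id']}: {cluster['cluster_size']} articles, e.g. {first}"
        )
    return lines


# Usage (requires a valid key and network access):
# payload = fetch_clusters("YOUR_API_KEY", "renewable energy", page_size=150)
# print("\n".join(summarize_clusters(payload)))
```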
Work with embeddings directly
The same Qwen3 embeddings used for clustering are also available in the API response for you to use in your own pipelines. This is useful when the built-in clustering does not match your use case, or when you need to work with more articles than fit in a single request. Common use cases include:
- Semantic search over your own article corpus.
- Recommendation systems.
- Deduplication with custom similarity thresholds.
- Topic visualization.
- Domain-specific clustering algorithms.
Embeddings output requires the v3_nlp_embeddings subscription plan. Qwen3 embeddings are only available for articles indexed from January 1, 2026 onward.
To retrieve embeddings, set include_nlp_data=True in your request. Each article's embedding is returned in article.nlp.qwen_embedding as an array of 1024 floats.
fetch_embeddings.py
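Once a response has been fetched with NLP data included, the embeddings can be pulled out of the article objects. The sketch below assumes each article is a dict with an `id` field (an assumption) and a nested `nlp.qwen_embedding` array as described above; `extract_embeddings` is a hypothetical helper.

```python
EXPECTED_DIM = 1024  # embedding length stated in this section


def extract_embeddings(articles):
    """Map article id -> embedding, skipping articles without one."""
    embeddings = {}
    for article in articles:
        vector = (article.get("nlp") or {}).get("qwen_embedding")
        if not vector:
            # For example, an article indexed before 2026-01-01.
            continue
        if len(vector) != EXPECTED_DIM:
            raise ValueError(f"unexpected embedding length: {len(vector)}")
        embeddings[article["id"]] = vector  # "id" is an assumed field name
    return embeddings
```

Skipping articles without an embedding, rather than failing, keeps mixed result sets usable.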
The workflows below rely on scikit-learn, which is not included in newscatcher-sdk. Install it separately with pip install scikit-learn.
- Semantic similarity
- Custom clustering
- Dimensionality reduction
Find articles most similar to a reference article without making additional
API calls:
semantic_similarity.py
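A dependency-free sketch of this ranking, assuming the embeddings have already been collected into a dict of article id to vector. It uses plain Python so it runs without extra packages; for large result sets, `sklearn.metrics.pairwise.cosine_similarity` computes the same scores in vectorized form. `most_similar` is a hypothetical helper and the 2-dimensional vectors are toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(reference_id, embeddings, top_k=3):
    """Rank the other articles by similarity to the reference article."""
    reference = embeddings[reference_id]
    scored = [
        (other_id, cosine_similarity(reference, vector))
        for other_id, vector in embeddings.items()
        if other_id != reference_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors standing in for 1024-dimensional embeddings.
vectors = {
    "ref": [1.0, 0.0],
    "x": [1.0, 0.1],
    "y": [0.0, 1.0],
    "z": [-1.0, 0.0],
}
for article_id, score in most_similar("ref", vectors):
    print(article_id, round(score, 3))
```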
Clustering vs. deduplication
Clustering and deduplication both help organize large sets of articles, but they serve different purposes:
| | Clustering | Deduplication |
|---|---|---|
| Purpose | Groups related articles | Removes near-identical articles |
| Output | Groups of related articles | A set of unique articles |
| Similarity threshold | Lower — allows broader groupings | Higher — identifies near-exact matches |
| Effect on article count | Retains all articles | Removes duplicates |
| Best for | Trend analysis, multi-source coverage | Removing redundancy, ensuring uniqueness |
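The threshold difference in the table can be illustrated with a small deduplication sketch over embeddings. This is not the API's deduplication feature: `deduplicate` is a hypothetical greedy helper, the 0.95 threshold is an illustrative value, and the vectors are toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.95):
    """Keep the first article of each near-identical group.

    Greedy sketch: an article is dropped if it is at least `threshold`
    similar to any article already kept.
    """
    kept = []
    for article_id, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(article_id)
    return kept


articles = {
    "original": [1.0, 0.0, 0.0],
    "reprint": [0.99, 0.01, 0.0],  # near-identical to "original"
    "related": [0.8, 0.6, 0.0],    # same story, different wording
    "other": [0.0, 0.0, 1.0],      # unrelated topic
}
print(deduplicate(articles))  # ['original', 'related', 'other']
```

Here "related" scores 0.8 against "original": high enough to share a cluster at the default clustering threshold of 0.7, but too low to count as a duplicate, so it is retained.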

