The internet is filled with media tricks such as clickbait articles, unnecessarily long texts, false information, and many more that waste people's most valuable resource: time. People need to be informed through concise articles that provide high informational value. This post explores different clustering models paired with summarization algorithms for creating a news aggregator website.
Contents
- General idea
- Scraping the news data
- Clustering algorithms
  - K-Means
  - DBSCAN
  - Mean shift
  - Comparing the clustering algorithms
- Text summarization
  - Extractive algorithms
    - Topic words
    - Text Rank
    - Latent semantic analysis
  - Abstractive algorithms
  - Comparing the summarization algorithms
- Final notes
- References
General idea
The main idea of this aggregation task is to transform how news is presented, so that it is short, informative, and ranked by importance. This can be achieved by showing the most popular clusters of information first and displaying the top 3 sentences that best describe each cluster. The problem can be split into five distinct parts:
- Getting the news from independent sources
- Encoding it into machine-understandable vectors
- Applying clustering algorithms
- Summarizing the text of all articles for each cluster
- Presenting the final data to the user
Scraping the data
I have compiled a list of around 40 RSS URLs of the most popular media sources in Macedonian. The URLs cover local, economic, political, and world news. Each URL on average contains 20 news articles, so in the end there are approximately 800 articles for further processing. The scraping is done with BeautifulSoup4, with regular expressions for cleaning the text. The whole process is very straightforward, so I will not be including the code here, but you can find it on my GitHub.
Clustering algorithms
Clustering algorithms are unsupervised machine learning models tasked with grouping similar data points together. In this news aggregation problem, that would mean grouping news articles that belong to the same topic. Several algorithms can be used for this task.
Before using any specific algorithm, the computer needs to “understand” the news article text. Computers cannot process natural language text as humans do, so it has to be encoded as a vector. Many algorithms and machine learning models do exactly this, for example bag-of-words, TF-IDF, GloVe, Doc2Vec, BERT, and many more. This is the domain of Natural Language Processing (NLP). For simplicity, I am using the Term frequency-inverse document frequency (TF-IDF) algorithm. Although not as advanced as the more complex machine learning models, it is sufficient for this problem.
TF-IDF is composed of two parts:
- Term frequency (TF) - measures the frequency of a term in the text.
- Inverse document frequency (IDF) - measures how important that term is. This allows the algorithm to not give much importance to common words.
The final value is calculated as the product of these two values, so a term scores highly when it appears often in a document but rarely across the whole collection. Implementing this algorithm is straightforward using sklearn.
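A minimal sketch of this step using sklearn's TfidfVectorizer is shown below; the articles list is just a placeholder for the scraped and cleaned article texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder: in the real pipeline these are the ~800 scraped and cleaned articles
articles = [
    "first cleaned news article text ...",
    "second cleaned news article text ...",
]

# Each article becomes a sparse TF-IDF vector; words that are common across
# the whole collection get low weights, rare informative words get high weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(articles)

print(tfidf_matrix.shape)  # (number of articles, vocabulary size)
```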
K-Means
K-Means is one of the most popular clustering algorithms. It aims to group the data points into K clusters. It works by first randomly initializing K centroids and then iteratively recalculating and optimizing their locations.
However, the algorithm has a major problem for this specific news aggregation task: the number of clusters (K) needs to be set explicitly before running the algorithm. On some days there are many more news topics than on others, so we can't know what that number will be in advance. Through testing, I found that on average 40 clusters are ideal for this problem. This configuration successfully clusters most of the popular news topics. It fails on the less popular topics, producing clusters that mix more than one news topic.
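As a rough sketch, assuming the tfidf_matrix produced by the vectorizer above and K=40, the clustering step with sklearn looks like this:

```python
from sklearn.cluster import KMeans

# K has to be fixed in advance; 40 was the best compromise found by testing.
kmeans = KMeans(n_clusters=40, random_state=42)
labels = kmeans.fit_predict(tfidf_matrix)

# labels[i] is the index of the cluster that article i was assigned to.
```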
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is another popular algorithm for clustering data points. It groups data points that are close to each other (by Euclidean distance) and it is very robust to outliers. DBSCAN has two major parameters that affect the clustering:
- eps - the maximum distance between two data points for one to be considered as in the neighborhood of the other
- min_samples - the minimum number of data points required for them to be considered a new cluster
All other data points that are not in the clusters are considered outliers. Unlike K-means, DBSCAN’s number of clusters is dynamic and it doesn't need to be set beforehand. For this specific task, I found it best to have an eps value of 1.05 and a minimum sample size of 3. This produces around 50 clusters, with around 35% of the data points as outliers.
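A minimal sketch with sklearn, again assuming the tfidf_matrix from above:

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=1.05, min_samples=3)
labels = dbscan.fit_predict(tfidf_matrix)

# Articles labelled -1 are outliers that were not assigned to any cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = list(labels).count(-1)
print(n_clusters, n_outliers)
```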
Mean shift
Mean shift is an unsupervised clustering algorithm used mainly in image processing. It uses a “hill-climbing” approach, iteratively shifting the kernels of the clusters towards denser regions until convergence or until the maximum number of iterations is reached. Just like DBSCAN, it does not need the number of clusters in advance.
Mean shift is also a very computationally expensive algorithm. Using it on this text data produced below-average results while taking far longer to process than the other algorithms.
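For completeness, this is roughly how mean shift can be run with sklearn. Note that MeanShift does not accept sparse input, so the TF-IDF matrix has to be densified first, and the quantile value here is an arbitrary assumption:

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Densifying ~800 high-dimensional TF-IDF vectors is part of what makes this slow.
dense_vectors = tfidf_matrix.toarray()

bandwidth = estimate_bandwidth(dense_vectors, quantile=0.2)
mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(dense_vectors)
```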
Comparing the clustering algorithms
K-Means is a versatile algorithm that produces great results. It can be used effectively for news aggregation if only the top N clusters are shown to the user, as less dense clusters tend to contain news from different topics.
Even though DBSCAN is not as widely used as K-Means, it produced better results. Its robustness to outliers and the fact that it does not require a predefined number of clusters make it a better fit for this task.
Mean shift is an algorithm that yields great results for specific tasks like image processing. This algorithm fails to achieve satisfactory results in clustering different topics in news. Additionally, the required processing power and time would make this algorithm unfeasible in many production environments.
For the rest of this project, I decided to use the DBSCAN algorithm: it clusters the news data very well, it does not need the number of clusters in advance, and it is relatively fast.
Text summarization
The task of automatic text summarization aims to shorten the overall length of a document while retaining the most essential information. The goal is that the reader can get the same informational value in a quicker, less biased way. There are two major types of automatic text summarization depending on the output text.
Extractive summarization
Extractive summarization algorithms use sentences that are already in the text to form a summary. The general idea is that sentences are ranked by “importance”, and only the top N sentences are used. How this “importance” is calculated is what differentiates extractive algorithms from one another. There are a few algorithms that can be used.
Topic words
The topic words algorithm is the simplest algorithm for extractive summarization. It calculates the importance of sentences based on how many words of the sentence are in a set of important words for a specific topic.
For example, if I want to summarize a text about climate change using the topic words algorithm I would have a set of common words for this topic like “pollution”, “climate”, “warming” and so on. Sentences can be ranked by the number of words that are in this set, or the density of these words in the sentence.
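A toy sketch of this idea, using a hand-picked (hypothetical) set of topic words and ranking sentences by topic-word density:

```python
topic_words = {"pollution", "climate", "warming", "emissions"}

def topic_word_density(sentence: str) -> float:
    # Fraction of the sentence's words that belong to the topic-word set
    words = sentence.lower().split()
    return sum(w.strip(".,") in topic_words for w in words) / len(words) if words else 0.0

sentences = [
    "Global warming is accelerating the pollution of the oceans.",
    "The summit will be held next year.",
]
ranked = sorted(sentences, key=topic_word_density, reverse=True)
```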
Unfortunately, this algorithm cannot be used for this kind of text aggregation. The topics in this task are dynamic and there is no way of gathering the set of important words in advance.
Text rank
The text rank algorithm is a popular graph-based algorithm that can be used effectively for generating summaries. It is based on the better-known PageRank algorithm used by search engines to decide which web pages rank first.
This algorithm works by constructing a graph where the vertices represent each sentence and the edges represent the content overlap between two sentences. This content overlap value can be calculated by the number of words that are in both sentences, or more commonly, by calculating the cosine similarity of the sentence vectors.
The following code sketches the implementation in Python.
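This is a minimal version of the idea rather than the exact code from the repository: sentences are vectorized with TF-IDF, cosine similarity defines the edge weights, and networkx's PageRank implementation scores the sentences.

```python
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_n=3):
    # Vertices are sentences; edge weights are the content overlap (cosine similarity)
    vectors = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(vectors)
    np.fill_diagonal(similarity, 0.0)  # ignore self-similarity

    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)  # PageRank over the weighted sentence graph

    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]
```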
Additionally, news sources often reuse sentences or write nearly identical ones across multiple stories. It would be a problem if the top three sentences shown to the user were semantically identical, so sentences that are more than 70% similar to an already selected sentence are removed, as shown below.
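A minimal sketch of this filtering step (the 0.7 threshold corresponds to the 70% similarity mentioned above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def remove_similar(sentences, threshold=0.7):
    # Keep a sentence only if it is less than 70% similar to every sentence kept so far
    if not sentences:
        return []
    vectors = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(vectors)

    kept = []
    for i in range(len(sentences)):
        if all(similarity[i, j] < threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in kept]
```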
Latent semantic analysis (LSA)
Latent semantic analysis is a complex algebraic-statistical method that extracts hidden semantics from words and sentences. It finds out which words or phrases are used together and which common words appear in different sentences. A high number of common words indicates high similarity between sentences. Singular Value Decomposition (SVD) is the algebraic method used for finding the relations between these words and sentences.
For the Python implementation, I have used a slightly modified version of this LSA summarization implementation, changing the stopwords parameter because the articles used here are in Macedonian. The additional code for removing duplicates is very similar to the one for the TextRank algorithm, so I will not be including it here, but you can find it on my GitHub.
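For reference, this is a rough sketch of the LSA idea using sklearn's TruncatedSVD, not the implementation referenced above: sentences are scored by how strongly they load on the top latent topics (it assumes at least a handful of input sentences).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_summary(sentences, top_n=3, n_topics=3):
    # Build the term-sentence matrix, then use SVD to uncover latent topics
    vectors = TfidfVectorizer().fit_transform(sentences)
    svd = TruncatedSVD(n_components=min(n_topics, len(sentences) - 1))
    topic_weights = svd.fit_transform(vectors)  # one topic vector per sentence

    # Score each sentence by the magnitude of its topic vector and keep the best ones
    scores = np.linalg.norm(topic_weights, axis=1)
    best = np.argsort(scores)[::-1][:top_n]
    return [sentences[i] for i in sorted(best)]  # preserve original sentence order
```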
Abstractive algorithms
Abstractive summarization algorithms generate new text that is not present in the input text data. These algorithms are part of the more general Seq2Seq models that can be used for machine translation, extending text, and so on. Abstractive models are much more complex than extractive ones.
These models are generally built with complex neural networks, often using LSTMs, and take a long time to train. Beyond the high complexity and processing power required, they also need reference output summaries. This data is used for evaluating and fine-tuning the models. However, for this aggregation task, such data is not available.
Another option is to use more general pre-trained large language models, like GPT-2, RoBERTa and others. These models are usually trained on English text data and very few work with Macedonian.
An exciting new take on text summarization is BLOOM by BigScience. It is a large language model that can also be used for text summarization. Its general purpose is to predict the text that would naturally follow the input. The input text can be fed into it with a final phrase “Summary:”, and the model will continue by writing the summary. I tested it with English data and it managed to generate a great, semantically correct summary. However, even though its training data includes a small amount of Macedonian text, it does not work well with the Macedonian language.
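As an illustration, this is roughly how that prompting can be done with the small bloom-560m checkpoint through Hugging Face transformers; the full BLOOM model is far larger and is usually accessed through a hosted API.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

article_text = "..."  # the (English) article text to be summarized
prompt = article_text + "\n\nSummary:"

# The model simply continues the prompt, which in practice produces a summary
result = generator(prompt, max_new_tokens=80)
print(result[0]["generated_text"])
```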
Comparing the summarization algorithms
Summarization algorithms can be evaluated using metrics like ROUGE, but these metrics require an ideal reference summary for comparison, which is not available here.
The topic words algorithm is a very simple but effective way of generating a summary. However, because the topics are dynamic and change each day, this algorithm cannot be used for this task.
Text Rank and LSA generate nearly identical summaries on the topics and capture the most important information very well. Computationally, the LSA algorithm uses fewer resources and requires less time, so I chose it for the production environment.
Abstractive methods, even though exciting, are very complex and require a lot of data for training. Additionally, the lack of Macedonian text data and models makes it even harder to achieve satisfactory results.
Final notes
In this post, we explored different clustering and summarization algorithms for aggregating news. One area where these summarization models could be improved is collecting ideal reference summaries. Such data would make it possible to evaluate and fine-tune the models, and to use more complex abstractive algorithms.
I have created a website that shows the top clusters of Macedonian news each with three bullet points of summary data (the most important sentences). The website is using DBSCAN clustering with LSA summarization. It is available on this link.
The full code for this article, with the Flask backend and React frontend, can be found on this link.