Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

They also revealed poor performance of neural networks on such tasks
Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the lowest number of errors. The paper has been published in PeerJ Computer Science.
Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.
This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysing financial news can help predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.
Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input and, as output, assigns each document a degree of belonging to each topic. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’
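The step from word frequencies to a degree of belonging can be sketched in a few lines of Python. This is only an illustration, not any of the models compared in the study: the topic names, the word probabilities in `TOPIC_WORD`, and the function `topic_proportions` are hypothetical, and the topics are fixed by hand, whereas a real model would learn them from the collection.

```python
from collections import Counter

# Hypothetical p(word | topic) values; a real topic model learns these from data.
TOPIC_WORD = {
    "science": {"scientists": 0.30, "laboratory": 0.25, "algorithms": 0.25,
                "work": 0.10, "market": 0.10},
    "economy": {"scientists": 0.05, "laboratory": 0.05, "algorithms": 0.10,
                "work": 0.30, "market": 0.50},
}


def topic_proportions(tokens, topic_word, iterations=50):
    """Estimate p(topic | document) for one document with fixed topics.

    Runs a simple expectation-maximisation loop: each word occurrence is
    softly assigned to topics, then the proportions are re-estimated.
    """
    topics = list(topic_word)
    # Keep only words that at least one topic knows about.
    counts = Counter(
        t for t in tokens if any(t in probs for probs in topic_word.values())
    )
    theta = {t: 1.0 / len(topics) for t in topics}  # start from uniform
    total = sum(counts.values())
    if total == 0:
        return theta
    for _ in range(iterations):
        new_theta = {t: 0.0 for t in topics}
        for word, n in counts.items():
            # Weight of each topic for this word under the current proportions.
            weights = {t: theta[t] * topic_word[t].get(word, 1e-12) for t in topics}
            norm = sum(weights.values())
            for t in topics:
                new_theta[t] += n * weights[t] / norm
        theta = {t: new_theta[t] / total for t in topics}
    return theta


document = ["scientists", "laboratory", "analysis", "algorithms", "work"]
print(topic_proportions(document, TOPIC_WORD))
```

The returned dictionary sums to one, so each value can be read as the share of the document attributed to that topic.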
Many words, however, appear in texts on a variety of topics. For example, the word ‘work’ is often used in texts about industrial production or the labour market, but in the phrase ‘scientific work’ it points to the topic of ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.
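In the standard formulation of probabilistic topic models, these probability matrices can be written as follows. The symbols are notation assumed here for illustration, not taken from the paper: \phi_{wt} is the probability of word w in topic t, \theta_{td} is the probability of topic t in document d, and T is the number of topics.

```latex
% Probability of word w in document d as a mixture over T topics
p(w \mid d) = \sum_{t=1}^{T} p(w \mid t)\, p(t \mid d)
            = \sum_{t=1}^{T} \phi_{wt}\, \theta_{td}
```

Under this reading, a word such as ‘work’ can have a noticeable probability in several topics at once, and the document's topic proportions determine which of them dominates.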
Topic models can be enhanced with embeddings: fixed-length vectors that describe an entity, such as a word or a text, along many dimensions. Embeddings carry additional information acquired by training a model on millions of texts.
Any phrase or text, such as this news item, can be represented as a sequence of numbers, that is, a vector in a vector space. The idea is that measuring distances and detecting similarities between such vectors is easy, which allows two or more texts to be compared. If the embeddings of two texts are sufficiently similar, the texts likely belong to the same category or cluster, that is, to the same topic.
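One common way to compare embeddings is cosine similarity. The sketch below assumes that measure and uses made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and the article does not say which similarity measure the models in the study use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of three texts (values invented for illustration).
science_news = [0.9, 0.1, 0.2]
lab_report = [0.8, 0.2, 0.1]
market_report = [0.1, 0.9, 0.3]

# The two science texts should score higher with each other than with the market text.
print(cosine_similarity(science_news, lab_report))
print(cosine_similarity(science_news, market_report))
```

Texts whose vectors point in nearly the same direction score close to 1 and are candidates for the same topic.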
Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models (ETM, GLDAW, GSM, WTM-GMM and W-LDA), which are based on different mathematical principles:
- ETM is a model proposed by the prominent mathematician David M. Blei, who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
- Three models (GSM, WTM-GMM and W-LDA) are neural topic models.
- GLDAW is based on Gibbs sampling and, like the Blei model, builds on latent Dirichlet allocation. It relies on a broader collection of embeddings to determine the association of words with topics.
For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.
‘Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring the amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model,’ says Sergey Koltsov, primary author of the paper and Leading Research Fellow at the Laboratory for Social and Cognitive Informatics.
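The ‘chaos’ in this description can be illustrated with Renyi entropy, the measure named in the study. The sketch below computes the textbook Renyi entropy of a probability distribution; the function name and the choice of `alpha=2.0` are assumptions made here, and the exact formulation used in GLDAW may differ.

```python
import math


def renyi_entropy(probabilities, alpha=2.0):
    """Textbook Renyi entropy: log(sum(p_i ** alpha)) / (1 - alpha)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalised = [p / total for p in probabilities]
    power_sum = sum(p ** alpha for p in normalised if p > 0)
    return math.log(power_sum) / (1 - alpha)


# A flat distribution is maximally 'chaotic'; a peaked one carries more structure.
flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(renyi_entropy(flat), renyi_entropy(peaked))
```

One plausible use, stated here as an assumption, is to compute such a score for models trained with different numbers of topics and compare the results, with lower entropy read as less chaos and therefore more information.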
The researchers evaluated the models on three measures: stability (the number of errors), coherence (how strongly the words within a topic are connected), and Renyi entropy (the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These datasets were chosen because all their texts already carried topic tags, which made it possible to evaluate how well the algorithms identified the topics.
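Stability and coherence can be made concrete with two widely used measures. The sketch below is an assumption about how such scores might be computed, not the paper's exact procedure: stability is taken as the Jaccard overlap of topic top-word lists between two training runs, and coherence as a UMass-style score based on document co-occurrence. All function names and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Share of words that two word lists have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_stability(run_a, run_b):
    """Average, over topics in run_a, of the best Jaccard match found in run_b."""
    if not run_a or not run_b:
        return 0.0
    return sum(max(jaccard(t, other) for other in run_b) for t in run_a) / len(run_a)


def umass_coherence(top_words, documents):
    """UMass-style coherence: sum of log((co-occurrences + 1) / occurrences)."""
    doc_sets = [set(d) for d in documents]
    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        df_earlier = sum(1 for d in doc_sets if earlier in d)
        if df_earlier == 0:
            continue  # skip words that never occur in the corpus
        co = sum(1 for d in doc_sets if earlier in d and later in d)
        score += math.log((co + 1) / df_earlier)
    return score


# Toy corpus, invented for illustration.
docs = [
    ["scientists", "laboratory"],
    ["scientists", "laboratory", "market"],
    ["market", "work"],
]
print(umass_coherence(["scientists", "laboratory"], docs))
print(umass_coherence(["scientists", "work"], docs))
```

Words that often appear in the same documents give a higher coherence score, and two runs that recover the same top words give a stability close to 1.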
The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.
‘We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts,’ adds Sergey Koltsov.
GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.