The hidden meanings of texts and tweets
A ‘spherical’ AI model finds hidden themes in large collections of headlines and other short texts.
From tweets and chat messages to headlines and status updates, short bursts of text are everywhere. These snippets may be brief, but they hold the potential to reveal anything from emerging trends to circulating misinformation, and to inform business decisions.
While AI is becoming increasingly adept at extracting meaning from long-form text, it still struggles with these bite-sized fragments.
“For example, a short text like ‘Great battery life!’ or ‘Apple’ doesn’t provide enough information to determine whether the topic is a phone, a laptop, another device, or even a fruit,” says Jamal Bentahar from Khalifa University’s Department of Computer Science and 6G Research Center.
In traditional AI models designed to process long-form text, meaning is determined by how often words occur together, the relationships between them, and the surrounding context. This approach works poorly for short texts, where words are often omitted, relationships between words are sparse and highly nuanced, and context is largely absent.
“‘Great battery life!’ or ‘Apple’ doesn’t provide enough information to determine whether the topic is a phone, a laptop, another device, or even a fruit.”
Jamal Bentahar
“Short, sparse and fragmented texts like those on social media and news headlines are difficult for traditional AI models to work with because they lack sufficient word co-occurrence and context and suffer from high noise levels,” says Bentahar.
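To illustrate the sparsity Bentahar describes, the minimal sketch below counts word co-occurrences, the kind of signal conventional topic models rely on. The example documents and the counting function are hypothetical, for demonstration only, and are not drawn from the study.

```python
# Illustrative sketch only: counts how often word pairs appear together in the
# same document, the co-occurrence signal conventional topic models rely on.
# The example documents below are made up for demonstration.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count word pairs that appear together within each document."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# A longer passage yields many overlapping word pairs to learn from...
long_doc = ["the new phone has a large screen and the battery easily lasts two days on a single charge"]
# ...while short texts such as headlines or one-word reviews yield almost none.
short_docs = ["great battery life!", "apple"]

print(len(cooccurrence_counts(long_doc)))    # over a hundred word pairs
print(len(cooccurrence_counts(short_docs)))  # just three, and none at all for "apple"
```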
To help solve that problem, Bentahar, together with Hafsa Ennajari and Nizar Bouguila from Concordia University in Montreal, Canada, developed a new AI method that can accurately identify hidden themes in large datasets of text fragments.
Rather than representing words in the flat space used by conventional probabilistic models, Bentahar and his colleagues built their framework around a mathematical construct called a hypersphere, which allows the use of spherical vector arithmetic. This setup lets the system capture not just how often words occur together, but also how they connect to other words with related meanings, forming so-called semantic clusters.
By further integrating contextual information from external sources, their spherical probabilistic topic model, called the Spherical Correlated Topic Model (SCTM), produces more accurate and interpretable topics from short texts than traditional AI models.
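To give a rough sense of what a spherical embedding space looks like, the sketch below normalizes toy word vectors onto a unit hypersphere and compares their directions with cosine similarity to suggest semantic clusters. It is an illustrative simplification built on made-up embeddings, not the authors’ SCTM implementation.

```python
# Illustrative sketch only: shows the intuition behind spherical word embeddings,
# not the authors' SCTM. The toy 3-D vectors below are invented for demonstration.
import numpy as np

def to_hypersphere(vectors):
    """Normalize word vectors to unit length so they lie on a hypersphere."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Hypothetical embeddings (in practice these would come from a pretrained model).
words = ["battery", "charger", "apple", "orange"]
embeddings = np.array([
    [0.9, 0.1, 0.0],   # "battery" - electronics direction
    [0.8, 0.2, 0.1],   # "charger" - electronics direction
    [0.1, 0.9, 0.2],   # "apple"   - ambiguous in real data, fruit-like here
    [0.0, 0.8, 0.3],   # "orange"  - fruit direction
])

unit = to_hypersphere(embeddings)

# On the hypersphere, similarity is the cosine of the angle between directions,
# which for unit vectors is simply their dot product.
similarity = unit @ unit.T

for i, word in enumerate(words):
    # The most similar other word hints at the semantic cluster it belongs to.
    j = np.argsort(similarity[i])[-2]
    print(f"{word:<8} is closest to {words[j]}")
```

Run as written, the toy example groups “battery” with “charger” and “apple” with “orange”, mimicking how directions on the hypersphere can expose clusters of related meaning even when the words rarely co-occur.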
“We were able to create a model that captures topic correlations and integrates external knowledge to produce coherent, interpretable topics,” says Bentahar. “It has an efficient inference algorithm for scalability. It outperforms existing baseline methods on multiple datasets and also maintains stable performance across different numbers of topics, which is crucial for real-world applications.”
Bentahar points out that the SCTM could be a powerful new tool for analyzing social media, detecting trends and topics in real time, and summarizing customer feedback. It could even uncover hidden patterns or themes in the research literature, potentially speeding up discovery.
Reference
Ennajari, H., Bouguila, N. & Bentahar, J. Correlated topic modeling for short texts in spherical embedding spaces. IEEE Trans. Pattern Anal. Mach. Intell. 47, 4567–4578 (2025).