Semantic Hashtag LDA for Clustering Short Texts: Twitter Use Case.

Mir Saman Tajbakhsh, Jamshid Bagherzadeh, Hanif Emamgholizadeh.

Abstract: Topic modelers (TM) cluster texts at an acceptable rate, but they perform poorly on short texts, i.e. texts that are a paragraph long or shorter. Twitter, as a microblog, allows its users to write messages of up to 280 characters, a length that qualifies as short text. Each tweet describes an event, either personal or non-personal, and is therefore a source of information. Consequently, clustering tweets into categories enhances our ability to extract information from them. Social network providers such as Twitter encourage users to add hashtags, words preceded by # that indicate the main topic, in order to facilitate this process. However, hashtags are user-generated content and take a variety of forms even for the same topic; for example, #Rio2016 and #EqualPlayEqualPay were both used during the Rio 2016 Olympics. In this research we use a new topic modeler, Semantic Hashtag LDA (SHLDA), which is applied to larger documents built by combining small documents that share a topic. SHLDA was developed and applied to tweets collected from the Twitter Stream API between 12 Oct 2017 and 21 Oct 2017. Measured by topic coherence and compared with other clustering methods, the results show a significant reduction in error rate (up to -0.02).
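
The idea of combining small documents into larger ones before topic modeling can be illustrated with a minimal sketch. It assumes the simplest possible grouping rule, namely pooling tweets that carry the same hashtag (compared case-insensitively). The abstract does not specify how SHLDA groups semantically related hashtags, so that step is not shown, and the function name `pool_by_hashtag` and the sample tweets are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Merge tweets sharing a hashtag into one pseudo-document per hashtag.

    Tweets without any hashtag are skipped in this sketch.
    """
    pools = defaultdict(list)
    for text in tweets:
        # Lowercase so that '#Rio2016' and '#rio2016' land in the same pool.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        # Remove the hashtags from the body and normalize whitespace.
        body = " ".join(HASHTAG_RE.sub("", text).split())
        for tag in tags:
            pools[tag].append(body)
    # One long document per hashtag, in the order the tweets were seen.
    return {tag: " ".join(parts) for tag, parts in pools.items()}


# Made-up tweets, for illustration only.
tweets = [
    "Opening ceremony tonight #Rio2016",
    "Gold for the relay team #rio2016",
    "Pay should match play #EqualPlayEqualPay",
    "No tags here",
]
pooled = pool_by_hashtag(tweets)
```

Each value in `pooled` is a longer pseudo-document that could then be passed to any LDA implementation. Note that exact-match pooling keeps #Rio2016 and #EqualPlayEqualPay in separate documents even though both relate to the same event, which is the kind of variation the abstract points to.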
