Journal of Chongqing University of Technology (Natural Science) ›› 2023, Vol. 37 ›› Issue (7): 245-255.

• Information and Computer Science •

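Micro-blog text emotion classification based on the fusion of content features and spread features

CHEN Hongyang, HUANG Zhenghong, HE Yingying

  (1. School of Computer Engineering, Chongqing College of Humanities, Science & Technology, Chongqing 401524; 2. College of Information Science and Technology, Chengdu University of Technology, Chengdu 610059)
  • Online: 2023-08-15 Published: 2023-08-15
  • About the authors: CHEN Hongyang, female, associate professor, whose research focuses on natural language processing, Email: 1060344275@qq.com; HUANG Zhenghong, male, professor, whose research focuses on pattern recognition, data mining, and intelligent information processing, Email: 1624586681@qq.com.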

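Abstract (translated from the Chinese): Text vectorization based on Word2vec does not fully account for the content features and spread features of micro-blog text, which leads to poor text vector representations, and emotion classification with a single machine learning algorithm has limited accuracy. This paper proposes fusing content features (emoticons in the text and the semantics, part of speech, and emotion of words) with spread features (the numbers of comments, retweets, and likes) to jointly build text feature vectors rich in semantic and emotional information. Each base classifier is assigned a weight according to its performance on the training data set, and the weight is multiplied by its class probability vector. The maximum, minimum, and average weighted probability values are retained and combined with the original text feature vector as the input to the meta-classifier, which improves the original Stacking algorithm for micro-blog text emotion classification. Experimental results on a micro-blog data set show that the proposed method represents text vectors better and that the weighted Stacking ensemble classifier outperforms single classifiers; compared with other emotion classification methods, the accuracy of the proposed method improves by 1.75% to 4.90%.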

Keywords: micro-blog text, emotion features, part-of-speech features, spread features, emotion classification

Abstract:

Text vector representations based on Word2vec do not fully consider the content features and spread features of micro-blog texts, so the resulting representations are inadequate. In addition, a single machine learning algorithm cannot classify the emotion of micro-blog texts with high accuracy. To improve emotion classification for micro-blog text, this paper proposes a new text vector representation method and combines it with an improved Stacking ensemble learning algorithm.

First, text feature vectors rich in semantic and emotional information are constructed by integrating content features (emoticons and the semantics, part of speech, and emotion of words) with spread features (the numbers of comments, retweets, and likes). Specifically, a feature vector is built for each content feature, and these vectors are spliced into an initial text feature vector based on content characteristics. Next, the influence of the text is computed from its spread features, namely the numbers of comments, retweets, and likes. Finally, the influence of the micro-blog text is combined with the initial text feature vector to further enrich the semantic and emotional information contained in the vector representation.
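The construction described in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the influence formula or say how influence is combined with the content vector, so the log-scaled weighted sum in `influence_score` and the choice to append it as one extra dimension are hypothetical, as are all function and parameter names.

```python
import numpy as np

def influence_score(comments, retweets, likes, weights=(1.0, 1.0, 1.0)):
    """Hypothetical influence measure: weighted sum of log-scaled counts.

    The paper's actual formula is not given in the abstract; this is a stand-in.
    """
    counts = np.array([comments, retweets, likes], dtype=float)
    return float(np.dot(np.asarray(weights, dtype=float), np.log1p(counts)))

def build_text_vector(semantic_vec, pos_vec, emotion_vec, emoticon_vec,
                      comments, retweets, likes):
    """Splice per-feature vectors into one text vector, then add influence.

    semantic_vec : e.g. an averaged Word2vec embedding of the words
    pos_vec      : part-of-speech features
    emotion_vec  : emotion (sentiment lexicon) features
    emoticon_vec : emoticon features
    """
    # Initial text feature vector based on content features only.
    content = np.concatenate([semantic_vec, pos_vec, emotion_vec, emoticon_vec])
    # Assumption: influence is appended as a single extra dimension.
    influence = influence_score(comments, retweets, likes)
    return np.concatenate([content, [influence]])

# Toy usage with made-up feature values.
vec = build_text_vector(
    semantic_vec=np.zeros(4),
    pos_vec=np.ones(3),
    emotion_vec=np.array([0.5]),
    emoticon_vec=np.array([1.0, 0.0]),
    comments=12, retweets=3, likes=40,
)
print(vec.shape)
```

Log scaling is used here only because engagement counts tend to be heavily skewed; any monotone transform could be substituted.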

In the improved Stacking ensemble learning algorithm, four classification algorithms, AdaBoost, random forest, GBDT, and XGBoost, are trained on the initial training data set, and 5-fold cross-validation is used to generate high-performance base classifiers. More importantly, each base classifier outputs a class probability vector instead of a class label. Each base classifier is assigned a weight according to its performance on the training data set, and the weight is multiplied by the class probability vector to obtain a weighted class probability vector. For each text and each category, the maximum, minimum, and average weighted probability values across all base classifiers are retained. The simple and stable logistic regression algorithm is selected as the meta-classifier. Finally, the original Stacking algorithm is improved by feeding these weighted probability values, together with the original text feature vector, to the meta-classifier to classify the emotion of micro-blog text.
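The meta-classifier input described in this paragraph can be sketched as below. It is a minimal illustration, not the authors' implementation: the abstract does not say how weights are derived from base-classifier performance, so normalizing the scores to sum to one is an assumption, and `weighted_meta_features` and its parameters are hypothetical names. The base-classifier probabilities are taken as given; in practice they would come from 5-fold out-of-fold predictions.

```python
import numpy as np

def weighted_meta_features(probas, scores, X):
    """Build meta-classifier inputs from weighted class probabilities.

    probas : array (n_base, n_samples, n_classes), class probability vectors
             predicted by each base classifier
    scores : array (n_base,), performance of each base classifier on the
             training data (for example accuracy)
    X      : array (n_samples, n_features), original text feature vectors
    """
    probas = np.asarray(probas, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Assumption: weights are the scores normalized to sum to 1.
    weights = scores / scores.sum()
    # Multiply each classifier's probability vectors by its weight.
    weighted = weights[:, None, None] * probas
    # Keep max, min, and mean weighted probability per sample and class.
    stats = np.concatenate(
        [weighted.max(axis=0), weighted.min(axis=0), weighted.mean(axis=0)],
        axis=1,
    )
    # Concatenate with the original text feature vectors.
    return np.hstack([np.asarray(X, dtype=float), stats])

# Toy usage: 2 base classifiers, 2 samples, 2 classes, 3 original features.
probas = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.6, 0.4], [0.4, 0.6]]])
meta_X = weighted_meta_features(probas, scores=[0.75, 0.25], X=np.zeros((2, 3)))
print(meta_X.shape)
```

The resulting matrix would then be used to fit the logistic regression meta-classifier. For comparison, scikit-learn's `StackingClassifier` offers a `passthrough` option that likewise feeds the original features to the final estimator, though it does not apply the weighting or the max/min/mean reduction.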

Experimental results on a micro-blog text data set show that the proposed method represents text vectors better and that the Stacking ensemble classifier improved by weighting outperforms a single emotion classifier. Compared with other emotion classification methods, the proposed method improves accuracy by 1.75% to 4.90%.

CLC number:

  • TP391.1