WO2020108430A1 - Microblog sentiment analysis method and system (一种微博情感分析方法及系统) - Google Patents

Microblog sentiment analysis method and system (一种微博情感分析方法及系统)

Info

Publication number
WO2020108430A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
microblog
weibo
sentiment
negative
Prior art date
Application number
PCT/CN2019/120584
Other languages
English (en)
French (fr)
Inventor
李博涵
万朔
王凯
张安曼
关东海
秦小麟
Original Assignee
南京航空航天大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京航空航天大学 filed Critical 南京航空航天大学
Publication of WO2020108430A1 publication Critical patent/WO2020108430A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Definitions

  • the invention relates to the technical field of natural language processing, in particular to a microblog sentiment analysis method and system.
  • the sentiment analysis of Weibo topics aims to explore people's views and attitudes on a topic or event on social networks.
  • the popularity of smart phones has enabled more and more people to access the Internet from mobile terminals and enter social networks.
  • Sina Weibo has more than 150 million daily active users, and the average daily number of Weibo posts reaches 200 million.
  • the massive amount of data in Weibo contains rich real-time information. People can push their life trends and opinions to Weibo, and they can also comment on popular events. These subjective data bring great convenience to the research of sentiment analysis.
  • the real-time and time-series emotional information mining of Weibo can accurately reflect the trend of Weibo topics and provide early warning, which has positive significance for individuals, enterprises and governments.
  • Weibo data is real-time and time-sensitive; only by grasping the timeliness of Weibo information and analyzing the latest topic data can the value of the data be fully exploited.
  • at present, most research on Weibo sentiment analysis is devoted to using deep learning methods to improve the classification performance of sentiment classifiers.
  • most of the datasets used are the English-language Stanford Twitter sentiment analysis dataset, the most typical in the field. There is no large-scale microblog dataset for a specific topic, nor vertical time-series analysis of a specific topic or field. Most studies perform static sentiment analysis on existing datasets, so their timeliness is poor.
  • the purpose of the present invention is to provide a microblog sentiment analysis method and system, which can take into account the accuracy and timeliness of classification, and can accurately reflect the sentiment trend of the topic.
  • the present invention provides the following solutions:
  • a microblog sentiment analysis method includes:
  • each of the target topic data is input to a Weibo sentiment classifier to obtain the sentiment type of each target topic data; the input of the Weibo sentiment classifier is Weibo text data, and its output is a positive Weibo or a negative Weibo. The method for establishing the Weibo sentiment classifier specifically includes:
  • using a general web crawler to collect a number of Weibo text entries as classification training data;
  • obtaining the characteristic emoticon words of the microblog text, where the characteristic emoticon words include positive emoticon words and negative emoticon words;
  • before acquiring the characteristic emoticon words of the microblog text, the method further includes denoising the classification training data to obtain denoised classification training data; the denoising process is described below.
  • before selecting equal amounts of positive microblog data and negative microblog data to form the corpus, the method further includes filtering out any positive microblog data that contains a negative emotion word from the sentiment polarity dictionary, and filtering out any negative microblog data that contains a positive emotion word.
  • before inputting the target topic data into the Weibo sentiment classifier, the method further includes cleaning the target topic data with the Weibo topic constraint model to obtain cleaned target topic data.
  • after the sentiment types are obtained, the method further includes arranging the sentiment type of each target topic data on a time axis according to the release time of the corresponding target topic data.
  • a microblog sentiment analysis system includes:
  • the target topic data collection module is used to collect several microblog text data of the target topic within a preset time period as the target topic data by using the focused web crawler;
  • a sentiment analysis module for inputting each target topic data into the Weibo sentiment classifier to obtain the sentiment type of each target topic data, where the input of the Weibo sentiment classifier is microblog text data and its output is a positive Weibo or a negative Weibo; the subsystem for establishing the Weibo sentiment classifier specifically includes:
  • a classification training data collection module, used to collect a number of Weibo text entries as classification training data using a general web crawler;
  • the characteristic expression word acquisition module is used to obtain characteristic expression words of the microblog text, and the characteristic expression words include positive expression words and negative expression words;
  • a microblog data classification module for classifying the classification training data using the characteristic expression words to obtain positive microblog data and negative microblog data, the positive microblog data being microblog data with positive expression words and the negative microblog data being microblog data with negative expression words;
  • a corpus building module used to select an equal number of positive Weibo data and negative Weibo data to form a corpus
  • a classifier training module is used to train the fastText classifier using the corpus to obtain the Weibo sentiment classifier.
  • the establishment subsystem of the Weibo sentiment classifier further includes:
  • the denoising processing module is used for denoising the classification training data to obtain the denoising processed training data.
  • the denoising processing specifically includes:
  • the establishment subsystem of the Weibo sentiment classifier further includes:
  • the first judgment module is used to judge whether there is a negative emotion word in the emotion polarity dictionary in the positive microblog data to obtain a first judgment result;
  • a first filtering module configured to filter out positive microblog data with negative emotion words when the first judgment result indicates that negative emotional words in an emotional polarity dictionary exist in the positive microblog data;
  • a second judgment module used to judge whether there is a positive emotion word in the emotion polarity dictionary in the negative microblog data, to obtain a second judgment result
  • the second filtering module is configured to filter out negative microblog data with positive emotion words when the second judgment result indicates that there are positive emotion words in the emotional polarity dictionary in the negative microblog data.
  • the Weibo sentiment analysis system further includes:
  • the constraint training data selection module is used to randomly select microblog text data equal in number to the target topic data as the constraint training data;
  • a constraint model determination module for training the fastText classifier using the constraint training data to obtain a Weibo topic constraint model
  • the irrelevant topic cleaning module is configured to perform irrelevant topic cleaning on the target topic data using the Weibo topic constraint model to obtain the cleaned target topic data.
  • the Weibo sentiment analysis system further includes:
  • the timing analysis module is used to arrange the sentiment types of the target topic data on the time axis according to the release time of the corresponding target topic data.
  • the present invention discloses the following technical effects:
  • the microblog sentiment analysis method and system provided by the present invention use a focused web crawler to collect a number of microblog text entries of a target topic within a preset time period as target topic data, and input each target topic data into a microblog sentiment classifier to obtain the sentiment type of each target topic data.
  • the invention adopts a weakly supervised learning method based on emoticon words and emotion words to filter emotional microblogs, selects equal numbers of positive and negative microblog data to construct a million-scale Chinese microblog corpus, and uses this corpus to train a fastText classifier; the resulting microblog sentiment classifier takes into account both the accuracy and the timeliness of classification and can accurately reflect the sentiment trend of a topic.
  • FIG. 1 is a flowchart of a microblog sentiment analysis method provided by an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for establishing a microblog sentiment classifier provided by an embodiment of the present invention
  • FIG. 3 is a structural block diagram of a microblog sentiment analysis system provided by an embodiment of the present invention.
  • FIG. 4 is a structural block diagram of the establishment subsystem of a microblog sentiment classifier provided by an embodiment of the present invention;
  • FIG. 5 is a flowchart of data denoising processing provided by an embodiment of the present invention;
  • FIG. 6 is a diagram of data denoising processing results provided by an embodiment of the present invention;
  • FIG. 7 is an overall framework diagram of a microblog sentiment classifier provided by an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of time series sentiment analysis using days as the time granularity provided by an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of time series sentiment analysis using hour as a time granularity according to an embodiment of the present invention.
  • the purpose of the present invention is to provide a microblog sentiment analysis method and system, which can take into account the accuracy and timeliness of classification, and can accurately reflect the sentiment trend of the topic.
  • FIG. 1 is a flowchart of a microblog sentiment analysis method provided by an embodiment of the present invention. As shown in FIG. 1, a microblog sentiment analysis method, the analysis method includes:
  • Step 101 Use focused web crawlers to collect several microblog text data of a target topic within a preset time period as target topic data.
  • the focused web crawler focuses on a target topic and acquires Weibo texts of that topic within a specific time period; it can collect both the historical microblog data of the topic and the topic's real-time data on the current day.
  • the collected data is used for time-series, real-time sentiment analysis of vertical topics.
  • Step 102: Input each target topic data into the Weibo sentiment classifier to obtain the sentiment type of each target topic data; the input of the Weibo sentiment classifier is Weibo text data, and its output is a positive Weibo or a negative Weibo.
  • preferably, before step 102 (inputting each of the target topic data into the Weibo sentiment classifier), the method further includes denoising the target topic data to obtain denoised target topic data. The denoising process includes: filtering out emoticons (kaomoji) and symbols in the Weibo text data; matching and filtering Uniform Resource Locator (URL) links and e-mail addresses with regular expressions; and filtering out Weibo text data whose character length is less than a set threshold.
  • preferably, before step 102 the method further includes: randomly selecting, from the classification training data, microblog text data equal in number to the target topic data as constraint training data; training the fastText classifier with the constraint training data to obtain a Weibo topic constraint model; and using the Weibo topic constraint model to clean topic-irrelevant microblogs out of the target topic data.
  • the Weibo topic constraint model is in effect a binary classification model: it classifies the target topic data into topic-related microblogs and topic-unrelated microblogs, and filters out the noise of the topic-unrelated ones, as sketched below.
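  • a minimal sketch of how such a topic constraint model could be trained and applied with the fastText Python package is given below; the file name, label names, hyperparameters other than the word-vector dimension, and the use of jieba for segmentation are illustrative assumptions, while the 0.6 relevance threshold follows the description:
import fasttext  # the fastText library named in the description
import jieba     # word segmentation tool mentioned in the description

def segment(text):
    # whitespace-join jieba tokens so fastText can consume one microblog per line
    return " ".join(jieba.lcut(text))

def build_constraint_training_file(topic_weibos, irrelevant_weibos, path="topic_train.txt"):
    # fastText supervised format: "__label__<class> <segmented text>"
    with open(path, "w", encoding="utf-8") as f:
        for text in topic_weibos:
            f.write("__label__topic " + segment(text) + "\n")
        for text in irrelevant_weibos:
            f.write("__label__irrelevant " + segment(text) + "\n")
    return path

def train_constraint_model(train_path):
    return fasttext.train_supervised(input=train_path, dim=300, epoch=10, wordNgrams=2)

def filter_topic_related(model, weibos, threshold=0.6):
    # keep microblogs whose probability of belonging to the topic is at least 0.6
    related = []
    for text in weibos:
        labels, probs = model.predict(segment(text))
        if labels[0] == "__label__topic" and probs[0] >= threshold:
            related.append(text)
    return related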
  • after step 102 (inputting each target topic data into the Weibo sentiment classifier and obtaining the sentiment type of each target topic data), the method further includes arranging the sentiment type of each target topic data on the time axis according to the release time of the corresponding target topic data, which facilitates time-series analysis of the classification results. Displaying the classification results on the time axis enables sentiment analysis at different time granularities such as days and hours, so as to understand how the sentiment of a topic changes over time; a minimal aggregation sketch follows.
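  • as a minimal sketch of this arrangement (the pandas-based aggregation and the column names are assumptions, not part of the original disclosure), the classified results can be bucketed by day or hour as follows:
import pandas as pd

def sentiment_timeline(records, freq="D"):
    # records: iterable of (release_time, sentiment) pairs, sentiment being
    # "positive" or "negative"; freq="D" for day granularity, "H" for hours
    df = pd.DataFrame(records, columns=["time", "sentiment"])
    df["time"] = pd.to_datetime(df["time"])
    counts = (df.groupby([pd.Grouper(key="time", freq=freq), "sentiment"])
                .size()
                .unstack(fill_value=0))
    return counts  # one row per time bucket, one column per sentiment type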
  • FIG. 2 is a flowchart of a method for establishing a microblog sentiment classifier provided by an embodiment of the present invention. As shown in FIG. 2, the establishment method of the Weibo sentiment classifier specifically includes:
  • Step 201 Use a general web crawler to collect several microblog text data as classification training data.
  • a general web crawler is used to collect a large amount of microblog text data; it uses multi-threading and proxy technology to achieve highly concurrent crawling of 580,000 microblog posts per day (an illustrative crawler sketch follows). The collected classification training data is used to train the sentiment classifier.
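  • purely as an illustration of the multi-threaded, proxy-based collection described here (the endpoint, parameters and proxy pool below are hypothetical placeholders, not the real Weibo interface), a high-concurrency fetch loop might look like this:
import requests
from concurrent.futures import ThreadPoolExecutor

PROXY_POOL = [  # hypothetical proxy pool; real proxies would be configured here
    {"https": "http://proxy1.example:8080"},
    {"https": "http://proxy2.example:8080"},
]

def fetch_page(page_id):
    proxy = PROXY_POOL[page_id % len(PROXY_POOL)]          # rotate proxies
    resp = requests.get("https://example.com/weibo_feed",  # placeholder endpoint
                        params={"page": page_id}, proxies=proxy, timeout=10)
    resp.raise_for_status()
    return resp.text

def crawl(page_ids, workers=32):
    # a thread pool gives the high-concurrency crawling described above
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_page, page_ids))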
  • Step 202 Obtain the characteristic expression words of the microblog text, and the characteristic expression words include positive expression words and negative expression words.
  • Micro-blogs are classified using characteristic expression words with strong emotional colors. Micro-blogs with positive expressions are classified as positive micro-blogs, and micro-blogs with negative expressions are classified as negative micro-blogs.
  • Step 203 Use the characteristic expression words to classify the classification training data to obtain positive microblog data and negative microblog data.
  • the positive microblog data is microblog data with positive expression words, and the negative microblog data is microblog data with negative expression words.
  • Step 204 Select equal amounts of positive microblog data and negative microblog data to form a corpus.
  • in this embodiment, a weakly supervised learning method is used to extract 4.2 million positive-sentiment microblogs and 680,000 negative-sentiment microblogs from the dataset. Microblogs equal in number to the negative microblogs are then randomly selected from the positive-sentiment set; together they constitute the Chinese Weibo sentiment analysis corpus weibo_sentiment_corpus, which is used in the next step to train the sentiment classifier (a sampling sketch follows).
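  • a minimal sampling sketch, assuming the positive and negative microblogs are held in Python lists (variable names are illustrative):
import random

def build_balanced_corpus(positive_set, negative_set, seed=42):
    random.seed(seed)
    # draw as many positive microblogs as there are negative ones (680,000 each
    # in this embodiment) so that neither class dominates the corpus
    positive_sample = random.sample(list(positive_set), k=len(negative_set))
    corpus = [(text, "positive") for text in positive_sample] \
           + [(text, "negative") for text in negative_set]
    random.shuffle(corpus)
    return corpus  # weibo_sentiment_corpus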
  • to prevent the classifier from selecting emoticon words as features and assigning them large weights during training, this embodiment strips the emoticon words from every microblog in the corpus; stop words, by contrast, are not cleaned out, because in distributed word-vector training each word vector is produced from its context, and stop words still provide useful contextual information.
  • Step 205 Use the corpus to train the fastText classifier to obtain the Weibo sentiment classifier.
  • in this embodiment, 80% of the corpus is used as the training set and 20% as the test set; the fastText classifier with the highest final classification accuracy uses 300-dimensional word vectors and reaches an accuracy of 92.2%.
  • the accuracy of the classifier can be further improved by increasing the dimension of the word vector.
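  • a training sketch with the fastText Python package is shown below; the 300-dimensional word vectors and the 80/20 split follow the description (the __label__ prefix convention is also mentioned later in the original description), while the file names, jieba segmentation and remaining hyperparameters are assumptions:
import random
import fasttext
import jieba

def write_fasttext_file(corpus, path):
    # corpus: list of (text, "positive"/"negative") pairs with emoticon words already stripped
    with open(path, "w", encoding="utf-8") as f:
        for text, label in corpus:
            f.write("__label__" + label + " " + " ".join(jieba.lcut(text)) + "\n")

def train_sentiment_classifier(corpus):
    random.shuffle(corpus)
    split = int(0.8 * len(corpus))           # 80% training set, 20% test set
    write_fasttext_file(corpus[:split], "senti_train.txt")
    write_fasttext_file(corpus[split:], "senti_test.txt")
    model = fasttext.train_supervised(input="senti_train.txt",
                                      dim=300, epoch=10, wordNgrams=2)
    n, precision, recall = model.test("senti_test.txt")  # held-out evaluation
    return model, precision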
  • preferably, before step 202 (obtaining the characteristic emoticon words of the microblog text), the method further includes denoising the classification training data to obtain denoised classification training data. The denoising process specifically includes: filtering out the emoticons (kaomoji) and symbols in the Weibo text data; matching and filtering Uniform Resource Locator (URL) links and e-mail addresses with regular expressions; and filtering out microblog text data whose character length is less than the set threshold.
  • before step 204 (selecting equal amounts of positive microblog data and negative microblog data to form the corpus), the method further includes filtering out positive microblog data that contains a negative emotion word from the sentiment polarity dictionary, and filtering out negative microblog data that contains a positive emotion word.
  • in order to extract microblogs with clear emotional color and no emoticon ambiguity, the characteristic emoticon words with strong emotional color are manually selected, as shown in Table 1, which contains 18 typical negative emoticon words and 37 typical positive emoticon words.
  • the NTUSD sentiment dictionary is also used as a double filter for Weibo. If a Weibo contains emotion words that are different from the emotion color of the emoticon, it will also be filtered out.
  • for a microblog to be classified as an emotional microblog, it must first contain a characteristic emoticon word; the regular expression \[[a-zA-Z\u4e00-\u9fff]{1,5}\] is used to extract all emoticon words from the text. If the microblog text contains emoticon words of only one type, for example only positive emoticon words, it is then checked whether the other words in the microblog intersect with the negative emotion words in the sentiment dictionary; if it contains no negative emotion words, the microblog is classified as a positive microblog.
  • the algorithm of the entire filtering process is as follows,
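  • the pseudocode for this filter is given in the original-language description below; a runnable Python rendering is sketched here (the dictionary loading is left as a placeholder, and the emoticon-extraction regular expression is the one stated above):
import re

EMOJI_PATTERN = re.compile(r"\[[a-zA-Z\u4e00-\u9fff]{1,5}\]")  # e.g. matches "[哈哈]"

def extract_emoticon_words(raw_text):
    return set(EMOJI_PATTERN.findall(raw_text))

def filter_sentiment_weibos(weibos, pos_emotions, neg_emotions, pos_emojis, neg_emojis):
    # weibos: iterable of segmented word lists; pos_emotions/neg_emotions come from
    # the NTUSD dictionary, pos_emojis/neg_emojis from the hand-picked emoticon words
    pos_set, neg_set = [], []
    for weibo in weibos:
        words = set(weibo)  # de-duplicate the words of one microblog
        if (words & pos_emojis) and not (words & neg_emotions) and not (words & neg_emojis):
            pos_set.append(weibo)   # positive emoticon word and no negative signal
        elif (words & neg_emojis) and not (words & pos_emotions) and not (words & pos_emojis):
            neg_set.append(weibo)   # negative emoticon word and no positive signal
    return pos_set, neg_set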
  • FIG. 3 is a structural block diagram of a microblog sentiment analysis system provided by an embodiment of the present invention. As shown in FIG. 3, a microblog sentiment analysis system, the analysis system includes:
  • the target topic data collection module 301 is configured to use focused web crawlers to collect several microblog text data of the target topic within a preset time period as target topic data.
  • the sentiment analysis module 302 is used to input each target topic data into a Weibo sentiment classifier to obtain the sentiment type of each target topic data.
  • the input of the Weibo sentiment classifier is Weibo text data, and the output of the Weibo sentiment classifier is a positive Weibo or a negative Weibo.
  • the Weibo sentiment analysis system further includes:
  • the constraint training data selection module is used to randomly select microblog text data equal in number to the target topic data as the constraint training data;
  • a constraint model determination module for training the fastText classifier using the constraint training data to obtain a Weibo topic constraint model
  • the irrelevant topic cleaning module is configured to perform irrelevant topic cleaning on the target topic data using the Weibo topic constraint model to obtain the cleaned target topic data.
  • the Weibo sentiment analysis system further includes:
  • the timing analysis module is used to arrange the sentiment types of the target topic data on the time axis according to the release time of the corresponding target topic data.
  • FIG. 4 is a structural block diagram of the establishment subsystem of the microblog sentiment classifier provided by an embodiment of the present invention. As shown in FIG. 4, the establishment subsystem of the Weibo sentiment classifier includes:
  • the classification training data collection module 401 is used for collecting several microblog text data as classification training data by using a general web crawler.
  • the characteristic expression word obtaining module 402 is used to obtain characteristic expression words of the microblog text, and the characteristic expression words include positive expression words and negative expression words;
  • the microblog data classification module 403 is used to classify the classification training data by using the characteristic expression words to obtain positive microblog data and negative microblog data.
  • the positive microblog data is microblog data with positive expression words, and the negative microblog data is microblog data with negative expression words;
  • a corpus construction module 404 used to select an equal number of positive Weibo data and negative Weibo data to form a corpus
  • a classifier training module 405 is used to train the fastText classifier using the corpus to obtain the Weibo sentiment classifier.
  • the establishment subsystem of the Weibo sentiment classifier further includes:
  • the denoising processing module is used for denoising the classification training data to obtain the denoising processed training data.
  • the denoising processing specifically includes:
  • the establishment subsystem of the Weibo sentiment classifier further includes:
  • the first judgment module is used to judge whether there is a negative emotion word in the emotion polarity dictionary in the positive microblog data to obtain a first judgment result;
  • a first filtering module configured to filter out positive microblog data with negative emotion words when the first judgment result indicates that negative emotional words in an emotional polarity dictionary exist in the positive microblog data;
  • a second judgment module used to judge whether there is a positive emotion word in the emotion polarity dictionary in the negative microblog data, and obtain a second judgment result
  • the second filtering module is configured to filter out negative microblog data with positive emotion words when the second judgment result indicates that there are positive emotion words in the emotional polarity dictionary in the negative microblog data.
  • the implementation process of the microblog sentiment analysis system provided by the present invention is as follows:
  • the Weibo Focus Crawler combines the Weibo application program interface to obtain real-time data and historical data of Weibo on a specific topic as target topic data.
  • the target topic data contains the time information published by each Weibo, which is used for timing analysis later.
  • S301: Use regular expressions to match the most common @ and # symbols in Weibo, clean out the usernames attached to @, and filter out all tags enclosed between # and #;
  • S302: Use regular expressions to match and filter the URL links and e-mail addresses in the Weibo text. Statistics show that 670,000 of the 35 million Weibo texts collected contain URL links or e-mail addresses, that is, on average two of every 100 entries need to be cleaned;
  • S303 Disassemble the common emoticons on the network to obtain a dictionary of special characters, and use the dictionary to filter the special characters in the microblog text;
  • the cleaning of Weibo text has the following four steps: Weibo specific attribute cleaning, URL link and mailbox cleaning, special character cleaning and short Weibo cleaning.
  • the source data has a total size of 6.34GB.
  • Figure 6 shows the size of the remaining data after each cleaning step. Sina Weibo contains a large number of @ and ## symbols, used to mention someone or to tag a microblog. These platform-specific attributes introduce noise into classifier training, and since the lifting of the Weibo length limit a single microblog may contain multiple tags; if the tags are not removed, they would be assigned excessive weight in later classifier training.
  • the last pre-processing step is the filtering of short microblogs. After the cleaning steps above, many microblog texts become shorter; microblogs with a character length of less than 5 are treated as invalid, with one Chinese character counted as one character. After filtering, 2.28 million invalid short microblogs were removed, leaving 33.48 million valid microblogs in the dataset with a total size of 5.21 GB. A sketch of the whole cleaning pipeline follows.
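  • a sketch of this four-step cleaning pipeline (the regular expressions and the special-character set are illustrative assumptions; the length threshold of 5 and the jieba segmentation follow the description):
import re
import jieba

URL_RE     = re.compile(r"https?://\S+")
EMAIL_RE   = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@[^\s:：,，]+")   # @username
HASHTAG_RE = re.compile(r"#[^#]+#")          # #hashtag#
SPECIAL_CHARS = set("╯□╰ノ︵┻━✧◕‿")          # placeholder kaomoji character dictionary

def clean_weibo(text, min_len=5):
    text = URL_RE.sub("", text)       # URLs and e-mails first, so addresses stay intact
    text = EMAIL_RE.sub("", text)
    text = MENTION_RE.sub("", text)   # Weibo-specific attributes: mentions and hashtags
    text = HASHTAG_RE.sub("", text)
    text = "".join(ch for ch in text if ch not in SPECIAL_CHARS)
    text = re.sub(r"\s+", " ", text).strip()   # drop redundant whitespace
    if len(text) < min_len:           # one Chinese character counts as one character
        return None                   # invalid short microblog
    return " ".join(jieba.lcut(text)) # segment with jieba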
  • microblog characteristic emoticon words and the sentiment polarity dictionary are then used to perform weakly supervised learning on the data collected by the general web crawler, filtering out microblogs with strong emotional color to serve as the corpus for microblog sentiment analysis.
  • the microblog emoticon words in step S4 are positive and negative emoticon words with strong emotional color, and the NTUSD dictionary is used as the sentiment polarity dictionary.
  • the final filtered weibo_sentiment_corpus retains stop words and does not follow the stop-word filtering step of traditional text cleaning; experiments show that the classifier trained on the set with stop words is 0.4 percentage points more accurate than the classifier trained without stop words.
  • the microblog text in the corpus has had its emoticon words stripped out, which prevents emoticon words from being given excessive weight during classifier training and degrading classification accuracy.
  • in the training of the Weibo sentiment classifier, 80% of the microblogs in weibo_sentiment_corpus are selected as the training set and 20% as the test set; according to the results on the test set, the classification accuracy of the classifier reaches 92.2%.
  • step S7: Use the sentiment classifier generated in step S5 to classify the Weibo topic data filtered in step S6, including:
  • S701: For a specific target topic, collect the latest microblog data every 10 minutes;
  • S702: For the latest data collected in S701, use the microblog topic constraint model of step S6 to filter out noise microblogs with unrelated topics;
  • S703: For the target topic microblogs obtained by the filtering in S702, clean them using step S3 and store them in the database;
  • S704: For the data cleaned in S703, classify it with the Weibo sentiment classifier trained in step S5, and synchronize the classification results to the database in chronological order.
  • step S8 dynamically displaying the classification results in step S7 in chronological order, thereby realizing real-time sequential sentiment analysis of specific topics.
  • the time series sentiment analysis on the specific topic of Weibo in step S8 belongs to the analysis of the application layer. Read all the classification results of the target topic Weibo from the database, and then draw the emotion classification results into a graph in chronological order.
  • the horizontal axis of the graph is the time axis and the vertical axis is the number of microblogs. There are two curves in the graph: the curve above the horizontal axis represents the degree of positive sentiment of the topic over time, and the curve below it represents the degree of negative sentiment of the topic over time (a plotting sketch follows).
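  • a short matplotlib sketch of this two-curve chart (illustrative only; it assumes the day/hour counts produced by a routine such as the aggregation sketch earlier):
import matplotlib.pyplot as plt

def plot_sentiment_timeline(counts):
    # counts: table indexed by time with "positive" and "negative" columns
    fig, ax = plt.subplots()
    ax.plot(counts.index, counts["positive"], label="positive")
    ax.plot(counts.index, -counts["negative"], label="negative")  # mirrored below the axis
    ax.axhline(0, linewidth=0.8)
    ax.set_xlabel("time")
    ax.set_ylabel("number of microblogs")
    ax.legend()
    return fig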
  • the microblog sentiment classifier obtained by the microblog sentiment analysis method proposed by the present invention is divided into four layers, namely: a data acquisition and preprocessing layer, a model layer, a data storage layer, and an application layer.
  • the general web crawler and the focused web crawler are responsible for data collection. After a short period of storage, the collected data is preprocessed.
  • in this layer, the general crawler collected a total of 35 million Weibo posts with a total file size of 6.34 GB. The focused web crawler can collect historical and real-time data on any particular topic. This embodiment selects a popular topic of April 2018, the ZTE crisis, as an example, and uses the focused web crawler to collect 38,000 historical Weibo texts related to the ZTE topic from January 1, 2018 to May 1, 2018.
  • the data set generation model uses a weakly supervised learning method based on feature expression words and NTUSD sentiment dictionary, filtering 4.2 million positive emotion microblogs and 680,000 negative emotion microblogs. Since the overall sentiment of Weibo is more positive, and the number of positive feature expressions is twice that of negative expressions, the number of finally extracted emotional positive Weibos is much larger than that of negative Weibo.
  • to prevent a microblog from being assigned a larger prior probability of positive sentiment during the training of the sentiment classifier, microblogs equal in number to the negative microblogs are randomly selected from the positive-sentiment set; together they constitute the Chinese Weibo sentiment analysis corpus used to train the sentiment classifier.
  • the relevance score α lies in the range 0 ≤ α ≤ 1; a relevance threshold is set, and if α ≥ 0.6 the microblog is regarded as topic-related.
  • the fundamental reason traditional rule-based and statistical models remove stop words is that they only learn the symbolic meaning of words: every pair of words is treated as independent, and the semantics of a word is not inferred from its context.
  • the present invention is based on a large-scale data set. If the topic is narrowed to a specific field, the classification performance of the classifier should be further improved.
  • the present invention analyzes ZTE’s topic Weibo in detail.
  • the focused web crawler collected 38,000 ZTE-topic microblogs, an average of 310 microblogs per day and 13 per hour, which is sufficient to support time-series sentiment analysis at day and hour granularity. Another 38,000 microblogs, very likely unrelated to the ZTE topic, were randomly sampled from the crawled microblog data; as the unrelated-topic portion of the training set they were fed into fastText together with the 38,000 ZTE-topic microblogs to train the topic constraint model. After training, the 38,000 ZTE microblogs were re-classified: if the relevance to the ZTE topic is greater than or equal to 60%, that is, the probability of belonging to the ZTE topic is at least 60%, the microblog is treated as topic-related.
  • the above analysis demonstrates the feasibility of vertical time-series analysis of Weibo topics. Unlike the popularity curves of the Baidu Index and the Weibo Micro Index, the analysis method proposed by the present invention not only takes the popularity of topic events into account but also realizes dynamic time-series analysis of positive and negative sentiment, which can quickly and intuitively reflect the sentiment changes of a topic. This also proves, from an application perspective, the effectiveness and practicality of the microblog sentiment classifier trained by the present invention.
  • the invention realizes the real-time and time-series sentiment analysis of microblog vertical topics, while taking into account the accuracy rate of the microblog sentiment classifier, and enhances the real-time and timeliness of the sentiment analysis of the microblog.
  • the invention constructs a million-level Chinese microblog sentiment analysis corpus, which is currently the largest corpus in the field.
  • the invention overcomes the problem of vector sparseness based on the bag of words model, and uses fastText to train distributed word vectors and sentiment classifiers, so as to learn more semantics of microblog short texts.
  • the topic constraint model of microblog proposed by the present invention realizes the filtering of noise microblogs in the microblog data set of specific topics.
  • the experimental results show that the accuracy rate of the Weibo sentiment classifier provided by the present invention reaches 92.2%, and the time-series sentiment analysis of the Weibo topic realized on this basis can also accurately reflect the sentiment trend of the topic.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a microblog sentiment analysis method and system. The method and system use a focused web crawler to collect a number of microblog text entries on a target topic within a preset time period as target topic data, and input each target topic data into a microblog sentiment classifier to obtain the sentiment type of each target topic data. The invention adopts a weakly supervised learning method based on emoticon words and emotion words to filter emotional microblogs, selects equal numbers of positive and negative microblog data to construct a million-scale Chinese microblog corpus, and uses the corpus to train a fastText classifier; the resulting microblog sentiment classifier takes both classification accuracy and timeliness into account and can accurately reflect the sentiment trend of a topic.

Description

一种微博情感分析方法及系统 技术领域
本发明涉及自然语言处理技术领域,特别是涉及一种微博情感分析方法及系统。
背景技术
微博话题的情感分析旨在探索社交网络上人们对于某一话题或事件的观点和态度。智能手机的普及使得越来越多的人从移动终端接入互联网,进入社交网络。新浪微博作为国内较大的社交网络平台,其日活跃用户量已经超过了1.5亿,平均每日发布的微博总数达两亿条之多。微博海量的数据中蕴含着丰富的实时信息,人们可以将生活动态和观点推送到微博上,也可以对热门事件进行评论。这些带有主观色彩的数据给情感分析的研究带来了很大的便利。微博的实时和时序情感信息挖掘可以准确的反映出微博话题走向并进行预警,对于个人、企业和政府来说都有积极意义。
微博的数据具有实时性和时效性,抓住微博信息的时效性,分析最新的话题数据,才能更大的发挥数据的价值。目前针对微博情感分析的研究大多致力于运用深度学习的方法提高情感分类器的分类性能,其使用的数据集也大多是该领域最典型的斯坦福Twitter英文情感分析数据集,并没有针对微博某一特定话题或领域的垂直时序分析,也并没有针对某一话题的大规模微博数据集。大多数研究都是在已有的数据集上进行静态的情感分析,时效性较差。
发明内容
本发明的目的是提供一种微博情感分析方法及系统,可兼顾分类的准确性和时效性,能够准确反映话题的情感走向。
为实现上述目的,本发明提供了如下方案:
一种微博情感分析方法,所述分析方法包括:
采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据;
将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型,所述微博情感分类器的输入为微博文本数据,所述微 博情感分类器的输出为积极微博或消极微博;所述微博情感分类器的建立方法具体包括:
采用通用网络爬虫采集若干微博文本数据作为分类训练数据;
获取微博文本的特征表情词,所述特征表情词包括积极表情词和消极表情词;
利用所述特征表情词对所述分类训练数据进行分类,获得积极微博数据和消极微博数据,所述积极微博数据为带有积极表情词的微博数据,所述消极微博数据为带有消极表情词的微博数据;
选取数量相等的积极微博数据和消极微博数据构成语料库;
利用所述语料库对fastText分类器进行训练,获得所述微博情感分类器。
可选的,所述获取微博文本的特征表情词之前,还包括:
对所述分类训练数据进行去噪处理,获得去噪处理后的分类训练数据,所述去噪处理具体包括:
过滤掉微博文本数据中的颜文字和符号;
采用正则表达式对统一资源定位符链接和邮箱进行匹配过滤;
过滤掉字符长度小于设定阈值的微博文本数据。
可选的,所述选取数量相等的积极微博数据和消极微博数据构成语料库之前还包括:
判断所述积极微博数据中是否存在情感极性词典中的消极情感词,获得第一判断结果;
当所述第一判断结果表示所述积极微博数据中存在情感极性词典中的消极情感词,则将存在消极情感词的积极微博数据滤除;
判断所述消极微博数据中是否存在情感极性词典中的积极情感词,获得第二判断结果;
当所述第二判断结果表示所述消极微博数据中存在情感极性词典中的积极情感词,则将存在积极情感词的消极微博数据滤除。
可选的,所述将各个所述目标话题数据输入微博情感分类器之前,还包括:
随机选取与所述目标话题数据数量相同的微博文本数据作为约束训 练数据;
利用所述约束训练数据对所述fastText分类器进行训练,获得微博话题约束模型;
采用所述微博话题约束模型对所述目标话题数据进行不相关话题清洗,获得清洗后的目标话题数据。
可选的,所述将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型之后,还包括:
将各个所述目标话题数据的情感类型按照对应的目标话题数据的发布时间排列在时间轴上。
一种微博情感分析系统,所述分析系统包括:
目标话题数据采集模块,用于采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据;
情感分析模块,用于将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型,所述微博情感分类器的输入为微博文本数据,所述微博情感分类器的输出为积极微博或消极微博;所述微博情感分类器的建立子系统具体包括:
分类训练数据采集模块,用于采用通用网络爬虫采集若干微博文本数据作为分类训练数据;
特征表情词获取模块,用于获取微博文本的特征表情词,所述特征表情词包括积极表情词和消极表情词;
微博数据分类模块,用于利用所述特征表情词对所述分类训练数据进行分类,获得积极微博数据和消极微博数据,所述积极微博数据为带有积极表情词的微博数据,所述消极微博数据为带有消极表情词的微博数据;
语料库构建模块,用于选取数量相等的积极微博数据和消极微博数据构成语料库;
分类器训练模块,用于利用所述语料库对fastText分类器进行训练,获得所述微博情感分类器。
可选的,所述微博情感分类器的建立子系统还包括:
去噪处理模块,用于对所述分类训练数据进行去噪处理,获得去噪处理后的分类训练数据,所述去噪处理具体包括:
过滤掉微博文本数据中的颜文字和符号;
采用正则表达式对统一资源定位符链接和邮箱进行匹配过滤;
过滤掉字符长度小于设定阈值的微博文本数据。
可选的,所述微博情感分类器的建立子系统还包括:
第一判断模块,用于判断所述积极微博数据中是否存在情感极性词典中的消极情感词,获得第一判断结果;
第一过滤模块,用于当所述第一判断结果表示所述积极微博数据中存在情感极性词典中的消极情感词,则将存在消极情感词的积极微博数据滤除;
第二判断模块,用于判断所述消极微博数据中是否存在情感极性词典中的积极情感词,获得第二判断结果;
第二过滤模块,用于当所述第二判断结果表示所述消极微博数据中存在情感极性词典中的积极情感词,则将存在积极情感词的消极微博数据滤除。
可选的,所述微博情感分析系统还包括:
约束训练数据选取模块,用于随机选取与所述目标话题数据数量相同的微博文本数据作为约束训练数据;
约束模型确定模块,用于利用所述约束训练数据对所述fastText分类器进行训练,获得微博话题约束模型;
不相关话题清洗模块,用于采用所述微博话题约束模型对所述目标话题数据进行不相关话题清洗,获得清洗后的目标话题数据。
可选的,所述微博情感分析系统还包括:
时序分析模块,用于将各个所述目标话题数据的情感类型按照对应的目标话题数据的发布时间排列在时间轴上。
根据本发明提供的具体实施例,本发明公开了以下技术效果:
本发明提供的微博情感分析方法及系统,采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据,将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型。本发明采用基于表情词和情感词的弱监督学习方法进行情感微博的过滤,选取数量相等的积极微博数据和消极微博数据构建了一个百万量级的 中文微博语料库,利用语料库对fastText分类器进行训练获得的微博情感分类器,可兼顾分类的准确性和时效性,能够准确反映话题的情感走向。
说明书附图
下面结合附图对本发明作进一步说明:
图1为本发明实施例提供的一种微博情感分析方法的流程图;
图2为本发明实施例提供的微博情感分类器的建立方法的流程图;
图3为本发明实施例提供的一种微博情感分析系统的结构框图;
图4为本发明实施例提供的微博情感分类器的建立子系统的结构框图;
图5为本发明实施例提供的数据去噪处理的流程图;
图6为本发明实施例提供的数据去噪处理结果图;
图7为本发明实施例提供的微博情感分类器的整体框架图;
图8为本发明实施例提供的以天为时间颗粒度的时序情感分析示意图;
图9为本发明实施例提供的以小时为时间颗粒度的时序情感分析的示意图。
具体实施方式
下面结合本发明实施例中的附图,对本发明实施例中技术方案进行详细的描述,显然,所描述的实施例仅是本发明一部分实施例,而不是全部的实施例;基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例都属于本发明保护的范围。
本发明的目的是提供一种微博情感分析方法及系统,可兼顾分类的准确性和时效性,能够准确反映话题的情感走向。
为使本发明的上述目的、特征和优点能够更加明显易懂,下面结合附图和具体实施方式对本发明作进一步详细的说明。
图1为本发明实施例提供的一种微博情感分析方法的流程图。如图1所示,一种微博情感分析方法,所述分析方法包括:
步骤101:采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据。
聚焦网络爬虫聚焦于某一目标话题,实现了特定话题特定时间段内微博文本的获取,既可以采集到该话题的历史微博数据,也可以采集到该话题的当天实时数据,采集到的数据用于垂直话题的时序实时情感分析。
步骤102:将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型,所述微博情感分类器的输入为微博文本数据,所述微博情感分类器的输出为积极微博或消极微博。
优选地,执行步骤102:将各个所述目标话题数据输入微博情感分类器之前,还包括对所述目标话题数据进行去噪处理,获得去噪处理后的目标话题数据,所述去噪处理具体包括:过滤掉微博文本数据中的颜文字和符号;采用正则表达式对统一资源定位符(Uniform Resource Locator,URL)链接和邮箱进行匹配过滤;过滤掉字符长度小于设定阈值的微博文本数据。
优选地,执行步骤102:将各个所述目标话题数据输入微博情感分类器之前,还包括:
随机选取与所述目标话题数据数量相同的微博文本数据作为约束训练数据。本实施例中,是将从分类训练数据中随机选取的与所述目标话题数据数量相同的微博文本数据作为约束训练数据。
利用所述约束训练数据对fastText分类器进行训练,获得微博话题约束模型。
采用所述微博话题约束模型对所述目标话题数据进行不相关话题清洗,过滤掉话题不相关的微博,获得清洗后的目标话题数据。微博话题约束模型实际上是一个分类模型,用于对目标话题数据进行分类,分成话题相关微博和话题不相关微博,并将话题不相关的微博噪声过滤掉。
为了能够准确直观地反映话题的情感走向,执行所述步骤102:将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型之后,还包括:
将各个所述目标话题数据的情感类型按照对应的目标话题数据的发布时间排列在时间轴上,便于对分类结果进行时序分析。将分类结果展示在时间轴上,能够实现天级和小时级等不同时间颗粒度的情感分析,从而 了解一个话题随着时间推移,其微博情感的变化情况。
图2为本发明实施例提供的微博情感分类器的建立方法的流程图。如图2所示,所述微博情感分类器的建立方法具体包括:
步骤201:采用通用网络爬虫采集若干微博文本数据作为分类训练数据。
采用通用网络爬虫采集大量的微博文本数据,其采用了多线程和代理的技术,实现了58万条/天的微博文本高并发爬取,采集到的分类训练数据用于情感分类器的训练。
步骤202:获取微博文本的特征表情词,所述特征表情词包括积极表情词和消极表情词。
利用具有强情感色彩的特征表情词对微博进行分类,带有积极表情的微博划分为积极微博,带有消极表情的微博划分为消极微博。
步骤203:利用所述特征表情词对所述分类训练数据进行分类,获得积极微博数据和消极微博数据,所述积极微博数据为带有积极表情词的微博数据,所述消极微博数据为带有消极表情词的微博数据。
步骤204:选取数量相等的积极微博数据和消极微博数据构成语料库。
本实施例采用弱监督学习方法从数据集中提取出420万条情感积极的微博和68万条情感消极的微博。从情感积极的微博集中随机挑选出与消极微博数量相等的微博,构成了中文微博情感分析的语料库weibo_sentiment_corpus,用于下一步的情感分类器的训练。
为了防止在训练过程中分类器选取表情词作为特征,分配给表情词较大的权重,本实施例将语料库中每一条微博包含的表情词剥离掉。另外,考虑到分布式词向量的训练中每一个词的词向量都产生自上下文的关系,而停用词在上下文中还是能够提供有效信息的,所以本实施例并没有对停用词进行清理。
步骤205:利用所述语料库对所述fastText分类器进行训练,获得所述微博情感分类器。
本实施例中,语料库80%的微博文本作为训练集,20%的微博文本作 为测试集,最终分类准确率最高的fastText分类器的词向量长度为300维度,其达到了92.2%的准确率。通过提升词向量的维度可进一步提升分类器的准确率。
优选地,执行所述步骤202:获取微博文本的特征表情词之前,还包括:
对所述分类训练数据进行去噪处理,获得去噪处理后的分类训练数据,所述去噪处理具体包括:过滤掉微博文本数据中的颜文字和符号;采用正则表达式对统一资源定位符(Uniform Resource Locator,URL)链接和邮箱进行匹配过滤;过滤掉字符长度小于设定阈值的微博文本数据。
新浪微博中有大量的@和##符号,用来表示提到某人或是给微博加上标签,这些特有属性会在分类器的训练中带来噪声。同时,新浪微博上还有大量由特殊字符组成的颜文字,分类器不能识别这些字符的语义,所以要在去噪处理阶段过滤掉微博文本数据中的颜文字和符号,并采用正则表达式对URL链接和邮箱进行匹配过滤,然后计算每条微博文本数据的长度,从而过滤掉字符长度小于设定阈值的无效微博,其中一个汉字按照一个字符来算。可选地,设定阈值的范围为4-10,优选地,设定阈值为5。最后用结巴(jieba)分词对每条微博文本进行分词处理。
本实施例中NTUSD(National Taiwan University Sentiment Dictionary)情感词典被用来对微博进行双重过滤,一条微博如果包含有与表情词情感色彩相异的情感词,也会被过滤掉,即执行所述步骤204:选取数量相等的积极微博数据和消极微博数据构成语料库之前还包括:
判断所述积极微博数据中是否存在情感极性词典中的消极情感词,获得第一判断结果。
当所述第一判断结果表示所述积极微博数据中存在情感极性词典中的消极情感词,则将存在消极情感词的积极微博数据滤除;
判断所述消极微博数据中是否存在情感极性词典中的积极情感词,获得第二判断结果。
当所述第二判断结果表示所述消极微博数据中存在情感极性词典中的积极情感词,则将存在积极情感词的消极微博数据滤除。
表1 典型表情词
本实施例为了提取到带有情感色彩而且没有表情歧义的微博,采用手动方式挑选出了具有强情感色彩的特征表情词,如表1所示,其中包含有18个典型的消极表情词和37个典型的积极表情词。NTUSD情感词典也被用作微博的双重过滤,一条微博如果包含有与表情词情感色彩相异的情感词,也会被过滤掉。一条微博被划分为情感微博,首先该微博必须包含有特征表情词,用正则表达式\[[a-zA-z\u4e00-\u9fff]{1,5}\]来获取文本中所有的表情词,如果微博文本包含有表情词并且只包含有一类表情词,例如只包含有积极表情词,那么再判断该微博中的其他词语是否与情感词典中的消极情感词有交集,如果不包含有消极情感词,那么该条微博就被划分为积极微博。整个过滤过程算法如下,
输入:情感词典NTUSD,微博数据集weibos,微博表情词典emoji_dict
输出:积极微博集合pos_set,消极微博集合neg_set
pos_set←∅
neg_set←∅ //输出集合初始化
pos_emotions,neg_emotions←load(NTUSD)
pos_emojis,neg_emojis←load(emoji_dict)
for weibo in weibos do
words←set(weibo)//微博词去重
if len(words&pos_emojis)>0 and
len(words&neg_emotions)==0 and len(words&neg_emojis)==0 then//&求交集
add weibo to pos_set//将满足条件的微博加入积极微博集合中
end if
if len(words&neg_emojis)>0 and
len(words&pos_emotions)==0 and len(words&pos_emojis)==0 then
add weibo to neg_set
end if
end for
return pos_set,neg_set//返回积极和消极微博
可见,对于一条微博,如果将其划分为情感积极的微博,那么其必须满足三个条件,第一,包含积极的表情词,第二,不包含消极的表情词,第三,不包含消极的情感词,其中表情词来自微博表情词典,情感词来自NTUSD词典。虽然情感微博过滤算法的过滤条件比较严格,但是3500万条的微博数据总量还是保证了其能过滤出大量具有强烈情感色彩的微博数据。
图3为本发明实施例提供的一种微博情感分析系统的结构框图。如图3所示,一种微博情感分析系统,所述分析系统包括:
目标话题数据采集模块301,用于采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据。
情感分析模块302,用于将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型,所述微博情感分类器的输入为微博文本数据,所述微博情感分类器的输出为积极微博或消极微博。
优选地,所述微博情感分析系统还包括:
约束训练数据选取模块,用于随机选取与所述目标话题数据数量相同的微博文本数据作为约束训练数据;
约束模型确定模块,用于利用所述约束训练数据对所述fastText分类器进行训练,获得微博话题约束模型;
不相关话题清洗模块,用于采用所述微博话题约束模型对所述目标话题数据进行不相关话题清洗,获得清洗后的目标话题数据。
为了能够准确直观地反映话题的情感走向，所述微博情感分析系统还包括：
时序分析模块,用于将各个所述目标话题数据的情感类型按照对应的目标话题数据的发布时间排列在时间轴上。
图4为本发明实施例提供的所述微博情感分类器的建立子系统的结构框图。如图4所示,所述微博情感分类器的建立子系统包括:
分类训练数据采集模块401,用于采用通用网络爬虫采集若干微博文本数据作为分类训练数据。
特征表情词获取模块402,用于获取微博文本的特征表情词,所述特征表情词包括积极表情词和消极表情词;
微博数据分类模块403,用于利用所述特征表情词对所述分类训练数据进行分类,获得积极微博数据和消极微博数据,所述积极微博数据为带有积极表情词的微博数据,所述消极微博数据为带有消极表情词的微博数据;
语料库构建模块404,用于选取数量相等的积极微博数据和消极微博数据构成语料库;
分类器训练模块405,用于利用所述语料库对所述fastText分类器进行训练,获得所述微博情感分类器。
优选地,所述微博情感分类器的建立子系统还包括:
去噪处理模块,用于对所述分类训练数据进行去噪处理,获得去噪处理后的分类训练数据,所述去噪处理具体包括:
过滤掉微博文本数据中的颜文字和符号;
采用正则表达式对统一资源定位符链接和邮箱进行匹配过滤;
过滤掉字符长度小于设定阈值的微博文本数据。
优选地,所述微博情感分类器的建立子系统还包括:
第一判断模块,用于判断所述积极微博数据中是否存在情感极性词典中的消极情感词,获得第一判断结果;
第一过滤模块,用于当所述第一判断结果表示所述积极微博数据中存在情感极性词典中的消极情感词,则将存在消极情感词的积极微博数据滤除;
第二判断模块,用于判断所述消极微博数据中是否存在情感极性词典 中的积极情感词,获得第二判断结果;
第二过滤模块,用于当所述第二判断结果表示所述消极微博数据中存在情感极性词典中的积极情感词,则将存在积极情感词的消极微博数据滤除。
本发明提供的微博情感分析系统的实施流程如下:
S1,微博通用爬虫结合微博应用程序接口,采集到3500万条各种话题的微博文本作为分类训练数据;
S2,微博聚焦爬虫结合微博应用程序接口,对特定话题的微博进行实时数据和历史数据的获取,作为目标话题数据。其中,目标话题数据包含了每条微博发布的时间信息,后期用来进行时序分析。
S3,分别对步骤S1和S2采集到的数据进行清洗并分词。具体包括:
S301:利用正则表达式匹配微博中最常见的@和#符号,清洗掉@和@附带的用户名,过滤掉#和#代表的所有标签;
S302:利用正则表达式匹配并过滤微博文本中的URL链接和邮箱地址,经过统计,采集到的3500万条微博文本中,有67万条包含了URL链接和邮箱地址,即平均每100条数据中有两条需要进行清理。
S303:将网络上常见的颜文字拆解,得到特殊字符字典,并利用该字典过滤微博文本中的特殊字符;
S304：删除每条微博文本中多余的空格，并计算每一条微博文本的长度，若字符长度小于设定阈值，则过滤掉该条微博。
如图5所示,微博文本的清洗工作有如下四个步骤:微博特有属性清洗、URL链接和邮箱清洗、特殊字符清洗和短微博清洗。源数据共有6.34GB大小,图6展示了在每一次清理后,剩余的数据量大小。新浪微博中有大量的@和##符号,用来表示提到某人或是给微博加上标签,这些特有的属性会在分类器的训练中带来噪声,而且由于微博长度限制的解禁,一条微博里可能含有多个标签,若不去掉标签,后期分类器的训练中会分配给标签较大的权重。清理掉微博的特有属性后,总数据量从6.34GB减少为6.12GB。经过统计,微博数据集中有67万条包含有URL链接或邮箱地址的文本,即平均每100条数据中,有两条是包含有链接和邮箱的, 我们采用了正则表达式对URL链接和邮箱进行匹配并过滤。经过过滤,总数据量从6.12GB减少为6.11GB。由于网络用语的随意性和微博用户群体的年轻化,新浪微博上有大量由特殊字符组成的颜文字,这些颜文字充斥在文本的各个角落,机器不能识别这些字符的语义,所以要在预处理阶段过滤掉。过滤掉特殊字符后,总数据量从6.11GB减少为5.75GB。最后一个预处理步骤是短微博的过滤,经过上述步骤的清洗,很多微博文本的长度会变短,设定字符长度小于5的微博为无效微博,其中一个汉字按照一个字符来算。经过过滤,有228万条无效的短微博被过滤掉,最后数据集中还剩下3348万条有效微博,大小共计5.21GB。
S4,利用微博特征表情词和情感极性词典,对通用网络爬虫采集到的数据进行弱监督学习训练,过滤出带有强烈情感色彩的微博,作为微博情感分析的语料库。
所述步骤S4中的微博表情词为具有强烈情感色彩的积极表情词和消极表情词，情感极性词典使用了NTUSD情感词典。以构建情感积极的微博集合positive_set为例，对于每一条微博文本，如果其中包含有积极表情词并且不包含有消极表情词和NTUSD中的消极情感词，那么该条微博就加入positive_set，消极微博集合negative_set的构建同理。最后通过该弱监督方法过滤得到了positive_set和negative_set，集合大小都是68万，共同组成了微博情感分析的语料库weibo_sentiment_corpus，其中有134万条情感微博，该语料库为目前已知最大的中文微博情感分析语料库。
最后过滤得到的weibo_sentiment_corpus中,是包含有停用词的,并没有遵循传统的文本清洗中过滤停用词的步骤,实验证明,基于有停用词的训练集的分类器,其准确率比无停用词训练得到的分类器高0.4个百分点。另外,语料库中的微博文本是已经过滤掉表情词的文本,这样避免了训练分类器的过程中,表情词被赋予较大的权重,影响分类的精度。
S5,利用fastText对步骤S4生成的语料库进行情感分类器的训练,得到微博情感分类器。
微博情感分类器的训练选取了weibo_senti-ment_corpus中80%的微博作为训练集,20%的微博作为测试集。通过测试集的测试结果可知,该 分类器的分类准确率达到了92.2%。
S6,采用微博话题约束模型对步骤S2采集到的目标话题的微博数据中的话题不相关微博进行过滤。微博话题约束模型采用了fastText进行约束模型的训练,其训练得到的微博话题约束模型能够过滤掉话题不相关的噪声微博。
S7,利用步骤S5生成的情感分类器对步骤S6过滤得到的微博话题数据进行分类,具体包括:
S701:对于特定的目标话题微博,每隔10分钟采集一次最新数据;
S702:对于S701中采集到的最新数据,采用步骤S6中的微博话题约束模型过滤话题不相关的噪声微博;
S703:对于S702中过滤得到的目标话题微博,采用步骤S3进行清洗,然后存储到数据库中;
S704:对于S703中清洗完毕的数据,采用步骤S5训练得到的微博情感分类器进行分类,然后将分类结果按照时间顺序同步到数据库中。
S8,将步骤S7中分类结果按照时间顺序动态的展示出来,从而实现特定话题的实时时序情感分析。
步骤S8中对于微博特定话题的时序情感分析属于应用层的分析。从数据库中读取目标话题微博的所有分类结果,然后按照时间顺序将情感的分类结果绘制成图,图的轴为时间轴,轴为微博条数。图中有两条曲线,位于轴上方的曲线代表特定话题随着时间变化的积极情感程度,位于轴下方的曲线代表该话题随着时间变化的消极情感程度。
如图6所示,本发明提出的微博情感分析方法获得的微博情感分类器一共分为四层,分别是:数据采集和预处理层、模型层、数据存储层和应用层。
数据采集和预处理层中,通用网络爬虫和聚焦网络爬虫负责数据的采集,采集到的数据经过短暂的存储后,进行数据的预处理。在该层中,通用爬虫一共采集到3500万条微博,文件共计6.34GB大小。聚焦网络爬虫可以采集任一特定话题的历史数据和实时数据。本实施例挑选了2018年4月比较热门的话题:中兴危机,作为例子进行说明,也利用聚焦网络 爬虫采集到了和中兴话题相关的从2018年1月1日到2018年5月1日的历史微博文本,共计3.8万条。
在模型层,一共有三个模型,分别是弱监督学习数据集生成模型、微博话题约束模型和情感极性分类模型。数据集生成模型采用了基于特征表情词和NTUSD情感词典的弱监督学习方法,过滤得到了420万条情感积极的微博和68万条情感消极的微博。由于微博总体情感偏向于积极,而且积极特征表情词的数量是消极表情词数量的两倍,所以最终提取出来的情感积极的微博数量是远远大于消极微博的。为了防止训练情感分类器的过程中,一条微博被赋予较大的积极情感的先验概率,故从情感积极的微博集中随机挑选出与消极微博数等量的微博,构成了中文微博情感分析的语料库,用于情感分类器的训练。另外,对于每一条微博文本,按照其包含的微博表情词的情感类别,在文本开头加入“__label__negative”或“__label__positive”作为分类的标签。对于微博话题约束模型,在采集到特定话题的历史数据集history_set后,从通用网络爬虫采集到的数据中随机抓取与特定话题微博数量相等的微博,作为话题不相关的微博数据集irrelevant_topic_set,与history_set一起放入分类器中进行训练,生成话题分类器即微博话题约束模型,最后采用微博话题约束模型对history_set和实时爬取到的微博中的每一条微博进行话题相关度分析,相关度范围α为:0≤α≤1。设定相关度阈值,如果α≥0.6,那么该条微博即为话题相关微博。对于最后的微博情感分类器,在实际的分类器训练中,将语料库中80%的微博文本作为训练集,20%的微博文本作为测试集。实验结果证明,fastText可以在100秒内对3400万词汇量、字典大小为36万的数据集进行快速训练,基于有停用词的训练集的分类器的分类准确率达到了92.2%,比基于无停用词的训练集训练得到的分类器高0.4%。可见,停用词在分布式词向量的训练中是起作用的。传统的基于规则和统计的模型之所以要去除停用词,根本原因在于其只学习到了文字的符号意义,每两个词都是相互割裂开来的,并不是根据上下文去推断一个词的语义。本发明是基于一个大规模的数据集得到的,如果将话题缩小到某个具体的领域,那么分类器的分类性能应该会得到进一步的提升。
基于中兴公司在2018年的前四个月有较高的话题热度,本发明对中兴的话题微博进行详细的分析。在数据的采集阶段,聚焦网络爬虫采集到了3.8万条中兴话题的微博,平均每天310条微博,每小时13条微博,该数据量足以支撑以天数和小时为单位的时序情感分析。从聚焦爬虫中随机抽取出3.8万条微博,这些微博大概率上是与中兴话题无关的微博,作为训练集中的不相关话题和3.8万条中兴话题微博一起送入fastText进行话题约束模型的训练。模型训练好后,再对3.8万条中兴话题的微博进行重新分类,如果其与中兴话题的相关度大于或等于60%,即属于中兴话题的概率大于或等于60%,那么该条微博就作为话题相关的微博。
利用之前训练出来的微博情感分类器对中兴话题微博中的每一条微博进行分类,最后按照时间顺序得到了每一条微博的情感分类结果。如图7所示,以天数为单位对2018年3月21日到2018年4月27日的中兴话题微博进行了时序情感分析,从图8中可以直观的看到,3月21日到4月15日的微博情感都是偏正面的,直到4月16日中兴被美国商务部制裁的事件发生,情感消极的微博急剧增加并超过了情感积极的微博,这反映了中兴面临的巨大危机。再将时间聚焦到4月16日的24小时进行更垂直的分析,从图9可以看出,情感积极的微博数量在大部分时间里多于情感消极的微博,而在21点之后,情感消极的微博数量突然增加并且不断增长,这说明晚上9点到10点之间中兴产生了舆论危机,这正好对应了美国商务部在美国东部时间9时宣布的制裁中兴事件。
以上的分析证明了微博话题的垂直时序分析的可行性。不同于百度指数和微博微指数的热度曲线,本发明提出的分析方法在兼顾话题事件热度的同时,还实现了积极情感和消极情感的动态时序分析,能够快速直观的反映出话题的情感变化。这也从应用的角度证明了本发明训练出来的微博情感分类器的有效性与实用性。
本发明实现了微博垂直话题的实时和时序情感分析,在兼顾微博情感分类器准确率的同时,增强了微博情感分析的实时性和时效性。本发明基于表情词和情感词的弱监督学习方法,构建了一个百万量级的中文微博情感分析语料库,该语料库目前为该领域最大的语料库。本发明克服了基于 词袋模型的向量稀疏问题,采用fastText进行分布式词向量和情感分类器的训练,从而学习到微博短文本更多的语义。本发明提出的微博话题约束模型,实现了特定话题微博数据集中噪声微博的过滤。实验结果表明,本发明提供的微博情感分类器的准确率达到了92.2%,在此基础上实现的微博话题的时序情感分析也能准确反映话题的情感走向。
上面结合附图对本发明的实施方式作了详细说明,但是本发明并不限于上述实施方式,在所属技术领域普通技术人员所具备的知识范围内,还可以在不脱离本发明宗旨的前提下做出各种变化。

Claims (10)

  1. 一种微博情感分析方法,其特征在于,所述分析方法包括:
    采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据;
    将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型,所述微博情感分类器的输入为微博文本数据,所述微博情感分类器的输出为积极微博或消极微博;所述微博情感分类器的建立方法具体包括:
    采用通用网络爬虫采集若干微博文本数据作为分类训练数据;
    获取微博文本的特征表情词,所述特征表情词包括积极表情词和消极表情词;
    利用所述特征表情词对所述分类训练数据进行分类,获得积极微博数据和消极微博数据,所述积极微博数据为带有积极表情词的微博数据,所述消极微博数据为带有消极表情词的微博数据;
    选取数量相等的积极微博数据和消极微博数据构成语料库;
    利用所述语料库对fastText分类器进行训练,获得所述微博情感分类器。
  2. 根据权利要求1所述的微博情感分析方法,其特征在于,所述获取微博文本的特征表情词之前,还包括:
    对所述分类训练数据进行去噪处理,获得去噪处理后的分类训练数据,所述去噪处理具体包括:
    过滤掉微博文本数据中的颜文字和符号;
    采用正则表达式对统一资源定位符链接和邮箱进行匹配过滤;
    过滤掉字符长度小于设定阈值的微博文本数据。
  3. 根据权利要求1所述的微博情感分析方法,其特征在于,所述选取数量相等的积极微博数据和消极微博数据构成语料库之前还包括:
    判断所述积极微博数据中是否存在情感极性词典中的消极情感词,获得第一判断结果;
    当所述第一判断结果表示所述积极微博数据中存在情感极性词典中的消极情感词,则将存在消极情感词的积极微博数据滤除;
    判断所述消极微博数据中是否存在情感极性词典中的积极情感词,获得第二判断结果;
    当所述第二判断结果表示所述消极微博数据中存在情感极性词典中的积极情感词,则将存在积极情感词的消极微博数据滤除。
  4. 根据权利要求1所述的微博情感分析方法,其特征在于,所述将各个所述目标话题数据输入微博情感分类器之前,还包括:
    随机选取与所述目标话题数据数量相同的微博文本数据作为约束训练数据;
    利用所述约束训练数据对所述fastText分类器进行训练,获得微博话题约束模型;
    采用所述微博话题约束模型对所述目标话题数据进行不相关话题清洗,获得清洗后的目标话题数据。
  5. 根据权利要求1所述的微博情感分析方法,其特征在于,所述将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型之后,还包括:
    将各个所述目标话题数据的情感类型按照对应的目标话题数据的发布时间排列在时间轴上。
  6. 一种微博情感分析系统,其特征在于,所述分析系统包括:
    目标话题数据采集模块,用于采用聚焦网络爬虫采集目标话题在预设时间段内的若干微博文本数据作为目标话题数据;
    情感分析模块,用于将各个所述目标话题数据输入微博情感分类器,获得各个所述目标话题数据的情感类型,所述微博情感分类器的输入为微博文本数据,所述微博情感分类器的输出为积极微博或消极微博;所述微博情感分类器的建立子系统具体包括:
    分类训练数据采集模块,用于采用通用网络爬虫采集若干微博文本数据作为分类训练数据;
    特征表情词获取模块,用于获取微博文本的特征表情词,所述特征表情词包括积极表情词和消极表情词;
    微博数据分类模块,用于利用所述特征表情词对所述分类训练数据进行分类,获得积极微博数据和消极微博数据,所述积极微博数据为带有积极表情词的微博数据,所述消极微博数据为带有消极表情词的微博数据;
    语料库构建模块,用于选取数量相等的积极微博数据和消极微博数据构成语料库;
    分类器训练模块,用于利用所述语料库对fastText分类器进行训练,获得所述微博情感分类器。
  7. 根据权利要求6所述的微博情感分析系统,其特征在于,所述微博情感分类器的建立子系统还包括:
    去噪处理模块,用于对所述分类训练数据进行去噪处理,获得去噪处理后的分类训练数据,所述去噪处理具体包括:
    过滤掉微博文本数据中的颜文字和符号;
    采用正则表达式对统一资源定位符链接和邮箱进行匹配过滤;
    过滤掉字符长度小于设定阈值的微博文本数据。
  8. 根据权利要求6所述的微博情感分析系统,其特征在于,所述微博情感分类器的建立子系统还包括:
    第一判断模块,用于判断所述积极微博数据中是否存在情感极性词典中的消极情感词,获得第一判断结果;
    第一过滤模块,用于当所述第一判断结果表示所述积极微博数据中存在情感极性词典中的消极情感词,则将存在消极情感词的积极微博数据滤除;
    第二判断模块,用于判断所述消极微博数据中是否存在情感极性词典中的积极情感词,获得第二判断结果;
    第二过滤模块,用于当所述第二判断结果表示所述消极微博数据中存在情感极性词典中的积极情感词,则将存在积极情感词的消极微博数据滤除。
  9. 根据权利要求6所述的微博情感分析系统,其特征在于,所述微博情感分析系统还包括:
    约束训练数据选取模块,用于随机选取与所述目标话题数据数量相同的微博文本数据作为约束训练数据;
    约束模型确定模块,用于利用所述约束训练数据对所述fastText分类 器进行训练,获得微博话题约束模型;
    不相关话题清洗模块,用于采用所述微博话题约束模型对所述目标话题数据进行不相关话题清洗,获得清洗后的目标话题数据。
  10. 根据权利要求6所述的微博情感分析系统,其特征在于,所述微博情感分析系统还包括:
    时序分析模块,用于将各个所述目标话题数据的情感类型按照对应的目标话题数据的发布时间排列在时间轴上。
PCT/CN2019/120584 2018-11-28 2019-11-25 一种微博情感分析方法及系统 WO2020108430A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811432829.3 2018-11-28
CN201811432829.3A CN109543110A (zh) 2018-11-28 2018-11-28 一种微博情感分析方法及系统

Publications (1)

Publication Number Publication Date
WO2020108430A1 true WO2020108430A1 (zh) 2020-06-04

Family

ID=65850645

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120584 WO2020108430A1 (zh) 2018-11-28 2019-11-25 一种微博情感分析方法及系统

Country Status (2)

Country Link
CN (1) CN109543110A (zh)
WO (1) WO2020108430A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115331A (zh) * 2020-09-21 2020-12-22 朱彤 基于分布式网络爬虫与nlp的资本市场舆情监测方法

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543110A (zh) * 2018-11-28 2019-03-29 南京航空航天大学 一种微博情感分析方法及系统
CN109977231B (zh) * 2019-04-10 2021-04-02 上海海事大学 一种基于情感衰变因子的抑郁情绪分析方法
CN110674415B (zh) * 2019-09-20 2022-06-17 北京浪潮数据技术有限公司 一种信息显示方法、装置及服务器
CN110941759B (zh) * 2019-11-20 2022-11-11 国元证券股份有限公司 一种微博情感分析方法
CN111078879A (zh) * 2019-12-09 2020-04-28 北京邮电大学 基于深度学习的卫星互联网文本敏感信息检测方法及装置
CN111125548A (zh) * 2019-12-31 2020-05-08 北京金堤科技有限公司 舆论监督方法和装置、电子设备和存储介质
CN111611455A (zh) * 2020-05-22 2020-09-01 安徽理工大学 一种微博热点话题下基于用户情感行为特征的用户群体划分方法
CN111680132B (zh) * 2020-07-08 2023-05-19 中国人民解放军国防科技大学 一种用于互联网文本信息的噪声过滤和自动分类方法
CN111986259A (zh) * 2020-08-25 2020-11-24 广州市百果园信息技术有限公司 颜文字检测模型的训练、视频数据的审核方法及相关装置
CN112559746A (zh) * 2020-12-11 2021-03-26 南京邮电大学 一种产品评论挖掘方法和系统
CN116562302A (zh) * 2023-06-29 2023-08-08 昆明理工大学 融合汉越关联关系的多语言事件观点对象识别方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407449A (zh) * 2016-09-30 2017-02-15 四川长虹电器股份有限公司 一种基于支持向量机的情感分类方法
US20180032870A1 (en) * 2015-10-22 2018-02-01 Tencent Technology (Shenzhen) Company Limited Evaluation method and apparatus based on text analysis, and storage medium
CN108536674A (zh) * 2018-03-21 2018-09-14 上海蔚界信息科技有限公司 一种基于语义的典型意见聚合方法
CN109543110A (zh) * 2018-11-28 2019-03-29 南京航空航天大学 一种微博情感分析方法及系统

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390051B (zh) * 2013-07-25 2016-07-20 南京邮电大学 一种基于微博数据的话题发现与追踪方法
WO2017051425A1 (en) * 2015-09-23 2017-03-30 Devanathan Giridhari A computer-implemented method and system for analyzing and evaluating user reviews

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032870A1 (en) * 2015-10-22 2018-02-01 Tencent Technology (Shenzhen) Company Limited Evaluation method and apparatus based on text analysis, and storage medium
CN106407449A (zh) * 2016-09-30 2017-02-15 四川长虹电器股份有限公司 一种基于支持向量机的情感分类方法
CN108536674A (zh) * 2018-03-21 2018-09-14 上海蔚界信息科技有限公司 一种基于语义的典型意见聚合方法
CN109543110A (zh) * 2018-11-28 2019-03-29 南京航空航天大学 一种微博情感分析方法及系统

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUO, JIE: "Research and Application of Sentiment Classification Technology Based on Web Comments", CHINESE MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNAL), INFORMATION SCIENCE AND TECHNOLOGY, 31 August 2018 (2018-08-31), DOI: 20200217120004PX *
WAN, SHUO: "Vertical and Sequential Sentiment Analysis of Micro-blog Topic", INTERNATIONAL CONFERENCE ON ADVANCED DATA MINING AND APPLICATIONS ADMA 2018, 29 December 2018 (2018-12-29), DOI: 20200217115908PX *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115331A (zh) * 2020-09-21 2020-12-22 朱彤 基于分布式网络爬虫与nlp的资本市场舆情监测方法
CN112115331B (zh) * 2020-09-21 2021-05-04 朱彤 基于分布式网络爬虫与nlp的资本市场舆情监测方法

Also Published As

Publication number Publication date
CN109543110A (zh) 2019-03-29

Similar Documents

Publication Publication Date Title
WO2020108430A1 (zh) 一种微博情感分析方法及系统
Saad et al. Twitter sentiment analysis based on ordinal regression
CN106980692B (zh) 一种基于微博特定事件的影响力计算方法
TWI653542B (zh) 一種基於網路媒體資料流程發現並跟蹤熱點話題的方法、系統和裝置
Hammad et al. An approach for detecting spam in Arabic opinion reviews
CN111143576A (zh) 一种面向事件的动态知识图谱构建方法和装置
TWI501097B (zh) 文字串流訊息分析系統和方法
CN106940732A (zh) 一种面向微博的疑似水军发现方法
CN103617290B (zh) 中文机器阅读系统
CN104216964B (zh) 一种面向微博的非分词突发话题检测方法
CN105354216B (zh) 一种中文微博话题信息处理方法
Pan et al. Deep neural network-based classification model for Sentiment Analysis
Srikanth et al. [Retracted] Sentiment Analysis on COVID‐19 Twitter Data Streams Using Deep Belief Neural Networks
Alp et al. Extracting topical information of tweets using hashtags
CN111783456A (zh) 一种利用语义分析技术的舆情分析方法
Nahar et al. Sentiment analysis and emotion extraction: A review of research paradigm
CN110019763B (zh) 文本过滤方法、系统、设备及计算机可读存储介质
Dhanalakshmi et al. Sentiment analysis using VADER and logistic regression techniques
CN105205075B (zh) 基于协同自扩展的命名实体集合扩展方法及查询推荐方法
Zhang et al. Spam comments detection with self-extensible dictionary and text-based features
CN109871889A (zh) 突发事件下大众心理评估方法
CN111221941B (zh) 基于文本内容和行文风格的社交媒体谣言鉴别算法
Suresh An innovative and efficient method for Twitter sentiment analysis
Li et al. Identification of public opinion on COVID-19 in microblogs
Kumar et al. Real-time hashtag based event detection model with sentiment analysis for recommending user tweets

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19891385

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19891385

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19891385

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 18/03/2022)