WO2020108430A1 - A microblog sentiment analysis method and system - Google Patents
- Publication number
- WO2020108430A1 (PCT/CN2019/120584)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- microblog
- sentiment
- negative
- Prior art date: 2018-11-28
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/33 — Information retrieval of unstructured textual data; Querying
- G06F16/35 — Clustering; Classification
- G06F16/9535 — Search customisation based on user profiles and personalisation
Definitions
- the invention relates to the technical field of natural language processing, in particular to a microblog sentiment analysis method and system.
- the sentiment analysis of Weibo topics aims to explore people's views and attitudes on a topic or event on social networks.
- the popularity of smart phones has enabled more and more people to access the Internet from mobile terminals and enter social networks.
- Sina Weibo has more than 150 million daily active users, and the average daily number of Weibo posts reaches 200 million.
- the massive amount of data in Weibo contains rich real-time information. People can push their life trends and opinions to Weibo, and they can also comment on popular events. These subjective data bring great convenience to the research of sentiment analysis.
- the real-time and time-series emotional information mining of Weibo can accurately reflect the trend of Weibo topics and provide early warning, which has positive significance for individuals, enterprises and governments.
- Weibo data is real-time and time-sensitive. Only by grasping the timeliness of Weibo information and analyzing the latest topic data can the value of the data be brought into full play.
- Most research on Weibo sentiment analysis is devoted to using deep learning methods to improve the classification performance of sentiment classifiers.
- Most of the data sets used are English-language, typically the Stanford Twitter sentiment analysis dataset, the most representative in the field. There is no large-scale microblog dataset for a specific topic, nor vertical time-series analysis of a specific topic or field. Most studies perform static sentiment analysis on existing datasets, so their timeliness is poor.
- the purpose of the present invention is to provide a microblog sentiment analysis method and system, which can take into account the accuracy and timeliness of classification, and can accurately reflect the sentiment trend of the topic.
- the present invention provides the following solutions:
- a microblog sentiment analysis method includes:
- Each target topic data item is input to a Weibo sentiment classifier to obtain its sentiment type; the input of the Weibo sentiment classifier is Weibo text data, and its output is positive Weibo or negative Weibo. The method for establishing the Weibo sentiment classifier specifically includes:
- using a general web crawler to collect several pieces of Weibo text data as classification training data;
- obtaining the characteristic emoticon words of the microblog text, the characteristic emoticon words including positive emoticon words and negative emoticon words;
- Before acquiring the characteristic emoticon words of the microblog text, the method further includes:
- the de-noising process specifically includes:
- Before selecting equal amounts of positive microblog data and negative microblog data to form a corpus, the method further includes:
- the negative microblog data with the positive emotion word is filtered out.
- Before inputting the target topic data into the Weibo sentiment classifier, the method further includes:
- the target topic data is cleaned by using the Weibo topic constraint model to obtain the target topic data after cleaning.
- After obtaining the sentiment type of each target topic data item, the method further includes:
- the sentiment types of each of the target topic data are arranged on the time axis according to the release time of the corresponding target topic data.
- a microblog sentiment analysis system includes:
- the target topic data collection module is used to collect several microblog text data of the target topic within a preset time period as the target topic data by using the focused web crawler;
- a sentiment analysis module for inputting each target topic data item into a Weibo sentiment classifier to obtain its sentiment type; the input of the Weibo sentiment classifier is microblog text data, and the output of the sentiment classifier is a positive Weibo or a negative Weibo. The establishment subsystem of the Weibo sentiment classifier specifically includes:
- a classification training data collection module used to collect several pieces of Weibo text data as classification training data using a general web crawler;
- the characteristic expression word acquisition module is used to obtain characteristic expression words of the microblog text, and the characteristic expression words include positive expression words and negative expression words;
- a microblog data classification module for classifying the classification training data using the characteristic expression words to obtain positive microblog data and negative microblog data, the positive microblog data being microblog data with positive expression words,
- and the negative microblog data being microblog data with negative expression words;
- a corpus building module used to select an equal number of positive Weibo data and negative Weibo data to form a corpus
- a classifier training module is used to train the fastText classifier using the corpus to obtain the Weibo sentiment classifier.
- the establishment subsystem of the Weibo sentiment classifier further includes:
- the denoising processing module is used for denoising the classification training data to obtain the denoising processed training data.
- the denoising processing specifically includes:
- the establishment subsystem of the Weibo sentiment classifier further includes:
- the first judgment module is used to judge whether there is a negative emotion word in the emotion polarity dictionary in the positive microblog data to obtain a first judgment result;
- a first filtering module configured to filter out positive microblog data with negative emotion words when the first judgment result indicates that negative emotional words in an emotional polarity dictionary exist in the positive microblog data;
- a second judgment module used to judge whether there is a positive emotion word in the emotion polarity dictionary in the negative microblog data, to obtain a second judgment result
- the second filtering module is configured to filter out negative microblog data with positive emotion words when the second judgment result indicates that there are positive emotion words in the emotional polarity dictionary in the negative microblog data.
- the Weibo sentiment analysis system further includes:
- the constrained training data selection module is used to randomly select microblog text data with the same number of target topic data as the constrained training data
- a constraint model determination module for training the fastText classifier using the constraint training data to obtain a Weibo topic constraint model
- the irrelevant topic cleaning module is configured to perform irrelevant topic cleaning on the target topic data using the Weibo topic constraint model to obtain the cleaned target topic data.
- the Weibo sentiment analysis system further includes:
- the timing analysis module is used to arrange the sentiment types of the target topic data on the time axis according to the release time of the corresponding target topic data.
- the present invention discloses the following technical effects:
- The microblog sentiment analysis method and system provided by the present invention adopt a focused web crawler to collect several pieces of microblog text data of a target topic within a preset time period as target topic data, and input each target topic data item into a microblog sentiment classifier to obtain its emotion type.
- The invention adopts a weakly supervised learning method based on expression words and emotion words to filter emotional microblogs, selects equal numbers of positive and negative microblog data to construct a million-scale Chinese microblog corpus, and uses the corpus to train a fastText classifier.
- The microblog sentiment classifier thus obtained can take into account both the accuracy and the timeliness of classification, and can accurately reflect the emotional trend of a topic.
- FIG. 1 is a flowchart of a microblog sentiment analysis method provided by an embodiment of the present invention
- FIG. 2 is a flowchart of a method for establishing a microblog sentiment classifier provided by an embodiment of the present invention
- FIG. 3 is a structural block diagram of a microblog sentiment analysis system provided by an embodiment of the present invention.
- FIG. 4 is a structural block diagram of a building subsystem of a microblog sentiment classifier provided by an embodiment of the present invention
- FIG. 7 is an overall framework diagram of a microblog sentiment classifier provided by an embodiment of the present invention.
- FIG. 8 is a schematic diagram of time series sentiment analysis using days as the time granularity provided by an embodiment of the present invention.
- FIG. 9 is a schematic diagram of time series sentiment analysis using hour as a time granularity according to an embodiment of the present invention.
- the purpose of the present invention is to provide a microblog sentiment analysis method and system, which can take into account the accuracy and timeliness of classification, and can accurately reflect the sentiment trend of the topic.
- FIG. 1 is a flowchart of a microblog sentiment analysis method provided by an embodiment of the present invention. As shown in FIG. 1, a microblog sentiment analysis method, the analysis method includes:
- Step 101 Use focused web crawlers to collect several microblog text data of a target topic within a preset time period as target topic data.
- Focused web crawler focuses on a target topic, and achieves the acquisition of Weibo text within a specific time period for a specific topic.
- Both the historical microblog data of the topic and the same-day real-time data of the topic can be collected.
- the data is used for time-series real-time sentiment analysis of vertical topics.
- Step 102: Input each target topic data item into a Weibo sentiment classifier to obtain its sentiment type; the input of the Weibo sentiment classifier is Weibo text data, and the output of the Weibo sentiment classifier is positive Weibo or negative Weibo.
- Before performing step 102 (inputting each target topic data item into the Weibo sentiment classifier), the method further includes denoising the target topic data to obtain denoised target topic data. The denoising process includes: filtering out emoticons and symbols in the Weibo text data; using regular expressions to match and filter Uniform Resource Locator (URL) links and email addresses; and filtering out Weibo text data whose character length is less than a set threshold.
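The denoising operations described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the symbol list, the exact regular expressions, and the length threshold of 5 are stand-ins, not the patent's actual implementation.

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URL links
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")          # email addresses
# A tiny illustrative set of kaomoji/symbol fragments; the real list is larger.
SYMBOL_RE = re.compile(r"[~^_<>|\\/*]+|:\)|:\(")

def denoise(text: str, min_len: int = 5) -> Optional[str]:
    """Return the cleaned text, or None if it is too short to keep."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    text = SYMBOL_RE.sub("", text)
    text = text.strip()
    return text if len(text) >= min_len else None
```

Texts shorter than the threshold after cleaning are dropped entirely rather than kept as empty noise.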
- Before performing step 102 (inputting each target topic data item into the Weibo sentiment classifier), the method further includes:
- randomly selecting, from the classification training data, microblog text data equal in number to the target topic data as constraint training data.
- The Weibo topic constraint model is actually a classification model used to divide the target topic data into topic-related microblogs and topic-unrelated microblogs, filtering out the noise of topic-unrelated microblogs.
- After performing step 102 (inputting each target topic data item into the Weibo sentiment classifier and obtaining its emotion type), the method further includes:
- arranging the sentiment types of the target topic data on the time axis according to the release time of the corresponding target topic data, so as to facilitate time-series analysis of the classification results. Displaying the classification results on the time axis enables sentiment analysis at different time granularities, such as day and hour, to understand how a topic's sentiment changes over time.
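Arranging classified results on a time axis at day or hour granularity, as described above, can be sketched as follows. The record format (ISO timestamp plus a sentiment label) is an assumption for illustration.

```python
from collections import Counter
from datetime import datetime

def bucket_sentiments(records, granularity="day"):
    """Count positive/negative microblogs per time bucket.

    records: iterable of (release_time_iso, sentiment) pairs,
    where sentiment is 'positive' or 'negative'.
    """
    fmt = "%Y-%m-%d" if granularity == "day" else "%Y-%m-%d %H:00"
    counts = Counter()
    for ts, sentiment in records:
        bucket = datetime.fromisoformat(ts).strftime(fmt)
        counts[(bucket, sentiment)] += 1
    return counts

recs = [
    ("2018-04-17T09:30:00", "negative"),
    ("2018-04-17T10:05:00", "negative"),
    ("2018-04-18T11:00:00", "positive"),
]
by_day = bucket_sentiments(recs, "day")
```

The resulting per-bucket counts are what the two curves of the time-series chart would plot.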
- FIG. 2 is a flowchart of a method for establishing a microblog sentiment classifier provided by an embodiment of the present invention. As shown in FIG. 2, the establishment method of the Weibo sentiment classifier specifically includes:
- Step 201 Use a general web crawler to collect several microblog text data as classification training data.
- A general web crawler is used to collect a large amount of microblog text data; it uses multi-threading and proxy technology to achieve high-concurrency crawling of 580,000 microblog texts per day.
- The collected classification training data is used for training the sentiment classifier.
- Step 202 Obtain the characteristic expression words of the microblog text, and the characteristic expression words include positive expression words and negative expression words.
- Micro-blogs are classified using characteristic expression words with strong emotional colors. Micro-blogs with positive expressions are classified as positive micro-blogs, and micro-blogs with negative expressions are classified as negative micro-blogs.
- Step 203 Use the characteristic expression words to classify the classification training data to obtain positive microblog data and negative microblog data.
- The positive microblog data is microblog data with positive expression words, and the negative microblog data is microblog data with negative expression words.
- Step 204 Select equal amounts of positive microblog data and negative microblog data to form a corpus.
- A weakly supervised learning method is used to extract 4.2 million positive-emotion microblogs and 680,000 negative-emotion microblogs from the data set. Microblogs equal in number to the negative microblogs are randomly selected from the positive-emotion set; together they constitute the Chinese Weibo sentiment analysis corpus weibo_sentiment_corpus, used for the subsequent training of the sentiment classifier.
- This embodiment strips off the emoticon words contained in each microblog in the corpus.
- This embodiment does not remove stop words.
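The balanced sampling step above — drawing from the positive set a random subset equal in size to the negative set — can be sketched as follows. This is a toy illustration; in the patent the sets contain 4.2 million and 680,000 microblogs respectively.

```python
import random

def build_balanced_corpus(positive, negative, seed=42):
    """Randomly downsample the larger positive set to the size of the
    negative set, yielding a 50/50 labeled corpus."""
    rng = random.Random(seed)
    sampled_pos = rng.sample(positive, len(negative))
    corpus = [(t, "positive") for t in sampled_pos] + \
             [(t, "negative") for t in negative]
    rng.shuffle(corpus)  # mix the two classes before training
    return corpus

pos = [f"正面微博{i}" for i in range(100)]
neg = [f"负面微博{i}" for i in range(20)]
corpus = build_balanced_corpus(pos, neg)
```

Balancing the classes this way avoids the classifier inheriting the strong positive prior of raw Weibo data.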
- Step 205 Use the corpus to train the fastText classifier to obtain the Weibo sentiment classifier.
- The fastText classifier with the highest final classification accuracy uses 300-dimensional word vectors, achieving an accuracy of 92.2%.
- the accuracy of the classifier can be further improved by increasing the dimension of the word vector.
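fastText's supervised mode expects one example per line, prefixed with a label tag. Preparing the corpus file and the training call might look like this; the `fasttext.train_supervised` call is shown commented out because it requires the fasttext package and the full corpus file, so only the file-format helper is exercised here.

```python
def to_fasttext_line(text, label):
    """Format one corpus entry in fastText supervised format:
    '__label__<label> <tokenized text>'."""
    return f"__label__{label} {text}"

lines = [
    to_fasttext_line("今天 很 开心", "positive"),
    to_fasttext_line("真的 很 难过", "negative"),
]
# with open("weibo_sentiment_corpus.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines))
#
# import fasttext
# model = fasttext.train_supervised(
#     input="weibo_sentiment_corpus.txt",
#     dim=300)  # 300-dimensional word vectors, per the text above
```

The file name and tokenization are placeholders; any whitespace-tokenized UTF-8 text in this line format is accepted by fastText.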
- Before performing step 202 (obtaining the characteristic emoticon words of the microblog text), the method further includes denoising the classification training data.
- The denoising process specifically includes: filtering out the emoticons and symbols in the Weibo text data; using regular expressions to match and filter Uniform Resource Locator (URL) links and email addresses; and filtering out microblog text data whose character length is less than the set threshold.
- Step 204 Before selecting equal amounts of positive microblog data and negative microblog data to form a corpus, the method further includes:
- the negative microblog data with the positive emotion word is filtered out.
- The characteristic expression words with strong emotional color are manually selected, as shown in Table 1, which contains 18 typical negative emoticon words and 37 typical positive emoticon words.
- The NTUSD sentiment dictionary is also used as a second filter: if a microblog contains emotion words whose polarity differs from that of its emoticon words, it is also filtered out.
- Two conditions must be met for a microblog to be classified as an emotional microblog.
- First, the microblog must contain characteristic emoticon words; the regular expression `\[[a-zA-Z\u4e00-\u9fff]{1,5}\]` is used to extract the emoticon-word text.
- Second, if the microblog text contains emoticon words of only one polarity (for example, only positive emoticon words), it is checked whether the other words in the microblog intersect with the negative emotion words in the sentiment dictionary; if it contains no negative emotion words, the microblog is classified as a positive microblog.
- The algorithm of the entire filtering process is as follows.
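A minimal Python sketch of the described double filter — bracket-pattern emoticon-word extraction plus sentiment-dictionary cross-checking. The word lists here are toy stand-ins for Table 1 and the NTUSD dictionary; this is an illustration, not the patent's actual listing.

```python
import re

# Weibo emoticon words appear as bracketed short tokens, e.g. [哈哈]
EMOTICON_RE = re.compile(r"\[([a-zA-Z\u4e00-\u9fff]{1,5})\]")

# Toy stand-ins; the patent uses 37 positive / 18 negative emoticon
# words (Table 1) and the NTUSD emotion polarity dictionary.
POS_EMOJI = {"哈哈", "赞", "心"}
NEG_EMOJI = {"怒", "泪", "伤心"}
POS_WORDS = {"开心", "喜欢"}
NEG_WORDS = {"难过", "讨厌"}

def label_weibo(text):
    """Return 'positive', 'negative', or None (filtered out)."""
    emojis = set(EMOTICON_RE.findall(text))
    plain = EMOTICON_RE.sub("", text)  # text with emoticon words stripped
    has_pos, has_neg = emojis & POS_EMOJI, emojis & NEG_EMOJI
    if has_pos and not has_neg:
        # Double filter: drop if an opposite-polarity dictionary word appears.
        return None if any(w in plain for w in NEG_WORDS) else "positive"
    if has_neg and not has_pos:
        return None if any(w in plain for w in POS_WORDS) else "negative"
    return None  # no emoticon word, or mixed polarity
```

Microblogs with no emoticon word, mixed-polarity emoticon words, or contradicting dictionary words are all discarded rather than guessed.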
- FIG. 3 is a structural block diagram of a microblog sentiment analysis system provided by an embodiment of the present invention. As shown in FIG. 3, a microblog sentiment analysis system, the analysis system includes:
- the target topic data collection module 301 is configured to use focused web crawlers to collect several microblog text data of the target topic within a preset time period as target topic data.
- the sentiment analysis module 302 is used to input each target topic data into a Weibo sentiment classifier to obtain the sentiment type of each target topic data.
- The input of the Weibo sentiment classifier is Weibo text data, and the output of the Weibo sentiment classifier is positive Weibo or negative Weibo.
- the Weibo sentiment analysis system further includes:
- the constrained training data selection module is used to randomly select microblog text data with the same number of target topic data as the constrained training data
- a constraint model determination module for training the fastText classifier using the constraint training data to obtain a Weibo topic constraint model
- the irrelevant topic cleaning module is configured to perform irrelevant topic cleaning on the target topic data using the Weibo topic constraint model to obtain the cleaned target topic data.
- the Weibo sentiment analysis system further includes:
- the timing analysis module is used to arrange the sentiment types of the target topic data on the time axis according to the release time of the corresponding target topic data.
- FIG. 4 is a structural block diagram of the establishment subsystem of the microblog sentiment classifier provided by an embodiment of the present invention. As shown in FIG. 4, the establishment subsystem of the Weibo sentiment classifier includes:
- the classification training data collection module 401 is used for collecting several microblog text data as classification training data by using a general web crawler.
- the characteristic expression word obtaining module 402 is used to obtain characteristic expression words of the microblog text, and the characteristic expression words include positive expression words and negative expression words;
- the microblog data classification module 403 is used to classify the classification training data by using the characteristic expression words to obtain positive microblog data and negative microblog data.
- The positive microblog data is microblog data with positive expression words, and the negative microblog data is microblog data with negative expression words;
- a corpus construction module 404 used to select an equal number of positive Weibo data and negative Weibo data to form a corpus
- a classifier training module 405 is used to train the fastText classifier using the corpus to obtain the Weibo sentiment classifier.
- the establishment subsystem of the Weibo sentiment classifier further includes:
- the denoising processing module is used for denoising the classification training data to obtain the denoising processed training data.
- the denoising processing specifically includes:
- the establishment subsystem of the Weibo sentiment classifier further includes:
- the first judgment module is used to judge whether there is a negative emotion word in the emotion polarity dictionary in the positive microblog data to obtain a first judgment result;
- a first filtering module configured to filter out positive microblog data with negative emotion words when the first judgment result indicates that negative emotional words in an emotional polarity dictionary exist in the positive microblog data;
- a second judgment module used to judge whether there is a positive emotion word in the emotion polarity dictionary in the negative microblog data, and obtain a second judgment result
- the second filtering module is configured to filter out negative microblog data with positive emotion words when the second judgment result indicates that there are positive emotion words in the emotional polarity dictionary in the negative microblog data.
- the implementation process of the microblog sentiment analysis system provided by the present invention is as follows:
- The Weibo focused crawler, combined with the Weibo application programming interface, obtains real-time and historical data of Weibo on a specific topic as target topic data.
- The target topic data contains the release time of each Weibo, which is used later for time-series analysis.
- S301: Use regular expressions to match the most common @ and # symbols in Weibo, clean out the usernames attached to @, and filter out all tags enclosed between # symbols;
- S302: Use regular expressions to match and filter the URL links and email addresses in the Weibo text. Statistics show that 670,000 of the 35 million collected Weibo texts contain URL links or email addresses; that is, about two in every hundred need to be cleaned.
- S303 Disassemble the common emoticons on the network to obtain a dictionary of special characters, and use the dictionary to filter the special characters in the microblog text;
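Steps S301–S303 can be sketched with the following regular expressions. The exact patterns and the special-character dictionary are assumptions for illustration; the patent only names the operations.

```python
import re

MENTION_RE = re.compile(r"@[\w\u4e00-\u9fff-]+")  # @username mentions
HASHTAG_RE = re.compile(r"#[^#]+#")               # #tag# pairs
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
SPECIAL_CHARS = set("♥★→℡")  # toy stand-in for the special-character dictionary

def clean_weibo(text):
    """Apply Weibo-attribute, URL/email, and special-character cleaning."""
    text = HASHTAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)   # strip emails before @mentions
    text = MENTION_RE.sub("", text)
    text = "".join(c for c in text if c not in SPECIAL_CHARS)
    return text.strip()
```

Emails are removed before @mentions so that the mention pattern cannot eat the middle of an address.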
- the cleaning of Weibo text has the following four steps: Weibo specific attribute cleaning, URL link and mailbox cleaning, special character cleaning and short Weibo cleaning.
- the source data has a total size of 6.34GB.
- Figure 6 shows the size of the remaining data after each cleaning step. Sina Weibo contains many @ and # symbols, used to mention someone or to tag a microblog. These unique attributes introduce noise into classifier training, and since the length limit on Weibo was lifted, a single microblog may contain multiple tags; if the tags are not removed, they will be assigned greater weight in later classifier training.
- The last preprocessing step is the filtering of short microblogs. After the above cleaning steps, many microblog texts become shorter. Microblogs with a character length of less than 5 are treated as invalid; each Chinese character counts as one character. After filtering, 2.28 million invalid short microblogs were removed, leaving 33.48 million valid microblogs in the dataset, with a total size of 5.21GB.
- Microblog characteristic emoticon words and an emotion polarity dictionary are used to conduct weakly supervised learning on the data collected by the general web crawler, filtering out microblogs with strong emotional color to serve as the corpus for microblog sentiment analysis.
- The microblog emoticon words in step S4 are positive and negative emoticon words with strong emotional color, and the NTUSD sentiment dictionary is used as the emotion polarity dictionary.
- The final filtered weibo_sentiment_corpus retains stop words, departing from the stop-word filtering step of traditional text cleaning.
- The accuracy of the classifier trained on the set with stop words is 0.4% higher than that of the classifier trained without stop words.
- the microblog text in the corpus is text that has filtered out emoticons, which avoids that emoticons are given greater weight in the process of training the classifier, which affects the accuracy of classification.
- In the training of the Weibo sentiment classifier, 80% of the microblogs in weibo_sentiment_corpus are selected as the training set and 20% as the test set. According to the test results, the classification accuracy of the classifier reaches 92.2%.
- Step S7: Use the sentiment classifier generated in step S5 to classify the Weibo topic data filtered in step S6, including:
- S702: For the latest data collected in S701, use the microblog topic constraint model of step S6 to filter out noise microblogs of unrelated topics;
- S703: For the target topic microblogs obtained by the filtering in S702, apply the cleaning of step S3, and then store them in the database;
- S704: For the data cleaned in S703, use the Weibo sentiment classifier trained in step S5 to classify them, and synchronize the classification results to the database in chronological order.
- step S8 dynamically displaying the classification results in step S7 in chronological order, thereby realizing real-time sequential sentiment analysis of specific topics.
- the time series sentiment analysis on the specific topic of Weibo in step S8 belongs to the analysis of the application layer. Read all the classification results of the target topic Weibo from the database, and then draw the emotion classification results into a graph in chronological order.
- The horizontal axis of the graph is the time axis, and the vertical axis is the number of microblogs. There are two curves: the curve above the horizontal axis represents the degree of positive emotion of a particular topic over time, and the curve below it represents the degree of negative emotion of the topic over time.
- the microblog sentiment classifier obtained by the microblog sentiment analysis method proposed by the present invention is divided into four layers, namely: a data acquisition and preprocessing layer, a model layer, a data storage layer, and an application layer.
- the general web crawler and the focused web crawler are responsible for data collection. After a short period of storage, the collected data is preprocessed.
- The general web crawler collected a total of 35 million Weibo posts, with a total file size of 6.34GB. The focused web crawler can collect historical and real-time data on any particular topic. This example selects the hottest topic of April 2018, the ZTE crisis, as an illustration, and uses the focused web crawler to collect 38,000 historical Weibo texts related to the ZTE topic from January 1, 2018 to May 1, 2018.
- The data-set generation model uses a weakly supervised learning method based on characteristic emoticon words and the NTUSD sentiment dictionary, filtering out 4.2 million positive-emotion microblogs and 680,000 negative-emotion microblogs. Since the overall sentiment of Weibo tends to be positive, and the number of positive characteristic emoticon words is twice that of negative ones, the number of extracted positive microblogs is much larger than that of negative microblogs.
- A microblog is thus given a larger prior probability of positive emotion, so microblogs equal in number to the negative microblogs are randomly selected from the positive set; together they constitute the Chinese Weibo sentiment analysis corpus used to train the sentiment classifier.
- The topic relevance takes values in the range 0 to 1.
- In the bag-of-words representation, every two words are independent of each other, and the semantics of a word cannot be inferred from its context.
- the present invention is based on a large-scale data set. If the topic is narrowed to a specific field, the classification performance of the classifier should be further improved.
- the present invention analyzes ZTE’s topic Weibo in detail.
- The focused web crawler collected 38,000 microblogs on the ZTE topic, an average of 310 per day and 13 per hour; this amount of data is sufficient to support time-series sentiment analysis in units of days and hours. Another 38,000 microblogs were randomly selected from the general crawler's classification training data; these are highly likely to be unrelated to the ZTE topic. They serve as the unrelated-topic class in the training set and, together with the 38,000 ZTE-topic microblogs, are fed to fastText to train the topic constraint model. After the model is trained, the 38,000 ZTE microblogs are re-classified: if the relevance to the ZTE topic is greater than or equal to 60%, that is, the probability of belonging to the ZTE topic is at least 60%, the microblog is treated as topic-related.
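The 60% relevance threshold can be applied to the constraint model's predicted probabilities as follows. The probability function is mocked here; in practice the trained fastText topic constraint model's prediction would supply it.

```python
def filter_topic_related(weibos, predict_proba, threshold=0.6):
    """Keep microblogs whose predicted probability of belonging to the
    target topic is at least the threshold."""
    return [w for w in weibos if predict_proba(w) >= threshold]

# Mock probability function standing in for the trained topic
# constraint model (hypothetical, for illustration only).
def mock_proba(text):
    return 0.9 if "中兴" in text else 0.2

kept = filter_topic_related(["中兴芯片", "今天吃什么", "中兴禁令"], mock_proba)
```

Only the threshold logic is the patent's; the scoring function would come from the fastText model.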
- The above analysis proves the feasibility of vertical time-series analysis of Weibo topics. Unlike the heat curves of the Baidu Index and the Weibo Micro Index, the analysis method proposed by the present invention not only takes into account the popularity of topic events, but also realizes dynamic time-series analysis of positive and negative emotions, quickly and intuitively reflecting the emotional changes of topics. This also proves, from an application perspective, the effectiveness and practicability of the microblog sentiment classifier trained by the present invention.
- the invention realizes the real-time and time-series sentiment analysis of microblog vertical topics, while taking into account the accuracy rate of the microblog sentiment classifier, and enhances the real-time and timeliness of the sentiment analysis of the microblog.
- the invention constructs a million-level Chinese microblog sentiment analysis corpus, which is currently the largest corpus in the field.
- the invention overcomes the problem of vector sparseness based on the bag of words model, and uses fastText to train distributed word vectors and sentiment classifiers, so as to learn more semantics of microblog short texts.
- the topic constraint model of microblog proposed by the present invention realizes the filtering of noise microblogs in the microblog data set of specific topics.
- the experimental results show that the accuracy rate of the Weibo sentiment classifier provided by the present invention reaches 92.2%, and the time-series sentiment analysis of the Weibo topic realized on this basis can also accurately reflect the sentiment trend of the topic.
Claims (10)
- A microblog sentiment analysis method, characterized in that the analysis method comprises: using a focused web crawler to collect a number of microblog text data items on a target topic within a preset time period as target topic data; inputting each item of the target topic data into a microblog sentiment classifier to obtain the sentiment type of each item of the target topic data, wherein the input of the microblog sentiment classifier is microblog text data and the output of the microblog sentiment classifier is positive microblog or negative microblog; the method for building the microblog sentiment classifier specifically comprising: using a general web crawler to collect a number of microblog text data items as classification training data; obtaining characteristic emoticon words of microblog texts, the characteristic emoticon words including positive emoticon words and negative emoticon words; classifying the classification training data using the characteristic emoticon words to obtain positive microblog data and negative microblog data, wherein the positive microblog data is microblog data carrying positive emoticon words and the negative microblog data is microblog data carrying negative emoticon words; selecting equal numbers of positive and negative microblog data items to form a corpus; and training a fastText classifier with the corpus to obtain the microblog sentiment classifier.
- The microblog sentiment analysis method according to claim 1, characterized in that, before obtaining the characteristic emoticon words of microblog texts, the method further comprises: denoising the classification training data to obtain denoised classification training data, the denoising specifically comprising: filtering out kaomoji and symbols from the microblog text data; using regular expressions to match and filter uniform resource locator links and email addresses; and filtering out microblog text data whose character length is below a set threshold.
- The microblog sentiment analysis method according to claim 1, characterized in that, before selecting equal numbers of positive and negative microblog data items to form the corpus, the method further comprises: judging whether negative sentiment words from a sentiment polarity dictionary are present in the positive microblog data to obtain a first judgment result; when the first judgment result indicates that negative sentiment words from the sentiment polarity dictionary are present in the positive microblog data, filtering out the positive microblog data containing negative sentiment words; judging whether positive sentiment words from the sentiment polarity dictionary are present in the negative microblog data to obtain a second judgment result; and when the second judgment result indicates that positive sentiment words from the sentiment polarity dictionary are present in the negative microblog data, filtering out the negative microblog data containing positive sentiment words.
- The microblog sentiment analysis method according to claim 1, characterized in that, before inputting each item of the target topic data into the microblog sentiment classifier, the method further comprises: randomly selecting microblog text data equal in quantity to the target topic data as constraint training data; training the fastText classifier with the constraint training data to obtain a microblog topic constraint model; and using the microblog topic constraint model to clean irrelevant topics from the target topic data to obtain cleaned target topic data.
- The microblog sentiment analysis method according to claim 1, characterized in that, after inputting each item of the target topic data into the microblog sentiment classifier and obtaining the sentiment type of each item of the target topic data, the method further comprises: arranging the sentiment types of the target topic data items on a time axis according to the publication times of the corresponding target topic data.
- A microblog sentiment analysis system, characterized in that the analysis system comprises: a target topic data collection module for using a focused web crawler to collect a number of microblog text data items on a target topic within a preset time period as target topic data; and a sentiment analysis module for inputting each item of the target topic data into a microblog sentiment classifier to obtain the sentiment type of each item of the target topic data, wherein the input of the microblog sentiment classifier is microblog text data and the output of the microblog sentiment classifier is positive microblog or negative microblog; the subsystem for building the microblog sentiment classifier specifically comprising: a classification training data collection module for using a general web crawler to collect a number of microblog text data items as classification training data; a characteristic emoticon word acquisition module for obtaining characteristic emoticon words of microblog texts, the characteristic emoticon words including positive emoticon words and negative emoticon words; a microblog data classification module for classifying the classification training data using the characteristic emoticon words to obtain positive microblog data and negative microblog data, wherein the positive microblog data is microblog data carrying positive emoticon words and the negative microblog data is microblog data carrying negative emoticon words; a corpus construction module for selecting equal numbers of positive and negative microblog data items to form a corpus; and a classifier training module for training a fastText classifier with the corpus to obtain the microblog sentiment classifier.
- The microblog sentiment analysis system according to claim 6, characterized in that the subsystem for building the microblog sentiment classifier further comprises: a denoising module for denoising the classification training data to obtain denoised classification training data, the denoising specifically comprising: filtering out kaomoji and symbols from the microblog text data; using regular expressions to match and filter uniform resource locator links and email addresses; and filtering out microblog text data whose character length is below a set threshold.
- The microblog sentiment analysis system according to claim 6, characterized in that the subsystem for building the microblog sentiment classifier further comprises: a first judgment module for judging whether negative sentiment words from a sentiment polarity dictionary are present in the positive microblog data to obtain a first judgment result; a first filtering module for filtering out the positive microblog data containing negative sentiment words when the first judgment result indicates that negative sentiment words from the sentiment polarity dictionary are present in the positive microblog data; a second judgment module for judging whether positive sentiment words from the sentiment polarity dictionary are present in the negative microblog data to obtain a second judgment result; and a second filtering module for filtering out the negative microblog data containing positive sentiment words when the second judgment result indicates that positive sentiment words from the sentiment polarity dictionary are present in the negative microblog data.
- The microblog sentiment analysis system according to claim 6, characterized in that the microblog sentiment analysis system further comprises: a constraint training data selection module for randomly selecting microblog text data equal in quantity to the target topic data as constraint training data; a constraint model determination module for training the fastText classifier with the constraint training data to obtain a microblog topic constraint model; and an irrelevant topic cleaning module for using the microblog topic constraint model to clean irrelevant topics from the target topic data to obtain cleaned target topic data.
- The microblog sentiment analysis system according to claim 6, characterized in that the microblog sentiment analysis system further comprises: a time-series analysis module for arranging the sentiment types of the target topic data items on a time axis according to the publication times of the corresponding target topic data.
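The emoticon-based labeling and balanced-corpus selection of claim 1 can be sketched as follows. The emoticon-word sets here are hypothetical; the patent does not enumerate its actual characteristic emoticon words:

```python
import random

# Hypothetical emoticon-word sets standing in for the patent's
# characteristic positive/negative emoticon words.
POSITIVE_EMOTICONS = {"[哈哈]", "[赞]", "[爱你]"}
NEGATIVE_EMOTICONS = {"[怒]", "[泪]", "[悲伤]"}

def label_by_emoticon(posts):
    """Split posts into positive/negative sets by the emoticon words they carry."""
    positive, negative = [], []
    for post in posts:
        has_pos = any(e in post for e in POSITIVE_EMOTICONS)
        has_neg = any(e in post for e in NEGATIVE_EMOTICONS)
        if has_pos and not has_neg:
            positive.append(post)
        elif has_neg and not has_pos:
            negative.append(post)
    return positive, negative

def balanced_corpus(positive, negative, seed=0):
    """Select equal numbers of positive and negative posts for the corpus."""
    n = min(len(positive), len(negative))
    rng = random.Random(seed)
    return rng.sample(positive, n), rng.sample(negative, n)
```

Posts carrying both polarities of emoticon are ambiguous and are discarded here; that choice is an assumption, not stated in the claims.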
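The denoising of claim 2 can be sketched with standard-library regular expressions. The kaomoji/symbol pattern and the length threshold of 5 characters are assumptions; the patent specifies neither:

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
# Simplified symbol pattern; real kaomoji filtering would need a richer
# pattern or dictionary than this sketch assumes.
KAOMOJI_RE = re.compile(r"[\^><=~\\/()|;:_*#@&%$]+")

def denoise(posts, min_length=5):
    """Strip URLs, emails, and kaomoji symbols; drop too-short posts."""
    cleaned = []
    for post in posts:
        post = URL_RE.sub("", post)
        post = EMAIL_RE.sub("", post)
        post = KAOMOJI_RE.sub("", post)
        post = post.strip()
        if len(post) >= min_length:
            cleaned.append(post)
    return cleaned
```

Order matters in this sketch: URLs and emails are removed before the symbol pass, so their punctuation does not leave fragments behind.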
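The polarity-dictionary cross-filtering of claim 3 can be sketched as below. The dictionary entries are hypothetical placeholders; a real system would load a full sentiment polarity lexicon:

```python
# Hypothetical sentiment-polarity dictionary entries.
POSITIVE_WORDS = {"开心", "喜欢", "棒"}
NEGATIVE_WORDS = {"难过", "讨厌", "糟糕"}

def cross_filter(positive_posts, negative_posts):
    """Drop emoticon-labeled positive posts containing negative lexicon
    words, and negative posts containing positive lexicon words."""
    kept_pos = [p for p in positive_posts
                if not any(w in p for w in NEGATIVE_WORDS)]
    kept_neg = [p for p in negative_posts
                if not any(w in p for w in POSITIVE_WORDS)]
    return kept_pos, kept_neg
```

This step removes posts whose emoticon label conflicts with their wording, which is what keeps the weakly-labeled corpus clean before balancing.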
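The topic constraint model of claim 4 is itself a fastText classifier; a minimal sketch of preparing its training data is shown below. The `topic`/`other` label names are assumptions, and the trained model (via the third-party `fasttext` package) would discard posts predicted as `__label__other`:

```python
import random

def build_constraint_training_file(topic_posts, general_posts, path, seed=0):
    """Write a fastText-format file labeling on-topic posts against an
    equally sized random sample of general posts, per claim 4."""
    rng = random.Random(seed)
    sample = rng.sample(general_posts, min(len(topic_posts), len(general_posts)))
    lines = [f"__label__topic {p}" for p in topic_posts]
    lines += [f"__label__other {p}" for p in sample]
    rng.shuffle(lines)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return len(lines)

# Training the constraint model would then be (fasttext package assumed):
# import fasttext
# constraint_model = fasttext.train_supervised(input="constraint.txt")
```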
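The time-axis arrangement of claims 5 and 10 can be sketched by bucketing classified posts by publication day. Daily granularity is an assumption; the claims only require ordering by publication time:

```python
from collections import Counter
from datetime import datetime

def sentiment_timeline(classified_posts):
    """Given (publish_time, sentiment) pairs, return per-day counts of
    positive and negative posts ordered along the time axis."""
    daily = {}
    for publish_time, sentiment in classified_posts:
        day = publish_time.date()
        daily.setdefault(day, Counter())[sentiment] += 1
    return [(day, daily[day]["positive"], daily[day]["negative"])
            for day in sorted(daily)]

timeline = sentiment_timeline([
    (datetime(2019, 11, 25, 9, 0), "positive"),
    (datetime(2019, 11, 25, 12, 0), "negative"),
    (datetime(2019, 11, 26, 8, 0), "positive"),
])
```

Plotting the two count series per day gives the dynamic positive/negative sentiment curves described in the embodiments.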
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811432829.3 | 2018-11-28 | ||
CN201811432829.3A CN109543110A (zh) | 2018-11-28 | 2018-11-28 | 一种微博情感分析方法及系统 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020108430A1 true WO2020108430A1 (zh) | 2020-06-04 |
Family
ID=65850645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/120584 WO2020108430A1 (zh) | 2018-11-28 | 2019-11-25 | 一种微博情感分析方法及系统 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109543110A (zh) |
WO (1) | WO2020108430A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112115331A (zh) * | 2020-09-21 | 2020-12-22 | 朱彤 | 基于分布式网络爬虫与nlp的资本市场舆情监测方法 |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543110A (zh) * | 2018-11-28 | 2019-03-29 | 南京航空航天大学 | 一种微博情感分析方法及系统 |
CN109977231B (zh) * | 2019-04-10 | 2021-04-02 | 上海海事大学 | 一种基于情感衰变因子的抑郁情绪分析方法 |
CN110674415B (zh) * | 2019-09-20 | 2022-06-17 | 北京浪潮数据技术有限公司 | 一种信息显示方法、装置及服务器 |
CN110941759B (zh) * | 2019-11-20 | 2022-11-11 | 国元证券股份有限公司 | 一种微博情感分析方法 |
CN111078879A (zh) * | 2019-12-09 | 2020-04-28 | 北京邮电大学 | 基于深度学习的卫星互联网文本敏感信息检测方法及装置 |
CN111125548A (zh) * | 2019-12-31 | 2020-05-08 | 北京金堤科技有限公司 | 舆论监督方法和装置、电子设备和存储介质 |
CN111611455A (zh) * | 2020-05-22 | 2020-09-01 | 安徽理工大学 | 一种微博热点话题下基于用户情感行为特征的用户群体划分方法 |
CN111680132B (zh) * | 2020-07-08 | 2023-05-19 | 中国人民解放军国防科技大学 | 一种用于互联网文本信息的噪声过滤和自动分类方法 |
CN111986259A (zh) * | 2020-08-25 | 2020-11-24 | 广州市百果园信息技术有限公司 | 颜文字检测模型的训练、视频数据的审核方法及相关装置 |
CN112559746A (zh) * | 2020-12-11 | 2021-03-26 | 南京邮电大学 | 一种产品评论挖掘方法和系统 |
CN116562302A (zh) * | 2023-06-29 | 2023-08-08 | 昆明理工大学 | 融合汉越关联关系的多语言事件观点对象识别方法 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407449A (zh) * | 2016-09-30 | 2017-02-15 | 四川长虹电器股份有限公司 | 一种基于支持向量机的情感分类方法 |
US20180032870A1 (en) * | 2015-10-22 | 2018-02-01 | Tencent Technology (Shenzhen) Company Limited | Evaluation method and apparatus based on text analysis, and storage medium |
CN108536674A (zh) * | 2018-03-21 | 2018-09-14 | 上海蔚界信息科技有限公司 | 一种基于语义的典型意见聚合方法 |
CN109543110A (zh) * | 2018-11-28 | 2019-03-29 | 南京航空航天大学 | 一种微博情感分析方法及系统 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103390051B (zh) * | 2013-07-25 | 2016-07-20 | 南京邮电大学 | 一种基于微博数据的话题发现与追踪方法 |
WO2017051425A1 (en) * | 2015-09-23 | 2017-03-30 | Devanathan Giridhari | A computer-implemented method and system for analyzing and evaluating user reviews |
- 2018-11-28: CN CN201811432829.3A patent CN109543110A (zh), status: active, Pending
- 2019-11-25: WO PCT/CN2019/120584 patent WO2020108430A1 (zh), status: active, Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180032870A1 (en) * | 2015-10-22 | 2018-02-01 | Tencent Technology (Shenzhen) Company Limited | Evaluation method and apparatus based on text analysis, and storage medium |
CN106407449A (zh) * | 2016-09-30 | 2017-02-15 | 四川长虹电器股份有限公司 | 一种基于支持向量机的情感分类方法 |
CN108536674A (zh) * | 2018-03-21 | 2018-09-14 | 上海蔚界信息科技有限公司 | 一种基于语义的典型意见聚合方法 |
CN109543110A (zh) * | 2018-11-28 | 2019-03-29 | 南京航空航天大学 | 一种微博情感分析方法及系统 |
Non-Patent Citations (2)
Title |
---|
GUO, JIE: "Research and Application of Sentiment Classification Technology Based on Web Comments", CHINESE MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNAL), INFORMATION SCIENCE AND TECHNOLOGY, 31 August 2018 (2018-08-31), DOI: 20200217120004PX * |
WAN, SHUO: "Vertical and Sequential Sentiment Analysis of Micro-blog Topic", INTERNATIONAL CONFERENCE ON ADVANCED DATA MINING AND APPLICATIONS ADMA 2018, 29 December 2018 (2018-12-29), DOI: 20200217115908PX * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112115331A (zh) * | 2020-09-21 | 2020-12-22 | 朱彤 | 基于分布式网络爬虫与nlp的资本市场舆情监测方法 |
CN112115331B (zh) * | 2020-09-21 | 2021-05-04 | 朱彤 | 基于分布式网络爬虫与nlp的资本市场舆情监测方法 |
Also Published As
Publication number | Publication date |
---|---|
CN109543110A (zh) | 2019-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020108430A1 (zh) | 一种微博情感分析方法及系统 | |
Saad et al. | Twitter sentiment analysis based on ordinal regression | |
CN106980692B (zh) | 一种基于微博特定事件的影响力计算方法 | |
TWI653542B (zh) | 一種基於網路媒體資料流程發現並跟蹤熱點話題的方法、系統和裝置 | |
Hammad et al. | An approach for detecting spam in Arabic opinion reviews | |
CN111143576A (zh) | 一种面向事件的动态知识图谱构建方法和装置 | |
TWI501097B (zh) | 文字串流訊息分析系統和方法 | |
CN106940732A (zh) | 一种面向微博的疑似水军发现方法 | |
CN103617290B (zh) | 中文机器阅读系统 | |
CN104216964B (zh) | 一种面向微博的非分词突发话题检测方法 | |
CN105354216B (zh) | 一种中文微博话题信息处理方法 | |
Pan et al. | Deep neural network-based classification model for Sentiment Analysis | |
Srikanth et al. | [Retracted] Sentiment Analysis on COVID‐19 Twitter Data Streams Using Deep Belief Neural Networks | |
Alp et al. | Extracting topical information of tweets using hashtags | |
CN111783456A (zh) | 一种利用语义分析技术的舆情分析方法 | |
Nahar et al. | Sentiment analysis and emotion extraction: A review of research paradigm | |
CN110019763B (zh) | 文本过滤方法、系统、设备及计算机可读存储介质 | |
Dhanalakshmi et al. | Sentiment analysis using VADER and logistic regression techniques | |
CN105205075B (zh) | 基于协同自扩展的命名实体集合扩展方法及查询推荐方法 | |
Zhang et al. | Spam comments detection with self-extensible dictionary and text-based features | |
CN109871889A (zh) | 突发事件下大众心理评估方法 | |
CN111221941B (zh) | 基于文本内容和行文风格的社交媒体谣言鉴别算法 | |
Suresh | An innovative and efficient method for Twitter sentiment analysis | |
Li et al. | Identification of public opinion on COVID-19 in microblogs | |
Kumar et al. | Real-time hashtag based event detection model with sentiment analysis for recommending user tweets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19891385 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19891385 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 18/03/2022) |