CN112883286A - BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic - Google Patents

Publication number
CN112883286A
Authority
CN
China
Prior art keywords
information
data
microblog
epidemic situation
public opinion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011445578.XA
Other languages
Chinese (zh)
Inventor
蒋泓毅
刘薇
肖焯
姜青山
谭忠
陈会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Xiamen University
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University and Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011445578.XA
Publication of CN112883286A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01: Social networking
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Abstract

The invention provides a BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic. The method comprises: acquiring microblog public opinion information and epidemic data related to the COVID-19 epidemic; preprocessing the microblog public opinion information to obtain microblog public opinion data; performing sentiment analysis on the microblog public opinion data with a language model to obtain a sentiment analysis result; and processing the sentiment analysis result according to the microblog public opinion information and the epidemic data to obtain a correlation analysis of the COVID-19 epidemic. The invention constructs a sentiment analysis pipeline that integrates data crawling, cleaning, storage, analysis, and visualization; integrates epidemic data with microblog public opinion data; and tunes the model's hyperparameters on this data set to improve its accuracy. Finally, the development of the epidemic in mainland China is summarized and analyzed in stages, yielding the correlation between COVID-19 outbreaks and changes in public sentiment. The approach is of practical significance for monitoring public opinion during an epidemic.

Description

BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic
Technical Field
The invention relates to the field of public opinion analysis, and in particular to a BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic.
Background
Since its outbreak, novel coronavirus pneumonia (COVID-19), an acute respiratory infectious disease, has drawn wide attention at home and abroad and has become a focus of public opinion. The epidemic has the characteristics of a pandemic, and its spread and severity are causes of serious concern. Analyzing the overall domestic epidemic situation and the development of related public opinion therefore helps people form a clear and intuitive understanding of the current stage of the epidemic and the corresponding public response, and make decisions on that basis.
Existing research on epidemic public opinion analysis focuses mainly on the following aspects: topic mining, sentiment analysis, public opinion dissemination mechanisms, and countermeasures.
Regarding topic mining, Yang, in the reference "Topic mining and sentiment analysis of the 'COVID-19 epidemic'", used Python to collect 1,389 epidemic news items published on People's Daily Online between January 20 and March 22, 2020. Hot topics related to the COVID-19 epidemic were presented through data preprocessing, feature word extraction, and word cloud visualization, and the evolution of public opinion was then mined using co-word analysis, an LDA model, a knowledge graph, and a SnowNLP-based sentiment analysis algorithm.
Koke, in the reference "Research on public opinion analysis methods for public health events, taking the COVID-19 epidemic as an example", used 33 thousand microblog posts about COVID-19 published between January 18 and January 28, 2020. Based on algorithms such as Louvain and K-means spatial clustering and improved BTM topic term extraction, users' hot topics of concern and sentiment characteristics were used as regional labels to construct a public opinion evaluation method that reflects sentiment characteristics, regional association, and hotspot attention. This achieves location-based information fusion and allows differences in public opinion characteristics and topics of concern across regions to be analyzed.
Roche, in the reference "Spatio-temporal evolution characteristics of Chinese public opinion during epidemic prevention and control", took Sina Weibo as the data source and constructed a topic extraction and analysis model based on the latent Dirichlet allocation (LDA) topic model and the random forest algorithm. Thirteen public opinion topics were identified in the microblog texts, and public opinion from January 9 to March 10, 2020 was analyzed in terms of quantity, space, time, and content, including its distribution across key regions such as Hubei Province, the Beijing-Tianjin-Hebei, Yangtze River Delta, Pearl River Delta, and Chengdu-Chongqing urban agglomerations, and border port areas.
Cao, in the reference "Microblog public opinion topic mining and evolution analysis for public health emergencies", combined life cycle theory, the TF-IDF feature word weighting model, and the latent Dirichlet allocation model, integrated the time dimension into microblog text analysis, carried out time-ordered topic mining to uncover implicit topic information and the patterns of public opinion evolution, and proposed corresponding public opinion control strategies.
Regarding sentiment analysis, zhhen, in the reference "Analysis of the COVID-19 epidemic and public opinion evolution based on changes in user sentiment", took microblog comments on "daily epidemic reports" from January 23 to April 8, 2020 as the corpus. The Chinese natural language processing tool SnowNLP was first used to extract the sentiment tendency of the corpus and complete the positive and negative sentiment analysis; the text corpus was then clustered with the Single-Pass clustering algorithm to explore hot epidemic topics; finally, the Louvain community detection algorithm was used to mine the focus of public attention.
Regarding public opinion dissemination mechanisms, Zhao Xueyi, in the reference "Research and reflection on the dissemination mechanism of online public opinion in major public health emergencies, taking the COVID-19 epidemic as an example", crawled COVID-19 public opinion data from the Xinghai microblog to explore the disseminating subjects, disseminated content, and dissemination cycles of online public opinion in major public health emergencies. The study found that the disseminated content is interrelated and shifts quickly, and that public opinion dissemination is periodic and prone to secondary outbreaks.
Regarding countermeasures, Li, in the reference "Research on an emergency response mechanism dually driven by COVID-19 public opinion response and leadership decision-making", used a logistic growth model to forecast the development of public opinion and a fuzzy evaluation model to evaluate leadership style. Using general regression analysis with mediating variables such as the epidemic and epidemic propagation faults, the study evaluated how disease control departments resolve public opinion through emergency response under the dual influence of online public opinion and leadership style, drawing on samples from provinces and cities with high exposure, and offered suggestions for similar disease control departments.
The existing related techniques generally have the following shortcomings:
(1) for epidemic sentiment analysis, there is currently no complete method covering data acquisition, preprocessing, sentiment analysis, and evaluation;
(2) existing technical solutions for epidemic sentiment analysis do not tune the hyperparameters of the sentiment analysis model on an epidemic-related microblog public opinion data set, so their accuracy is poor;
(3) existing technical solutions for epidemic sentiment analysis do not follow the sentiment analysis with an evaluation of its results, and do not consider the relationship between changes in the epidemic and the development of public opinion.
Therefore, a method is needed that collects, organizes, integrates, and analyzes epidemic case information and related microblog public opinion information in order to study epidemic sentiment analysis.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic. The specific technical solution is as follows:
A BERT-based microblog sentiment analysis method for the COVID-19 epidemic comprises the following steps: S1, acquiring microblog public opinion information and epidemic data related to the COVID-19 epidemic; S2, preprocessing the microblog public opinion information to obtain microblog public opinion data; S3, performing sentiment analysis on the microblog public opinion data through a language model to obtain a sentiment analysis result; S4, processing the sentiment analysis result according to the microblog public opinion information and the epidemic data to obtain a correlation analysis of the COVID-19 epidemic.
The correlation analysis comprises epidemic development analysis and epidemic sentiment analysis; the epidemic sentiment analysis comprises a correlation analysis between epidemic outbreaks and changes in public sentiment.
Specifically, the change in public sentiment is decomposed into the total number of comments, the number of positive comments, the number of neutral comments, and the number of negative comments; the epidemic outbreak situation is decomposed into the total number of new cases, the number of new imported cases, and the number of new local cases. A correlation test is carried out on the indicators decomposed from the public sentiment change and the epidemic outbreak situation.
Specifically, the correlation test comprises sequentially performing a normality test, a correlation coefficient calculation, and a significance test on the indicators.
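The three-stage correlation test can be illustrated with a minimal sketch. The text does not name the specific tests, so the choices here are assumptions: a Jarque-Bera statistic for the normality check, Spearman's rank coefficient for the correlation, and a permutation test for significance. All function names and the sample series are hypothetical, and only the Python standard library is used.

```python
import math
import random


def jarque_bera(xs):
    """Normality check (assumed choice): returns (JB statistic, p-value)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exp(-x / 2).
    return jb, math.exp(-jb / 2)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def permutation_p_value(xs, ys, stat=spearman, n_perm=2000, seed=0):
    """Two-sided significance test by shuffling one series (assumed choice)."""
    rng = random.Random(seed)
    observed = abs(stat(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(stat(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical daily indicators, for illustration only.
new_cases = [3, 5, 8, 13, 21, 34, 55, 89]
negative_comments = [10, 12, 19, 25, 40, 61, 90, 150]
_, normality_p = jarque_bera(new_cases)
rho = spearman(new_cases, negative_comments)
significance_p = permutation_p_value(new_cases, negative_comments)
```

A rank-based coefficient is used in the sketch because it does not require the indicators to pass the normality check; if both series look normal, `pearson` can be passed as `stat` instead.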
Specifically, S1 comprises: capturing content from the internet with a crawler program written in Python, and extracting from it the microblog public opinion information and epidemic data related to the COVID-19 epidemic; and storing the microblog public opinion information and the epidemic data in a database.
Specifically, the epidemic data comprises statistics on the numbers of confirmed, suspected, cured, and fatal cases in a region, and statistics on imported cases in the region; the microblog public opinion information is microblog content related to the COVID-19 epidemic; the microblog public opinion information comprises comment information, which is the comments posted by netizens on microblogs about the COVID-19 epidemic.
More specifically, the comment information carries sentiment labels and is divided into: positive comment information, i.e., comment data with a positive sentiment tendency; negative comment information, i.e., comment data with a negative sentiment tendency; neutral comment information, i.e., comment data without sentiment tendency; and abnormal comment information, i.e., comment data whose sentiment label is abnormal.
Specifically, S2 comprises: S21, performing data inspection and cleaning on the microblog public opinion information to obtain first information; S22, converting the text data format of the first information to obtain second information; S23, performing word segmentation on the second information to obtain third information; S24, performing stop word removal on the third information to obtain fourth information; and S25, performing feature conversion on the fourth information to obtain the microblog public opinion data.
More specifically, S21 comprises: performing data inspection and cleaning on the positive, negative, and neutral comment information, and re-judging the sentiment label of the abnormal comment information; and/or S22 comprises: converting the encoding of the first information to UTF-8 to obtain the second information; and/or S23 comprises: performing word segmentation on the second information through a hidden Markov model, based on a Chinese word segmentation library, to obtain the third information; and/or S24 comprises: scanning the words of the third information in sequence against a stop word list, removing a word if it is in the list and keeping it otherwise; and/or S25 comprises: converting the fourth information from text data into vector data by performing word embedding, segment embedding, and position embedding on it, to obtain the microblog public opinion data.
Specifically, S3 comprises: S31, dividing the microblog public opinion data into a training set, a validation set, and a test set; S32, obtaining a language model according to the validation set and the test set; S33, performing sentiment analysis on the training set with the language model to obtain the sentiment analysis result. S32 comprises: loading a language model; training the language model with the validation set to obtain a trained language model; testing the trained language model with the test set; and obtaining the language model that passes the test.
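The three-way division in S31 can be sketched as follows. The 8:1:1 ratio, the fixed random seed, and the function name `split_dataset` are assumptions for illustration; the text does not state the proportions used.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the samples and split them into (train, validation, test).

    The fractions and seed are illustrative assumptions, not values from the text.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Hypothetical sample identifiers standing in for labeled comments.
train, validation, test = split_dataset(list(range(100)))
```

Shuffling with a fixed seed keeps the split reproducible across runs, which matters when hyperparameters are compared on the same held-out data.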
Specifically, the language model comprises a BERT model; the input of the BERT model is a composite representation vector composed of word embeddings, segment embeddings, and position embeddings.
More specifically, the loss function of the BERT model is softmax cross entropy, which is calculated as follows:
$$L = -\sum_{j=1}^{T} y_j \log S_j$$
where $L$ is the loss function, $S_j$ is the $j$-th value of the softmax output vector $S$ and represents the probability that the sample belongs to the $j$-th class, $T$ is the number of classes, and $y_j$ is the $j$-th component of the label vector.
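As a worked illustration of the softmax cross-entropy loss, the sketch below assumes the standard form, in which the loss is the negative sum over the T classes of the label component multiplied by the logarithm of the corresponding softmax probability. The function names and the three-class example are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities S_1..S_T."""
    largest = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, label_vector):
    """L = -sum_j y_j * log(S_j) for one sample with a one-hot label vector."""
    probabilities = softmax(logits)
    return -sum(
        y_j * math.log(s_j)
        for y_j, s_j in zip(label_vector, probabilities)
        if y_j > 0  # zero label components contribute nothing to the sum
    )


# Three classes (for example negative, neutral, positive) with equal scores:
# every class gets probability 1/3, so the loss is log(3).
loss = softmax_cross_entropy([0.0, 0.0, 0.0], [0, 1, 0])
```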
More specifically, the text directionality of the BERT model is set to bidirectional (bidi), the dropout probability of the hidden layer is set to 0.1, the number of neurons in the hidden layer of the encoder is 3072, the size of the pooling layer is 128, the number of attention heads is 12, and the number of layers is 3.
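For reference, the listed hyperparameters can be collected in one configuration object, as in the sketch below. The field names are hypothetical descriptive labels; how each value maps onto the keys of a particular BERT implementation's configuration file is an assumption that would need to be checked against that implementation.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BertHyperparams:
    """Values taken from the text; field names are illustrative assumptions."""
    text_direction: str = "bidi"        # bidirectional text direction
    hidden_dropout_prob: float = 0.1    # dropout probability of the hidden layer
    encoder_hidden_neurons: int = 3072  # neurons in the encoder's hidden layer
    pooling_layer_size: int = 128       # size of the pooling layer
    num_attention_heads: int = 12       # heads of the attention mechanism
    num_layers: int = 3                 # number of layers


config = BertHyperparams()
config_dict = asdict(config)  # plain dict, e.g. for writing to a JSON file
```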
A computer device comprises: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the BERT-based microblog sentiment analysis method for the COVID-19 epidemic described above.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the BERT-based microblog sentiment analysis method for the COVID-19 epidemic described above.
The invention has the following beneficial effects. The invention provides a BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic and overcomes the shortcomings of the prior art. By constructing an integrated method for crawling, cleaning, storage, analysis, and visualization, data can be updated, organized, analyzed, and visualized automatically. Structured and unstructured data related to the COVID-19 epidemic is acquired, sentiment analysis is performed with a BERT model after parameter tuning, and visual analysis results are generated. The BERT model is trained and tuned on a manually labeled epidemic microblog sentiment training set, which improves the model's accuracy on epidemic data. For the public sentiment analysis results, the method examines their relationship with the number of epidemic cases and includes word clouds drawn by time period, so that the relationship between the development of epidemic public opinion and epidemic outbreaks is explained more effectively.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic flow chart of the BERT-based microblog sentiment analysis method for the COVID-19 epidemic;
FIG. 2 is a schematic flow chart of the BERT-based microblog sentiment analysis method for the COVID-19 epidemic according to an embodiment of the present invention, taking microblogs as an example;
FIG. 3 is a label distribution histogram of the labeled epidemic microblog comment data provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the preprocessing in the BERT-based microblog sentiment analysis method for the COVID-19 epidemic;
FIG. 5 is a structure diagram of the BERT model in the BERT-based microblog sentiment analysis method for the COVID-19 epidemic;
FIG. 6 is a representation of the model inputs provided by an embodiment of the present invention;
FIG. 7 is a flow chart of model training in the BERT-based microblog sentiment analysis method for the COVID-19 epidemic;
FIG. 8 is a flow chart of model testing in the BERT-based microblog sentiment analysis method for the COVID-19 epidemic;
FIG. 9 is a distribution diagram of the sentiment analysis results for the local outbreak period according to an embodiment of the present invention;
FIG. 10 is a distribution diagram of the sentiment analysis results for the imported case outbreak period according to an embodiment of the present invention;
FIG. 11 is a distribution diagram of the sentiment analysis results for the small-scale resurgence period according to an embodiment of the present invention;
FIG. 12 is a structure diagram of the BERT-based microblog sentiment analysis system for the COVID-19 epidemic according to an embodiment of the present invention;
FIG. 13 is a block diagram of the processing unit of the BERT-based microblog sentiment analysis system for the COVID-19 epidemic;
FIG. 14 is a diagram of the analysis unit of the BERT-based microblog sentiment analysis method for the COVID-19 epidemic;
FIG. 15 is a schematic diagram of a computer to which the BERT-based microblog sentiment analysis method for the COVID-19 epidemic is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a BERT-based method, device, and medium for microblog sentiment analysis of the COVID-19 epidemic. The method integrates data crawling, cleaning, storage, analysis, and visualization, and an epidemic public opinion analysis system using the method can update, organize, analyze, and visualize the data automatically. Taking the COVID-19 epidemic as an example, the method collects epidemic-related data, performs public opinion modeling and an evaluation of the sentiment analysis results, summarizes and analyzes the development of the epidemic in stages, and finally obtains a correlation analysis between the epidemic and public opinion.
Example 1
To address the shortcomings of the prior art, this embodiment provides a BERT-based microblog sentiment analysis method for the COVID-19 epidemic. As shown in FIG. 1, the specific scheme is as follows:
S1, acquiring microblog public opinion information and epidemic data related to the COVID-19 epidemic;
S2, preprocessing the microblog public opinion information to obtain microblog public opinion data;
S3, performing sentiment analysis on the microblog public opinion data through the language model to obtain a sentiment analysis result;
S4, processing the sentiment analysis result according to the microblog public opinion information and the epidemic data to obtain the epidemic correlation analysis.
Specifically, in S1, microblog public opinion information and epidemic data related to the COVID-19 epidemic are acquired. This embodiment collects data with a Python crawler built on the Scrapy framework. Scrapy is an open-source crawler framework written in Python that can capture web page content from the internet and extract structured data from it. For better data management, the data crawled from the different platforms is stored in a database, cleaned and filtered, organized into data that the subsequent public opinion modeling and visual analysis can call directly, and stored in separate database tables. In addition, the crawled data is organized with Python into the Pandas DataFrame format and then written into a MongoDB database, also with Python. Storage completes automatically once the data has been crawled, and the stored data can be retrieved at any time, as shown in FIG. 2.
The epidemic data comprises nationwide statistics of confirmed, suspected, cured, and fatal cases, broken down by province and city. The data source is DXY (Dingxiangyuan); the data is updated once a day, and the previous day's information is extracted at about nine o'clock each morning. The epidemic data also comprises nationwide imported case statistics, counted by place of entry (province) and source of the imported case (country). The data sources are the health commissions at each level; the data is updated once a day and, given how quickly the original websites update, the previous day's information is extracted at about five o'clock each afternoon. The microblog public opinion information is epidemic-related social content on social networking sites; this embodiment takes microblogs as an example. Epidemic-related microblogs are acquired from the website and crawled day by day, with the previous day's microblogs crawled at midnight at the start of each new day. The collected data is summarized in Table 1:
Table 1. Description of epidemic data and microblog public opinion information
[Table 1 is an image in the original publication: Figure RE-GDA0003031980960000091]
Specifically, in S2, the microblog public opinion information acquired in S1 is preprocessed to obtain microblog public opinion data. This embodiment mainly processes the comment information in the microblog public opinion information, i.e., epidemic microblog comment data: netizen comments related to the epidemic that are crawled from microblogs. For better analysis, data from the DataFountain public epidemic microblog data set is also collected; this data set carries positive and negative sentiment labels. Examples of the data and its labels are as follows:
Table 2. Examples of COVID-19 microblog comment data and their labels
[Table 2 is an image in the original publication: Figure RE-GDA0003031980960000101]
In addition, there is some abnormal data whose sentiment label is abnormal. This embodiment mainly processes three types of data: epidemic microblog comment data with a positive sentiment tendency, with a negative sentiment tendency, and without sentiment tendency (neutral). The data preprocessing is explained step by step below.
S21, performing data inspection and cleaning on the comment information to obtain first information. First, data inspection and cleaning are performed on about 100,000 labeled epidemic microblog comments. A label distribution histogram of the labeled text data is drawn, as shown in FIG. 3. The histogram shows that the labels contain abnormal values, such as labels marked "10" or "2". For data with abnormal labels, the sentiment tendency of the comment (positive, negative, or neutral) is judged manually and a label (1, -1, or 0) is reassigned.
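The label inspection in S21 can be sketched as follows: count how often each label occurs (the data behind a label distribution histogram) and set aside rows whose label is not one of 1, -1, or 0 for manual re-labeling. The function names and the sample rows are hypothetical, and labels are assumed to arrive as strings.

```python
from collections import Counter

VALID_LABELS = {"1", "-1", "0"}  # positive, negative, neutral


def label_histogram(rows):
    """Count occurrences of each raw label string."""
    return Counter(label.strip() for _, label in rows)


def split_by_label_validity(rows):
    """Return (clean, abnormal): clean rows get integer labels,
    abnormal rows keep their raw label for manual review."""
    clean, abnormal = [], []
    for text, label in rows:
        stripped = label.strip()
        if stripped in VALID_LABELS:
            clean.append((text, int(stripped)))
        else:
            abnormal.append((text, label))
    return clean, abnormal


# Hypothetical rows of (comment text, raw label).
rows = [
    ("Stay strong, everyone", "1"),
    ("This is so worrying", "-1"),
    ("New figures were published today", "0"),
    ("Thanks to the medical staff", "10"),
    ("When will this end", "2"),
]
histogram = label_histogram(rows)
clean_rows, abnormal_rows = split_by_label_validity(rows)
```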
S22, converting the text data format of the first information to obtain second information. The format of the epidemic microblog comment data processed in S21 is converted. The original comment data is encoded in GB2312, and processing Chinese text in Python with this encoding can produce garbled characters. The data is therefore converted from GB2312 to UTF-8 to obtain the second information.
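A minimal sketch of the encoding conversion, using Python's built-in `gb2312` and `utf-8` codecs. The function name is hypothetical, and strict error handling is an assumption: bytes that are not valid GB2312 raise an error rather than being silently replaced.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode them as UTF-8."""
    text = raw.decode("gb2312")  # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")


# Round trip on a short sample string.
sample = "疫情防控"
utf8_bytes = gb2312_to_utf8(sample.encode("gb2312"))
```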
S23, performing word segmentation on the second information to obtain third information. To split each whole-sentence epidemic microblog comment into word units, the second information must be segmented. This embodiment uses the jieba Chinese word segmentation library. Each comment text to be segmented is processed with the fast full-mode segmentation method, which scans out all words that can be formed in the sentence. During segmentation, an HMM (Hidden Markov Model) is used to build fine-grained segments, yielding the third information.
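The idea behind full-mode segmentation, listing every dictionary word that occurs anywhere in the sentence, can be illustrated with a toy scanner. This is not the jieba implementation: the dictionary, the sentence and the `max_len` limit are made up for illustration, and the HMM step is not reproduced. With jieba itself, full mode is requested with `jieba.cut(text, cut_all=True)`.

```python
def full_mode_segment(sentence, dictionary, max_len=4):
    """Return every dictionary word found in the sentence, in scan order."""
    words = []
    for start in range(len(sentence)):
        # Try every substring that starts here, up to max_len characters.
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            candidate = sentence[start:end]
            if candidate in dictionary:
                words.append(candidate)
    return words


# Toy dictionary and sentence (illustrative only).
toy_dictionary = {"发烧", "感染", "难过", "好难过"}
segments = full_mode_segment("好难过发烧是感染了", toy_dictionary)
print(segments)
```

Overlapping words such as "好难过" and "难过" are both reported, which is the characteristic behavior of full mode as opposed to a single best segmentation.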
S24, performing stop-word processing on the third information to obtain fourth information. Stop words are removed from the segmented epidemic microblog comment data. To save storage space and improve search efficiency, uninformative words such as modal particles, punctuation marks and auxiliary words are filtered out automatically before or after the natural language data (or text) is processed. The segmentation result of the third information is scanned sequentially against the stop-word list of the Harbin Institute of Technology: a word that appears in the list is removed and a word that does not is retained, finally yielding the fourth information.
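The sequential scan in S24 amounts to a membership test against the stop-word list. A minimal sketch, assuming the list is available as one word per line; the few stop words shown are illustrative and are not the actual Harbin Institute of Technology list.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Illustrative stop words: particles, an auxiliary word and punctuation.
toy_stop_words = load_stop_words(["的", "了", "是", "，", "。", ""])
tokens = ["好难过", "，", "发烧", "是", "感染", "了"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)
```

Storing the list in a set keeps each lookup at constant time, which matters when roughly 100,000 comments are scanned.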
S25, performing feature conversion on the fourth information to obtain microblog public opinion data. The existing epidemic microblog comment data is used as the corpus, and word embedding is performed with a word2vec model. The word2vec word vector model maps words from their original space to a new low-dimensional space in which semantically similar words lie close together. Because semantically similar words are distributed similarly in the vector space, the semantic similarity between words can be measured by computing the distance between their word vectors, which gives word2vec vectors good semantic properties. In addition, this embodiment performs position embedding and segment embedding on the fourth information; the word embedding, segment embedding and position embedding are combined into a comprehensive representation vector that serves as the input of the language model. The fourth information is thereby converted from text data into vector data, yielding the microblog public opinion data. A complete example of the preprocessing is shown in Figure 4 of the specification.
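As a small illustration of measuring semantic similarity through distance in the vector space, the sketch below uses cosine similarity, which is one common choice; the text does not state which distance is used. The three-dimensional vectors are invented for illustration and are not trained word2vec vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical low-dimensional word vectors.
toy_vectors = {
    "fever": [0.9, 0.1, 0.0],
    "cough": [0.8, 0.2, 0.1],
    "construction": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["fever"], toy_vectors["cough"])
sim_unrelated = cosine_similarity(toy_vectors["fever"], toy_vectors["construction"])
print(sim_related > sim_unrelated)
```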
Specifically, in S3, emotion analysis is performed on the microblog public opinion data through the language model to obtain an emotion analysis result. The language model selected in this embodiment is BERT, on which an emotion analysis model for epidemic microblog comment data is built. BERT obtains deep bidirectional representations by considering the context on both sides in all layers; after an additional output layer is attached to this representation, good results can be achieved on many natural language processing tasks with fine-tuning alone. BERT involves two steps, pre-training and fine-tuning. In pre-training, the model is trained on different pre-training tasks using unlabeled data. In fine-tuning, the model is first initialized with the pre-trained parameters, and all parameters are then fine-tuned using labeled data from the specific downstream task. Each downstream task has its own fine-tuned model, even though all are initialized with the same pre-trained parameters. A distinctive feature of BERT is that its model structure is uniform across tasks, with only slight differences between the pre-training architecture and the final downstream architecture. The model structure of BERT is a multi-layer bidirectional Transformer; the structure used in this method is shown in Figure 5 of the specification. The Transformer encoder attends to the context on both sides, whereas the Transformer decoder uses only the context on the left (through masking). The model used here has 12 stacked Transformer layers, a hidden vector dimension of 768 and 110M parameters in total.
The input representation of the BERT model must be able to represent a single sentence unambiguously within a sequence. For a given token, the input is built from its corresponding word embedding, segment embedding and position embedding, as shown in Figure 6 of the specification. In Figure 6, the example sentence (roughly, "so upset, have a fever, am infected") is split into the segments "so upset", "have a fever" and "am infected", which are marked A, B and C respectively. The first token of the sentence is the special classification token [CLS]; the final hidden state of this token is used as the aggregate representation of the sequence for classification tasks. The segments in the sentence are distinguished in two ways: first, a special token [SEP] is inserted between segments; second, a learnable embedding vector is added at every position to indicate whether that position belongs to segment A, B or C. The input to the model is thus a comprehensive vector representation obtained by adding the word embedding, segment embedding and position embedding of the epidemic microblog comment data.
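The input layout described above can be sketched as follows. This is an illustration only: the English tokens stand in for the Chinese example, placing [SEP] after every segment (including the last) follows the usual BERT convention and is an assumption here, and the function names are hypothetical.

```python
def build_bert_inputs(segments):
    """Return tokens, segment ids and position ids for a list of segments."""
    tokens = ["[CLS]"]
    segment_ids = [0]  # [CLS] is counted with the first segment
    for index, segment in enumerate(segments):
        for token in segment:
            tokens.append(token)
            segment_ids.append(index)
        tokens.append("[SEP]")  # separator closes each segment
        segment_ids.append(index)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def combine_embeddings(word_vec, segment_vec, position_vec):
    """Element-wise sum of the three embeddings for one token."""
    return [w + s + p for w, s, p in zip(word_vec, segment_vec, position_vec)]


# Segments A, B and C of the example sentence (English stand-ins).
example = [["so", "upset"], ["have", "a", "fever"], ["am", "infected"]]
tokens, segment_ids, position_ids = build_bert_inputs(example)
print(tokens)
print(segment_ids)
print(position_ids)
```

Each token's final input vector would then be `combine_embeddings(...)` applied to the three vectors looked up from the token, segment and position embedding tables.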
The training flow for epidemic microblog comment public opinion analysis based on the BERT model is shown in Figure 7 of the specification. The process is divided into two steps: pre-training and fine-tuning. First, the epidemic microblog comment data, about one hundred thousand manually labeled comments, is divided into a training set, a validation set and a test set: 90% of the manually labeled data forms the training set, 5% the validation set and 5% the test set. Unlike the word embedding described above, the input of the model is a comprehensive representation of a comment, that is, the sum of its word embedding, segment embedding and position embedding. Pre-training follows the general procedure for pre-trained language models: the pre-trained model is loaded, and the features obtained from the test data after feature conversion are input. In the fine-tuning part, the parameters of the model are then tuned using the validation set. The model parameters with the highest score on the validation set are selected, output as the final model and saved, yielding the BERT-based epidemic microblog comment public opinion analysis model.
The trained model is then tested; the test flow is shown in Figure 8 of the specification. As mentioned above, the BERT-based epidemic microblog comment public opinion analysis model is tested on the 5% of the data reserved as the test set. First, the test set data is converted from the original GB2312 encoding to UTF-8. Then word segmentation and feature conversion are carried out: the epidemic microblog comment text is converted into the comprehensive feature representation of word embedding, segment embedding and position embedding and input into the trained model to obtain the labels of the test data.
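The 90% / 5% / 5% division can be sketched as below. Shuffling before the split and the fixed random seed are assumptions made for reproducibility; the text does not say how the records are assigned to the three sets.

```python
import random


def split_dataset(records, train_percent=90, valid_percent=5, seed=42):
    """Shuffle the records and split them into train, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    n_valid = len(shuffled) * valid_percent // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder, 5% here
    return train, valid, test


# Hypothetical data: 1,000 record identifiers standing in for labeled comments.
records = list(range(1000))
train_set, valid_set, test_set = split_dataset(records)
print(len(train_set), len(valid_set), len(test_set))
```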
The pre-trained model used here is BERT-Base, Chinese, which supports both simplified and traditional Chinese. It has 12 stacked Transformer layers, a hidden vector dimension of 768 and 110M parameters in total. The input of the model is the comprehensive representation vector of the COVID-19 microblog comment text, formed from the word embedding, segment embedding and position embedding of the text; the output of the model is the emotional tendency of the COVID-19 microblog comment. The loss function of the model is the softmax cross entropy, calculated as follows:
L = -\sum_{j=1}^{T} y_j \log S_j
where L is the loss; S_j is the j-th value of the softmax output vector S and indicates the probability that the sample belongs to the j-th class; y_j is the j-th component of the label vector y; and T is the number of categories, with j = 1, 2, …, T.
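A minimal numerical sketch of the softmax cross-entropy loss, assuming a one-hot label vector and three classes (T = 3). The logits are invented for illustration, and subtracting the maximum logit is a standard numerical-stability step that is not part of the formula itself.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities S_1..S_T that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label_vector):
    """L = -sum_j y_j * log(S_j); terms with y_j = 0 contribute nothing."""
    return -sum(
        y * math.log(s) for y, s in zip(label_vector, probabilities) if y > 0
    )


# Hypothetical logits for one comment; the true class is the first one.
probs = softmax([2.0, 0.5, -1.0])
loss = cross_entropy(probs, [1, 0, 0])
print(probs, loss)
```

With a one-hot label the sum collapses to the negative log-probability assigned to the true class, so the loss shrinks toward zero as that probability approaches 1.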
In testing, the accuracy of the BERT-based epidemic microblog public opinion analysis model reaches 72.16%. The specific parameter settings of the experimental model are listed in the following table: the text direction is set to bidi, indicating bidirectional; the dropout probability of the attention mechanism is set to 0.1; the number of neurons in the hidden layer of the encoder is 3072; the size of the pooling layer is 128; the number of attention heads is 12; and the number of layers is 3. The details are as follows:
Table 3. Model parameter settings
[Table 3 is reproduced as images in the original publication.]
The hidden-layer activation function GELU (Gaussian Error Linear Unit) is formulated as follows:
gelu(x) = x P(X ≤ x) = x Φ(x)
where Φ(x) is the cumulative distribution function of the normal distribution, which is the distribution used in the experiment; x is the hidden-layer output; and X is a random variable that follows the normal distribution.
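The GELU formula can be evaluated directly with the error function. The sketch below assumes that Φ is the standard normal cumulative distribution function (mean 0, standard deviation 1), the usual choice for GELU; the function names are illustrative.

```python
import math


def standard_normal_cdf(x):
    """Phi(x) = P(X <= x) for X following the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x):
    """gelu(x) = x * Phi(x)."""
    return x * standard_normal_cdf(x)


for value in (-1.0, 0.0, 1.0):
    print(value, gelu(value))
```

Unlike ReLU, GELU lets small negative inputs through with a reduced weight instead of cutting them to exactly zero.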
A correlation analysis of the COVID-19 epidemic is then obtained from the emotion analysis results. The correlation analysis comprises epidemic development analysis and epidemic emotion analysis; the latter includes a correlation analysis between the epidemic outbreak situation and changes in public opinion sentiment. Correlation analysis and word cloud drawing are performed on the online public opinion data in order to examine the relationship between the severity of the outbreak in different periods, that is, the total number of newly added cases in each period, and the sentiment of online public opinion. The word clouds, drawn from the word frequencies of each period, show the concerns and key discussion topics of online public opinion.
Three-stage analysis of epidemic development: in line with the actual course of the epidemic in mainland China, and to better discuss its development, the analysis results are discussed in stages, and the Python wordcloud module is called to draw the word clouds. In the early stage, the epidemic in the mainland consisted mainly of local cases, which were basically brought under control by mid-to-late March; in early March, imported cases rose sharply as the number of people returning from abroad surged; in June, small-scale resurgences occurred in XX, YY and ZZ provinces. The three stages are therefore: the local epidemic outbreak period (December 2019 to August 2020), the imported case outbreak period (March to August 2020) and the small-scale resurgence period (June to August 2020).
The distribution of emotion analysis results for the local epidemic outbreak period is shown in Figure 9 of the specification. The number of microblog comments is small at the beginning of the period and then reaches a maximum within a short time, reflecting low attention at the outset and a sudden rise in interest as the event was exposed. The appearance of words such as "Huoshan", "retrograde", "monitoring", "sealing", "construction", "constructors" and "refueling" reflects the rapid response measures and the positive attitude that China adopted toward the epidemic.
The distribution of emotion analysis results for the imported case outbreak period is shown in Figure 10 of the specification. The number of microblog comments in this period generally stays at a relatively high and steady level, rising sharply only on April 4 (the national day of mourning). The names of several countries, such as country A, country B and country C, appear in the word cloud; this is related to the fact that the domestic epidemic was basically under control at the time while the number of cases overseas was rising sharply.
The distribution of emotion analysis results for the small-scale resurgence period is shown in Figure 11 of the specification. The number of microblog comments in this period fluctuates. Region-related keywords, such as XX city, reflect to some extent that the areas with small-scale outbreaks attracted greater attention.
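The per-stage word clouds discussed above are drawn from word frequencies. A minimal sketch of the counting step, assuming each comment has already been segmented, had its stop words removed and been assigned to a stage; the stage names and tokens are invented for illustration.

```python
from collections import Counter, defaultdict


def word_frequencies_by_stage(comments):
    """Accumulate a word-count table for every stage."""
    frequencies = defaultdict(Counter)
    for stage, tokens in comments:
        frequencies[stage].update(tokens)
    return frequencies


def top_words(frequencies, stage, k=3):
    """Return the k most frequent words of one stage."""
    return [word for word, _ in frequencies[stage].most_common(k)]


# Hypothetical (stage, tokens) pairs.
comments = [
    ("local", ["lockdown", "construction", "lockdown"]),
    ("imported", ["overseas", "quarantine", "overseas"]),
    ("local", ["construction", "lockdown"]),
]

freqs = word_frequencies_by_stage(comments)
print(top_words(freqs, "local", k=2))
```

Each per-stage `Counter` is a word-to-count mapping, which is the kind of input that the `WordCloud.generate_from_frequencies` method of the wordcloud package accepts.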
Correlation test: to discuss the correlation between the epidemic outbreak situation and changes in public opinion sentiment, both must first be broken down into indicators. Based on the collected data, the epidemic outbreak situation is broken down into the total number of newly added cases, the number of newly added imported cases and the number of newly added local cases; the change in public opinion sentiment is broken down into the total number of comments, the number of positive comments, the number of neutral comments and the number of negative comments. The pairwise correlations among these indicators are then examined.
Next, a correlation test is performed on the above indicators. The overall procedure is divided into three steps: normality test, correlation coefficient calculation and significance test.
When selecting the type of correlation coefficient, a K-S test is used to determine whether both variables follow, or approximately follow, a normal distribution. If they do, the Pearson coefficient may be used.
The Kolmogorov-Smirnov test (K-S test) compares a frequency distribution f(x) with a theoretical distribution g(x), or compares two observed distributions. The null hypothesis H0 is that the two data distributions are consistent, or that the data conforms to the theoretical distribution. The test statistic is:
D = max|f(x) - g(x)|
H0 is rejected when the observed value D > D(n, α), where D(n, α) is the critical value for sample size n and significance level α; otherwise H0 is accepted.
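The statistic D for a normality check can be sketched as follows. The sketch compares the empirical cumulative distribution of a sample with a normal distribution whose mean and standard deviation are estimated from that same sample; this estimation is an assumption, and it means that standard K-S critical values are only approximate (the Lilliefors variant addresses this). The two samples are invented for illustration.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic_normal(sample):
    """D = max |empirical CDF - normal CDF| over the sample points."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    d = 0.0
    for i, x in enumerate(sorted(sample), start=1):
        theoretical = normal_cdf(x, mean, std)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so both sides of the jump are checked.
        d = max(d, i / n - theoretical, theoretical - (i - 1) / n)
    return d


# Hypothetical samples: one roughly bell-shaped, one dominated by an outlier.
near_normal = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.2, 0.5, 1.0, 1.5]
skewed = [0, 0, 0, 0, 0, 0, 0, 0, 100]

d_normal = ks_statistic_normal(near_normal)
d_skewed = ks_statistic_normal(skewed)
print(d_normal, d_skewed)
```

The resulting D would then be compared with the tabulated critical value D(n, α) to decide whether H0 is rejected.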
The Pearson correlation coefficient measures the linear correlation between two variables X and Y and takes a value between -1 and 1. It is defined as the covariance of the two variables divided by the product of their standard deviations:
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}
where cov(X, Y) denotes the covariance between X and Y, and σ_X and σ_Y denote the standard deviations of X and Y, respectively.
Finally, the significance test is carried out. Because a correlation coefficient computed from a sample is subject to randomness, it is necessary to verify that it did not arise by chance. Correlation and significance are two different concepts, but the significance test here indicates whether the observed correlation results from chance. The null hypothesis H0 is that the correlation between the variables is zero. If P < 0.05, a result this extreme is unlikely under H0, so H0 is rejected and the correlation between the variables can be considered to exist at the 95% confidence level, that is, the correlation is significant. If P > 0.05, H0 cannot be rejected and the correlation between the variables is not significant.
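The coefficient calculation and the significance test can be sketched together. The P value below comes from a permutation test, which is a stand-in chosen so that the example needs only the standard library; the text does not state which significance test was used, and the weekly figures are invented.

```python
import math
import random


def pearson(x, y):
    """Pearson coefficient: covariance over the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the deviations cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffles whose |r| is at least the observed |r| (H0: r = 0)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical weekly series: new cases and number of comments.
cases = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
comments = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]

r = pearson(cases, comments)
p = permutation_p_value(cases, comments)
print(r, p, "significant" if p < 0.05 else "not significant")
```

Shuffling one series breaks any real association, so the shuffled coefficients show how large |r| tends to be by chance alone.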
Using the data on newly confirmed cases, imported confirmed cases, local confirmed cases, total number of discussions, number of positive comments, number of negative comments and number of neutral comments over the 32 weeks from January 25, 2020 to August 29, 2020, the following correlation coefficient matrix is obtained:
Table 4. Correlation coefficient matrix of case numbers and microblog comment numbers
              Confirmed   Imported   Local    Discussions   Positive   Negative   Neutral
Confirmed     1.00        -0.26      0.99     0.55          0.54       0.47       0.52
Imported      -0.26       1.00       -0.27    -0.13         -0.21      -0.32      -0.08
Local         0.99        -0.27      1.00     0.55          0.54       0.47       0.52
Discussions   0.55        -0.13      0.55     1.00          0.91       0.89       0.98
Positive      0.54        -0.21      0.54     0.91          1.00       0.91       0.80
Negative      0.47        -0.32      0.49     0.89          0.91       1.00       0.82
Neutral       0.52        -0.08      0.52     0.98          0.80       0.82       1.00
The matrix shows that the number of discussions is correlated to some degree with the number of local confirmed cases; by comparison, the correlation between the number of imported confirmed cases and the number of discussions is low and tends to be negative. Correlation tests are then performed on each variable.
First, a normality test is carried out on each of the 7 variables. The results confirm that all 7 variables pass the normality test, so correlation analysis can be performed on all of them (significance level 0.05). The results of the correlation test are shown in Table 5 below:
Table 5. Correlation test results (P values) for case numbers and microblog comment numbers
              Confirmed   Imported   Local    Discussions   Positive   Negative   Neutral
Confirmed     0.00        0.65       0.00     0.02          0.02       0.05       0.02
Imported      0.15        0.00       0.65     0.94          0.73       0.44       0.94
Local         0.00        0.13       0.00     0.02          0.02       0.05       0.02
Discussions   0.00        0.47       0.00     0.00          0.00       0.00       0.00
Positive      0.00        0.24       0.00     0.00          0.00       0.00       0.00
Negative      0.01        0.07       0.01     0.00          0.00       0.00       0.00
Neutral       0.00        0.67       0.00     0.00          0.00       0.00       0.00
From the correlation test results, the following can be concluded: the total number of discussions is linearly correlated with the number of newly confirmed cases and the number of local confirmed cases, but not with the number of imported confirmed cases. Similarly, the numbers of positive and neutral comments are linearly correlated with the number of newly confirmed cases and the number of local confirmed cases, but not with the number of imported confirmed cases. The number of negative comments has no linear correlation with any of the three case counts.
For positive emotions, people are easily inspired in times of hardship, so positive emotions are readily mobilized. For neutral emotions, a rise in confirmed cases inevitably draws official attention, and the release of a large number of announcements and notices greatly increases the number of neutral comments. For negative emotions, internet users at large do not produce more negative emotion as the epidemic worsens, which to some extent reflects the staged success of current mental health construction, that is, the public has a relatively high capacity to bear negative events.
The BERT-based microblog emotion analysis method for the COVID-19 epidemic provided by this embodiment builds an integrated set of crawling, cleaning, storage, analysis and visualization methods, enabling automatic updating, sorting, analysis and visualization of data. Publicly available structured and unstructured epidemic-related data is acquired, emotion analysis is performed with a tuned BERT model, and visual analysis results are generated. The BERT model is trained and tuned on a manually labeled epidemic microblog emotion training set, which improves its accuracy on epidemic data. As for the results, the method covers the relationship between the public opinion emotion analysis results and the number of epidemic cases, together with word clouds drawn for separate time periods, so that the relationship between the development of epidemic-related public opinion and the outbreak of the epidemic is explained more effectively.
Example 2
This embodiment provides a BERT-based microblog emotion analysis system for the COVID-19 epidemic, corresponding to the method provided in Embodiment 1. The specific scheme is as follows:
A BERT-based microblog emotion analysis system for the COVID-19 epidemic, shown in Figure 12 of the specification, comprises an acquisition unit, a processing unit, an emotion analysis unit and a result analysis unit. Acquisition unit: used for acquiring microblog public opinion information and epidemic data related to public opinion. Processing unit: used for preprocessing the microblog public opinion information to obtain microblog public opinion data. Emotion analysis unit: used for performing emotion analysis on the microblog public opinion data through a language model to obtain an emotion analysis result. Result analysis unit: used for processing the emotion analysis result according to the microblog public opinion information and the epidemic data to obtain a correlation analysis of the COVID-19 epidemic.
The processing unit, shown in Figure 13 of the specification, specifically includes: a perception and cleaning unit, used for performing data perception and cleaning on the comment information to obtain first information; a format conversion unit, used for converting the text data format of the first information to obtain second information; a word segmentation unit, used for performing word segmentation on the second information to obtain third information; a stop-word processing unit, used for performing stop-word processing on the third information to obtain fourth information; and a feature conversion unit, used for performing feature conversion on the fourth information to obtain microblog public opinion data.
The emotion analysis unit, shown in Figure 14 of the specification, includes: a data dividing unit, used for dividing the microblog public opinion data into a training set, a validation set and a test set; a model acquisition unit, used for obtaining the language model according to the validation set and the test set; and an emotion analysis result acquisition unit, used for performing emotion analysis on the training set according to the language model to obtain the emotion analysis result.
The result analysis unit includes: an epidemic development analysis unit, used for processing the emotion analysis result according to the microblog public opinion information and the epidemic data to obtain an epidemic development analysis; and an epidemic emotion analysis unit, used for processing the emotion analysis result according to the microblog public opinion information and the epidemic data to obtain an epidemic emotion analysis, which includes a correlation analysis between the epidemic outbreak situation and changes in public opinion sentiment.
This embodiment builds on Embodiment 1 by turning the BERT-based microblog emotion analysis method for the COVID-19 epidemic into a system that covers automatic updating, sorting, analysis and visualization of data. The system can collect publicly available structured and unstructured epidemic-related data, perform emotion analysis with the tuned BERT model and output visual analysis results. The system analyzes microblog public opinion information and epidemic data from the epidemic period to examine the relationship between changes in the epidemic and the development of public opinion; the analysis covers the trend of public opinion and the severity of the outbreak or number of cases.
Example 3
Fig. 15 is a schematic structural diagram of a computer device according to embodiment 3 of the present invention. The computer device 12 shown in fig. 15 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in FIG. 15, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16. Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media. The system memory 28 may include computer system readable media in the form of volatile memory.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices that enable computer device 12 to communicate with one or more other computing devices.
The processing unit 16 executes various functional applications and data processing by running the program stored in the system memory 28, for example, implementing the BERT-based microblog emotion analysis method for the COVID-19 epidemic provided in Embodiment 1 of the present invention, the method including:
S1, acquiring microblog public opinion information and epidemic data related to the COVID-19 epidemic; S2, preprocessing the microblog public opinion information to obtain microblog public opinion data; S3, performing emotion analysis on the microblog public opinion data through the language model to obtain an emotion analysis result; and S4, processing the emotion analysis result according to the microblog public opinion information and the epidemic data to obtain a correlation analysis of the COVID-19 epidemic. S2 specifically includes: S21, performing data perception and cleaning on the microblog public opinion information to obtain first information; S22, converting the text data format of the first information to obtain second information; S23, performing word segmentation on the second information to obtain third information; S24, performing stop-word processing on the third information to obtain fourth information; and S25, performing feature conversion on the fourth information to obtain microblog public opinion data. S3 specifically includes: S31, dividing the microblog public opinion data into a training set, a validation set and a test set; S32, obtaining a language model according to the validation set and the test set; and S33, performing emotion analysis on the training set according to the language model to obtain the emotion analysis result.
In this embodiment, the BERT-based microblog emotion analysis method for the COVID-19 epidemic is applied to a specific computer device. The method is stored in the memory, and when the processor executes the program stored in the memory, the method runs and carries out the public opinion analysis.
Of course, those skilled in the art will understand that the processor can also implement the technical scheme of the BERT-based microblog emotion analysis method for the COVID-19 epidemic provided by any embodiment of the present invention.
Example 4
Embodiment 4 provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps of the BERT-based microblog emotion analysis method for the COVID-19 epidemic, the method including:
S1, acquiring microblog public opinion information and epidemic data related to the COVID-19 epidemic; S2, preprocessing the microblog public opinion information to obtain microblog public opinion data; S3, performing emotion analysis on the microblog public opinion data through the language model to obtain an emotion analysis result; and S4, processing the emotion analysis result according to the microblog public opinion information and the epidemic data to obtain a correlation analysis of the COVID-19 epidemic. S2 specifically includes: S21, performing data perception and cleaning on the microblog public opinion information to obtain first information; S22, converting the text data format of the first information to obtain second information; S23, performing word segmentation on the second information to obtain third information; S24, performing stop-word processing on the third information to obtain fourth information; and S25, performing feature conversion on the fourth information to obtain microblog public opinion data. S3 specifically includes: S31, dividing the microblog public opinion data into a training set, a validation set and a test set; S32, obtaining a language model according to the validation set and the test set; and S33, performing emotion analysis on the training set according to the language model to obtain the emotion analysis result.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this embodiment, the BERT-based microblog emotion analysis method for the new coronary pneumonia epidemic is applied to a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the steps of the method provided by the invention. Provided in this form, the method is simple and fast to use, easy to store, and not easily lost.
In summary, the invention provides a BERT-based microblog emotion analysis method, device and medium for the new coronary pneumonia epidemic. By building an integrated pipeline for crawling, cleaning, storing, analyzing and visualizing, the data can be updated, organized, analyzed and visualized automatically. Structured and unstructured data related to the new coronary pneumonia epidemic are acquired, emotion analysis is performed with a BERT model after parameter adjustment, and a visual analysis result is generated. The BERT model is trained and tuned on a manually labeled training set of epidemic microblog emotions, which improves the accuracy of the model on epidemic data. For the public opinion emotion analysis result, the method also examines its relationship with the number of epidemic cases and draws word clouds by time period, so that the relationship between the development of epidemic-related public opinion and the outbreak of the epidemic is explained more effectively.
It will be understood by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a memory device and executed by the computing device; alternatively, they may each be fabricated as a separate integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that those skilled in the art can make various obvious changes, rearrangements and substitutions without departing from the scope of protection of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the spirit of the invention; the scope of the present invention is determined by the scope of the appended claims.
The above discloses only a few specific implementation scenarios of the present invention. The present invention is not limited to them, and any variation that those skilled in the art can conceive falls within the scope of protection of the present invention.

Claims (15)

1. A BERT-based microblog emotion analysis method for the new coronary pneumonia epidemic, characterized by comprising the following steps:
S1, acquiring microblog public opinion information and epidemic situation data related to the new coronary pneumonia epidemic;
S2, preprocessing the microblog public opinion information to obtain microblog public opinion data;
S3, performing emotion analysis on the microblog public opinion data through a language model to obtain an emotion analysis result;
S4, processing the emotion analysis result according to the microblog public opinion information and the epidemic situation data to obtain a correlation analysis of the new coronary pneumonia epidemic.
2. The method of claim 1, wherein the correlation analysis comprises an epidemic progression analysis and an epidemic emotion analysis;
the epidemic emotion analysis comprises a correlation analysis between the epidemic outbreak and changes in public opinion emotion.
3. The method of claim 2, wherein the changes in public opinion emotion are decomposed into the following indicators: the total number of comments, the number of positive comments, the number of neutral comments and the number of negative comments; and the epidemic outbreak is decomposed into the following indicators: the total number of new cases, the number of new imported cases and the number of new local cases;
and a correlation test is carried out on the indicators decomposed from the changes in public opinion emotion and from the epidemic outbreak.
4. The method according to claim 3, characterized in that the correlation test specifically comprises:
sequentially carrying out a normal distribution test, a correlation coefficient calculation and a significance test on the indicators.
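For illustration only, the three-stage correlation test can be sketched in Python using the standard library. The claim does not name specific tests, so everything here is an assumption: the Jarque-Bera statistic stands in for the normal distribution test (its asymptotic p-value is rough for short series), Pearson or Spearman is chosen as the correlation coefficient depending on the normality result, and a permutation test stands in for the significance test. The sample series are made-up numbers, not data from the patent.

```python
import math
import random

def jarque_bera(xs):
    """Return (JB statistic, asymptotic p-value from the chi-square(2) tail)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return jb, math.exp(-jb / 2)  # chi-square with 2 dof: P(X > x) = exp(-x/2)

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

def permutation_p_value(xs, ys, stat, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the statistic `stat`."""
    rng = random.Random(seed)
    observed = abs(stat(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(stat(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def correlation_test(xs, ys, alpha=0.05):
    """Normality check, then coefficient, then significance."""
    normal = jarque_bera(xs)[1] > alpha and jarque_bera(ys)[1] > alpha
    stat = pearson if normal else spearman
    return {"method": stat.__name__,
            "coefficient": stat(xs, ys),
            "p_value": permutation_p_value(xs, ys, stat)}

# Hypothetical daily indicators: negative comments vs. total new cases.
negative_comments = [12, 15, 20, 26, 31, 40, 44, 53]
new_cases = [3, 5, 8, 11, 14, 19, 21, 27]
result = correlation_test(negative_comments, new_cases)
```

The same call can be repeated for each pair of indicators, for example the number of positive comments against the number of new local cases.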
5. The method according to claim 1, wherein S1 specifically includes:
capturing content from the internet with a crawler program written in Python, and extracting from that content the microblog public opinion information and the epidemic situation data related to the new coronary pneumonia epidemic;
and storing the microblog public opinion information and the epidemic situation data in a database.
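As a minimal sketch of the storage half of this step, the following Python code writes crawled records into a SQLite database. The claim specifies neither the database nor a schema, so SQLite, the table names, the column names and the example rows are all assumptions; the crawling itself is left out.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the two hypothetical tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "post_id TEXT PRIMARY KEY, posted_at TEXT NOT NULL, text TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_cases ("
        "day TEXT PRIMARY KEY, new_total INTEGER, "
        "new_imported INTEGER, new_local INTEGER)"
    )
    return conn

def save_posts(conn, posts):
    """Insert posts; a post crawled twice is stored only once."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO posts VALUES (:post_id, :posted_at, :text)",
            posts,
        )

def save_daily_cases(conn, rows):
    """Insert or update the case statistics for each day."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO daily_cases "
            "VALUES (:day, :new_total, :new_imported, :new_local)",
            rows,
        )

# Made-up records standing in for crawler output.
conn = open_store()
save_posts(conn, [
    {"post_id": "p1", "posted_at": "2020-02-01", "text": "example text A"},
    {"post_id": "p1", "posted_at": "2020-02-01", "text": "example text A"},
    {"post_id": "p2", "posted_at": "2020-02-02", "text": "example text B"},
])
save_daily_cases(conn, [
    {"day": "2020-02-01", "new_total": 10, "new_imported": 2, "new_local": 8},
])
post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Using the post identifier as the primary key keeps repeated crawls from duplicating rows, which matters when the data is updated automatically.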
6. The method of claim 1, wherein
the epidemic situation data comprises statistics of confirmed, suspected, cured and dead cases in a region and statistics of imported cases in the region;
the microblog public opinion information is epidemic-related microblog content on a microblog social networking site;
the microblog public opinion information comprises comment information, the comment information being comments posted by netizens on microblogs about the new coronary pneumonia epidemic.
7. The method of claim 6, wherein the comment information is tagged with an emotion label, and the comment information is classified according to the emotion label as:
positive comment information: comment data with a positive emotional tendency;
negative comment information: comment data with a negative emotional tendency;
neutral comment information: comment data without emotional tendency;
abnormal comment information: comment data whose emotion label is abnormal.
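The classification by emotion label can be sketched as follows. The label strings, the record fields and the rule that a missing or unknown label counts as abnormal are assumptions made for this example. The per-day tally produces the comment-count indicators (total, positive, neutral, negative) used in the correlation test.

```python
from collections import Counter

VALID_LABELS = {"positive", "negative", "neutral"}  # assumed label values

def split_by_label(comments):
    """Separate comments with a usable label from abnormal ones."""
    usable, abnormal = [], []
    for comment in comments:
        if comment.get("label") in VALID_LABELS:
            usable.append(comment)
        else:
            abnormal.append(comment)  # to be labeled again
    return usable, abnormal

def daily_counts(comments):
    """Count comments per day, in total and per emotion label."""
    counts = {}
    for comment in comments:
        day = counts.setdefault(comment["day"], Counter())
        day[comment["label"]] += 1
        day["total"] += 1
    return counts

# Made-up comment records.
comments = [
    {"day": "2020-02-01", "label": "positive"},
    {"day": "2020-02-01", "label": "negative"},
    {"day": "2020-02-01", "label": "negative"},
    {"day": "2020-02-02", "label": "neutral"},
    {"day": "2020-02-02", "label": None},
]
usable, abnormal = split_by_label(comments)
counts = daily_counts(usable)
```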
8. The method according to claim 1, wherein S2 specifically includes:
S21, performing data perception and cleaning on the microblog public opinion information to obtain first information;
S22, converting the format of the text data of the first information to obtain second information;
S23, performing word segmentation on the second information to obtain third information;
S24, performing stop word processing on the third information to obtain fourth information;
S25, performing feature conversion on the fourth information to obtain the microblog public opinion data.
9. The method according to claim 8, wherein S21 specifically includes: performing data perception and cleaning on the positive comment information, the negative comment information and the neutral comment information;
and judging the emotion label again for the abnormal comment information;
and/or S22 specifically includes: converting the encoding of the first information into UTF-8 to obtain the second information;
and/or S23 specifically includes: performing word segmentation on the second information through a hidden Markov model, based on a Chinese word segmentation library, to obtain the third information;
and/or S24 specifically includes: scanning the words of the third information in sequence against a stop word list, removing each word that is in the stop word list and keeping each word that is not, to obtain the fourth information;
and/or S25 specifically includes: converting the fourth information from text data into vector data by performing word embedding, segment embedding and position embedding on the fourth information, to obtain the microblog public opinion data.
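Two of these preprocessing steps, the UTF-8 conversion and the stop word scan, are simple enough to sketch with the Python standard library. The source encoding (GB18030), the function names and the tiny stop word list are assumptions for illustration. Word segmentation is not implemented here; the tokens are assumed to come from a separate Chinese word segmentation library.

```python
def to_utf8(raw: bytes, source_encoding: str = "gb18030") -> bytes:
    """Re-encode raw text bytes from an assumed source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

def remove_stop_words(tokens, stop_words):
    """Scan tokens in order and keep those not in the stop word list."""
    return [token for token in tokens if token not in stop_words]

# Made-up input: one short comment, already segmented into words.
raw_comment = "今天的疫情数据".encode("gb18030")
utf8_comment = to_utf8(raw_comment)
tokens = ["今天", "的", "疫情", "数据"]  # assumed segmenter output
stop_words = {"的", "了", "是"}          # hypothetical stop word list
kept = remove_stop_words(tokens, stop_words)
```

Holding the stop word list in a set makes each membership check constant time, which keeps the sequential scan linear in the number of words.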
10. The method according to claim 1, wherein S3 specifically includes:
S31, dividing the microblog public opinion data into a training set, a verification set and a test set;
S32, obtaining a language model according to the verification set and the test set;
S33, performing emotion analysis on the training set according to the language model to obtain the emotion analysis result;
wherein S32 specifically includes:
loading a language model;
training the language model by using the verification set to obtain a trained language model;
testing the trained language model by using the test set;
and acquiring the language model that passes the test.
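The partitioning in S31 can be sketched as a seeded shuffle followed by slicing. The 80/10/10 ratio, the seed and the function name are assumptions, since the claim does not state how the data is divided.

```python
import random

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle the samples and cut them into three disjoint subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])  # the remainder is the test set

# Integers stand in for labeled microblog samples.
samples = list(range(100))
train_set, val_set, test_set = split_dataset(samples)
```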
11. The method of claim 10, wherein the language model comprises a BERT model;
the input of the BERT model is a combined representation vector composed of the word embedding, the segment embedding and the position embedding.
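In the standard BERT formulation, the input vector for each token is the element-wise sum of its word (token) embedding, its segment embedding and its position embedding. The sketch below assumes that formulation and shows only the summation; the table sizes, the random initialization and the token ids are toy values, not values from the patent.

```python
import random

def make_table(rows, dim, rng):
    """Build a toy embedding table with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def bert_input(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum the word, segment and position embeddings for each token."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[pos])
        ])
    return vectors

rng = random.Random(0)
dim = 4
tok_table = make_table(10, dim, rng)  # vocabulary of 10 toy tokens
seg_table = make_table(2, dim, rng)   # segment A and segment B
pos_table = make_table(8, dim, rng)   # up to 8 positions
vectors = bert_input([1, 5, 2], [0, 0, 1], tok_table, seg_table, pos_table)
```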
12. The method of claim 11, wherein the loss function of the BERT model is the softmax cross entropy, calculated as follows:
$$L = -\sum_{j=1}^{T} y_j \log S_j$$
wherein $L$ is the loss, $S_j$ is the j-th value of the softmax output vector $S$ and represents the probability that the sample belongs to the j-th class, $T$ is the number of classes, and $y_j$ is the j-th component of the label vector.
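As a worked example, the standard softmax cross entropy, L = -sum over j of y_j * log(S_j) with S the softmax of the model's output scores, can be computed in a few lines of Python. The three-class setup and the scores are made-up values, and the function names are chosen for this sketch.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot):
    """L = -sum_j y_j * log(S_j), with S = softmax(logits)."""
    probs = softmax(logits)
    return -sum(y * math.log(s) for y, s in zip(one_hot, probs) if y > 0)

# Three classes, e.g. positive / neutral / negative; true class is the first.
uniform_loss = cross_entropy([0.0, 0.0, 0.0], [1, 0, 0])
confident_loss = cross_entropy([5.0, 0.0, 0.0], [1, 0, 0])
```

With equal scores the loss equals log(3), the value for a uniform guess over three classes; it falls toward zero as the score of the true class grows.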
13. The method of claim 11, wherein the text direction of the BERT model is set to bidi, the dropout probability of the hidden layer is set to 0.1, the number of neurons in the encoder hidden layer is 3072, the pooling layer size is 128, the number of attention heads is 12, and the number of layers is 3.
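For reference, the hyperparameters listed in this claim can be collected in one immutable settings object. The field names are hypothetical and chosen for readability; they are not taken from the patent or from any particular BERT library, and mapping them onto a real framework's configuration keys is left open.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class BertSettings:
    """Hyperparameter values as listed in the claim; field names are hypothetical."""
    text_direction: str = "bidi"
    hidden_dropout_prob: float = 0.1
    encoder_hidden_units: int = 3072
    pooling_size: int = 128
    num_attention_heads: int = 12
    num_layers: int = 3

settings = BertSettings()
settings_dict = asdict(settings)  # plain dict, e.g. for logging or JSON export
```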
14. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the BERT-based microblog emotion analysis method for the new coronary pneumonia epidemic of any one of claims 1-13.
15. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the BERT-based microblog emotion analysis method for the new coronary pneumonia epidemic according to any one of claims 1 to 13.
CN202011445578.XA 2020-12-11 2020-12-11 BERT-based method, equipment and medium for analyzing microblog emotion of new coronary pneumonia epidemic situation Pending CN112883286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011445578.XA CN112883286A (en) 2020-12-11 2020-12-11 BERT-based method, equipment and medium for analyzing microblog emotion of new coronary pneumonia epidemic situation

Publications (1)

Publication Number Publication Date
CN112883286A true CN112883286A (en) 2021-06-01

Family

ID=76043300

Country Status (1)

Country Link
CN (1) CN112883286A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150044208A1 (en) * 2011-09-23 2015-02-12 Technophage, Investigaçäo E Desenvolvimento Em Biotecnologia, Sa Modified Albumin-Binding Domains and Uses Thereof to Improve Pharmacokinetics
CN111444404A (en) * 2020-03-19 2020-07-24 杭州叙简科技股份有限公司 Social public opinion monitoring system based on microblog and monitoring method thereof
CN111933300A (en) * 2020-09-28 2020-11-13 平安科技(深圳)有限公司 Epidemic situation prevention and control effect prediction method, device, server and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘洪浩: "基于深度学习的COVID-19疫情期间网名情绪分析", 《软件》 *
陈璟浩等: "突发公共卫生事件中中国网民关注度分析—基于新冠肺炎网络舆情数据", 《现代情报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468331A (en) * 2021-07-07 2021-10-01 西北大学 Public opinion information emotion classification method
CN113641867A (en) * 2021-08-16 2021-11-12 中国科学院自动化研究所 System, method and equipment for measuring inter-city relation based on microblog public sentiment
CN113641867B (en) * 2021-08-16 2023-07-14 中国科学院自动化研究所 Inter-city relationship measurement system, method and equipment based on microblog public opinion
CN113871019A (en) * 2021-12-06 2021-12-31 江西易卫云信息技术有限公司 Disease public opinion monitoring method, system, storage medium and equipment
CN114896522A (en) * 2022-04-14 2022-08-12 北京航空航天大学 Multi-platform information epidemic situation risk assessment method and device
CN117422063A (en) * 2023-12-18 2024-01-19 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system
CN117422063B (en) * 2023-12-18 2024-02-23 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination