CN113449508A

CN113449508A - Internet public opinion correlation deduction prediction analysis method based on event chain

Info

Publication number: CN113449508A
Application number: CN202110799240.2A
Authority: CN
Inventors: 李仁德; 马皓添; 曹春萍
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2021-07-15
Filing date: 2021-07-15
Publication date: 2021-09-28
Anticipated expiration: 2041-07-15
Also published as: CN113449508B

Abstract

The invention discloses an event chain-based online public opinion correlation deduction prediction analysis method, which comprises the following steps: extracting related node events from microblog text data according to event development, and forming a topic evolution tree and a public opinion evolution probability map through clustering, association, and pattern-mining-based matching and prediction; pre-training on a microblog training corpus with an ELMo model and vectorizing the microblog posts; merging similar microblog data through a One-Pass clustering algorithm to obtain a node event set; learning from sparsely labeled data through Active Learning, with label quality improved through human-computer interaction; and finding the number of transition pairs among the labels through Seq2Pat, then constructing a Markov chain, forming a topic tree, and predicting the evolution trend of the node events. The invention fills the gap in extracting, associating, and predicting sub-events within complex public opinion events and fully considers a human-computer interaction labeling method with expert intervention, so that it is suitable for locating and accurately predicting public opinion in deduction analysis.

Description

Internet public opinion correlation deduction prediction analysis method based on event chain

Technical Field

The invention relates to the technical field of book recommendation systems, in particular to an online public opinion correlation deduction prediction analysis method based on an event chain.

Background

Microblog public opinion is a new form of internet public opinion expression and has become an uncertain factor affecting national security and social stability. Studying how microblog discussion content changes during online public opinion emergencies, and mining the rules by which public opinion evolves as events develop, is of practical significance for guiding online public opinion and for predicting the macroscopic direction of similar events. However, how to organize massive public opinion information effectively has remained a research focus in recent years. Traditional online public opinion analysis focuses only on the content evolution of public opinion as a whole in social media and ignores the fine-grained evolution of an event's strength, so it is difficult for a user to capture how different aspects of the event evolve. Moreover, the interpretation of public opinion evolution is limited in the face of ever-growing volumes of data. Under these circumstances, how to discover and track the emergence and evolution of events in massive data and mine their trends has become an important problem in public opinion analysis.

Invention patent CN201910120187.1 provides a method for predicting the evolution result of online public opinion, which obtains the information of all individuals involved in the propagation of online public opinion, constructs a directed weighted network from that information, extracts each individual's initial opinion, and calculates a prediction of how online public opinion evolves over time from the initial opinions and the directed weighted network. Invention patent CN201910452142.4 provides a two-layer network public opinion propagation prediction method based on a continuous Markov model, which predicts public opinion conditions from the nodes of an online and an offline network layer. Invention patent CN201610096775.2 provides a method for analyzing and predicting online public opinion based on an LDA topic model, which obtains the trend of each LDA topic's strength over time from the training result and thereby realizes dynamic analysis and prediction of online public opinion. Invention patent CN202010668147.3 discloses a method, a device, and an apparatus for predicting the stable state of public opinion propagation, which are used to judge accurately the stable state of public opinion propagation on a target social platform.

The above patents do not deeply analyze the connection and development of topic content, so a user cannot clearly grasp the main content and evolution of an event. The basic idea of event evolution analysis is to extract the content of an event at its different development stages and present it to the user in chronological order. However, the key questions, namely which events are the most important to extract, whether the events have an evolutionary relationship, and what method can predict the trend of similar online public opinion situations, have not been answered in depth by traditional evolution research.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide an event chain-based online public opinion correlation deduction prediction analysis method that fills the gap in extracting, associating, and predicting sub-events within complex public opinion events, fully considers a human-computer interaction labeling method with expert intervention, and is suitable for locating and accurately predicting public opinion in deduction analysis. To achieve the above objects and other advantages, and in accordance with the purpose of the invention, there is provided an event chain-based online public opinion correlation deduction prediction analysis method, comprising:

S1, pre-training on the microblog training corpus with an ELMo model to obtain a pre-trained model, and then performing word vectorization on the test set;

S2, merging similar microblog data through a One-Pass clustering algorithm to obtain a node event set;

S3, vectorizing the keywords of the node event set from step S2 with the pre-trained model again to obtain a vectorized representation of the node set;

S4, learning from sparsely labeled data through Active Learning, and labeling each node;

S5, finding the number of transition pairs among the labels through Seq2Pat;

S6, establishing associations among the node events through a Markov chain, and generating an association tree;

S7, predicting the public opinion development trend of each node event through the n-step Markov chain.

Preferably, step S1 further includes preprocessing the microblog text, comprising the following steps:

S11, extracting keywords from the word set of each segmented blog post through the TextRank algorithm to obtain a keyword set for each blog post, and building the ELMo model on these keyword sets;

S12, using a bidirectional LSTM language model, concatenating ELMo directly as a feature to the word vector input of a task-specific model or to the model's highest-layer representation, and vectorizing the words in the test corpus.

Preferably, step S2 includes calculating the cosine similarity of each vectorized blog post and obtaining the node events through One-Pass clustering.

Preferably, step S3 includes the following steps:

S31, performing word vectorization again through the pre-trained model obtained in step S1;

S32, performing a word sense disambiguation task: using the SemCor 3.0 corpus, the vector representations of all words in the corpus are calculated with the biLMs;

S33, obtaining the possible position and the vectorization of each word by the 1-nearest-neighbor method.

Preferably, in step S4, the node events are labeled through Active Learning with Pool-based Sampling.

Preferably, in step S5, the Seq2Pat constraint-based sequential pattern mining algorithm is used, whose sequence database representation based on a multi-valued decision diagram (MDD) compactly encodes the item sequences and their attributes by exploiting symmetry.

Preferably, in step S6, a Markov chain model is calculated from the prior probabilities and transition probabilities of the node events, and an event chain state transition probability tree is generated.

Preferably, in step S7, a steady state is calculated through the n-step Markov chain, and an event evolution probability map of each child node is generated.

Compared with the prior art, the invention has the beneficial effects that: .

Drawings

Fig. 1 is a flowchart illustrating an event chain-based internet public opinion association deduction prediction analysis method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, an event chain-based online public opinion correlation deduction prediction analysis method comprises steps S1 to S7 as set out above.

Further, step S1 also includes preprocessing the microblog text through steps S11 and S12 set out above.

Further, step S2 includes calculating the cosine similarity of each vectorized blog post and obtaining the node events through One-Pass clustering.

Further, step S3 includes steps S31 to S33 set out above.

Further, in step S4, the node events are labeled through Active Learning with Pool-based Sampling.

Further, in step S5, the Seq2Pat constraint-based sequential pattern mining algorithm is used, whose sequence database representation based on a multi-valued decision diagram (MDD) compactly encodes the item sequences and their attributes by exploiting symmetry.

Further, in step S6, a Markov chain model is calculated from the prior probabilities and transition probabilities of the node events, and an event chain state transition probability tree is generated.

Further, in step S7, a steady state is calculated through the n-step Markov chain, and an event evolution probability map of each child node is generated.

The embodiment mode is as follows:

Step S1: ELMo model word vectorization.

(1) Text preprocessing. The text is processed further on the basis of the data preprocessing. Microblog text contains many special symbols, emoticons, links, and other content unrelated to the event, such as the frequently appearing words "forward" and "microblog", "@user" mentions, and annotation symbols; if these are not removed, they interfere with the subsequent text analysis. Text enclosed in "#topic#" markers directly expresses the topic content and therefore needs to be preserved. The processed microblog text is then segmented into words using the jieba word segmentation tool. Stop words, modal particles such as "di" and "o", special symbols, and useless punctuation marks are removed from the segmented text to obtain a word set for each blog post.

(2) Keyword extraction. To capture the core content of a text quickly and condense its topic, keywords are first extracted from the text. Keywords are extracted from the word set of each segmented blog post by the TextRank algorithm to obtain a keyword set for each blog post.

(3) Language model construction. The ELMo model builds and uses a bidirectional LSTM language model consisting of a forward language model and a backward language model. Each LSTM layer has 4096 units and a 512-dimensional projection, and the objective function maximizes the joint likelihood of the two directional language models. After the language model is pre-trained, the intermediate layers of the bidirectional language model are combined by summation; in the simplest case, the representation of the highest layer is used as ELMo.

(4) Downstream use. When a supervised NLP task is carried out, ELMo is concatenated directly, as a feature, to the word vector input of the task-specific model or to the model's highest-layer representation, and the words in the test corpus are vectorized.
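The preprocessing and keyword extraction in (1) and (2) can be sketched as follows. This is only an illustrative sketch under stated assumptions: a regular-expression tokenizer on English text stands in for jieba segmentation, the noise-word and stop-word lists are hypothetical, and the TextRank variant shown is the common PageRank-over-co-occurrence-graph formulation with an assumed window size and damping factor. None of the names below come from the patent.

```python
import re
from collections import defaultdict

# Hypothetical noise patterns and word lists; a real system would tune both.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\S+")
NOISE_WORDS = {"forward", "microblog"}
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "is"}


def clean_post(text):
    """Strip links and @mentions; keep the words inside #topic# markers."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return text.replace("#", " ")  # drop the markers, keep the topic text


def tokenize(text):
    """Regex tokenizer standing in for jieba word segmentation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and w not in NOISE_WORDS]


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    # Sort by score, breaking ties alphabetically for reproducibility.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


post = (
    "forward @user1 #earthquake relief# rescue teams reach earthquake zone, "
    "relief supplies reach rescue teams http://t.cn/abc"
)
tokens = tokenize(clean_post(post))
keywords = textrank_keywords(tokens)
```

Here `tokens` no longer contains the mention, the link, or the word "forward", while the words inside the topic markers survive; `keywords` holds the three highest-ranked words of the post.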

Step S2: One-Pass clustering. Similarity is calculated for each vectorized blog post, and the posts are clustered as follows:
① read a blog post from the data set;
② construct a new cluster from this post;
③ if the end of the data set has been reached, go to ⑥, otherwise read in a new blog post;
④ calculate the distance between the post and each existing cluster and select the cluster with the smallest distance;
⑤ if the minimum distance exceeds a given threshold, go to ②; otherwise merge the post into that cluster, update the cluster center, and go to ③;
⑥ end.
The algorithm uses the cosine distance formula to calculate the distance between a node and a cluster center.

An online public opinion event passes through several stages and gives rise to related derived public opinion events. These are marked as node events, and the node events corresponding to the development stages of the online public opinion event are summarized from the extracted public opinion event clusters. For example, the 9/11 terrorist attacks include development stages such as "terrorist attack occurs", "government emergency response", "casualties", "mourning for the victims", and "arrest of suspects", each of which is a node event. A node event expresses the core meaning of a derived public opinion event and can serve as a summary of the event, helping to describe which facets or analytical angles the event contains; different facets also correspond to different focuses of public attention.
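A minimal sketch of the One-Pass procedure with cosine distance is given below. The toy two-dimensional vectors and the threshold of 0.2 are assumptions chosen for illustration; in the method described here the vectors would come from the ELMo model, and the cluster center is assumed to be the running mean of its members.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def one_pass_cluster(vectors, threshold):
    """Single scan: each vector joins its nearest cluster or opens a new one."""
    centers, members = [], []
    for idx, vec in enumerate(vectors):
        best, best_dist = None, float("inf")
        for c, center in enumerate(centers):
            d = cosine_distance(vec, center)
            if d < best_dist:
                best, best_dist = c, d
        if best is None or best_dist > threshold:
            # Too far from every existing cluster: start a new one.
            centers.append(list(vec))
            members.append([idx])
        else:
            # Merge and update the center as the running mean of its members.
            members[best].append(idx)
            n = len(members[best])
            centers[best] = [
                (c * (n - 1) + x) / n for c, x in zip(centers[best], vec)
            ]
    return members, centers


posts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
clusters, centers = one_pass_cluster(posts, threshold=0.2)
# clusters == [[0, 1, 4], [2, 3]]
```

Because the data is scanned only once, the result depends on the order of the posts and on the threshold, which is why the threshold is treated as a tunable parameter in this sketch.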

Step S3: node event vectorization. After the clustering in S2, each node event comprises several blog posts, and the word sets of the posts in each cluster are merged by union to obtain a combined representation of the node event. Text clustering reduces massive public opinion events to an order of magnitude that people can interpret, and at the same time yields mutually independent node events. Treating a node event as a simple bag of words is inconvenient for evolution analysis, so each public opinion event is summarized by combining the keyword set of the node event cluster with the text content of the corresponding original data. For each node event, word vectorization is performed again through the pre-trained model obtained in S1. A word sense disambiguation task is then carried out using the SemCor 3.0 corpus, a sense-annotated corpus in which each word corresponds to a position in WordNet. Based on this corpus, the biLMs are first used to calculate the vector representations of all words in the corpus, and the vectors of the words located at the same WordNet position are then averaged. At test time, the biLM is applied to a given target word in a target blog post, and the possible WordNet position of the word is determined by the 1-nearest-neighbor method against the per-position vectors obtained during training.
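The averaging and 1-nearest-neighbor lookup described above can be illustrated with a small sketch. The sense identifiers and the two-dimensional context vectors are hypothetical stand-ins: in the described method the vectors would be biLM outputs and the senses would be WordNet positions from SemCor 3.0.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_sense_vectors(labeled):
    """Average the context vectors of all training words sharing a sense."""
    grouped = defaultdict(list)
    for sense, vec in labeled:
        grouped[sense].append(vec)
    return {sense: mean_vector(vecs) for sense, vecs in grouped.items()}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_sense(vec, sense_vectors):
    """1-nearest-neighbor: pick the sense whose mean vector is most similar."""
    return max(sense_vectors, key=lambda s: cosine(vec, sense_vectors[s]))


# Hypothetical sense-annotated training data: (sense id, context vector).
training = [
    ("bank#finance", [0.9, 0.1]),
    ("bank#finance", [0.8, 0.2]),
    ("bank#river", [0.1, 0.9]),
    ("bank#river", [0.2, 0.8]),
]
sense_vectors = build_sense_vectors(training)
predicted = nearest_sense([0.85, 0.15], sense_vectors)
# predicted == "bank#finance"
```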

Step S4: labeling by human-computer interaction. Node events are labeled through Active Learning with Pool-based Sampling, as follows:
① divide the data into a pool and a test set;
② select k samples from the pool for initial training and label them; the remaining data become the validation set;
③ normalize all the sets;
④ train the model on the training set with balanced weights;
⑤ apply the trained model to the validation set to obtain the probability of each sample;
⑥ apply the trained model to the test set to obtain the performance metrics;
⑦ according to the probability of each sample, select the k most informative samples, i.e. those about which the model is most uncertain;
⑧ move these k samples from the validation set to the training set and query their labels;
⑨ reverse the normalization of all data sets;
⑩ stop if the stopping criterion is met, otherwise go to ③.

During labeling, the following points need to be noted: (1) The fully supervised performance of the selected algorithm is usually an upper bound, so several algorithms should be tried, such as Support Vector Machines (SVM) with linear kernels, Random Forests (RF), and logistic regression (LOG). (2) After samples are removed from the validation set, the normalization of all sets must be reversed and applied again, because the sample distribution changes in both the new validation set and the new training set. (3) The sample selection function relies on the sample probabilities derived from the trained model, so only algorithms that expose per-sample probabilities can be used. (4) k is a hyperparameter.

Four known selection functions are proposed, of which three are described here: a. random selection: k random samples are selected from the validation set; b. entropy selection: the k samples with the highest entropy are selected; c. margin selection: the k samples with the smallest difference between their two highest class probabilities are selected, i.e. the margin is large for samples that the model assigns to a single class with high certainty and small for samples whose class probabilities are very similar.
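The three selection functions described above can be sketched as follows. The probability table is made-up illustrative data standing in for the per-sample class probabilities that a trained model would produce on the validation set, and the function names are not taken from the patent.

```python
import math
import random


def entropy_scores(probs):
    """Shannon entropy of each sample's class distribution."""
    return [-sum(p * math.log(p) for p in row if p > 0) for row in probs]


def margin_scores(probs):
    """Difference between the two highest class probabilities per sample."""
    scores = []
    for row in probs:
        top, second = sorted(row, reverse=True)[:2]
        scores.append(top - second)
    return scores


def select_random(probs, k, seed=0):
    """Random selection: k sample indices drawn without replacement."""
    return random.Random(seed).sample(range(len(probs)), k)


def select_entropy(probs, k):
    """Entropy selection: the k samples with the highest entropy."""
    scores = entropy_scores(probs)
    return sorted(range(len(probs)), key=lambda i: -scores[i])[:k]


def select_margin(probs, k):
    """Margin selection: the k samples with the smallest top-two margin."""
    scores = margin_scores(probs)
    return sorted(range(len(probs)), key=lambda i: scores[i])[:k]


# Hypothetical class probabilities for four unlabeled node events.
pool_probs = [
    [0.98, 0.01, 0.01],  # model is confident
    [0.40, 0.35, 0.25],  # spread over all classes
    [0.50, 0.49, 0.01],  # torn between two classes
    [0.70, 0.20, 0.10],
]
by_entropy = select_entropy(pool_probs, 2)  # [1, 3]
by_margin = select_margin(pool_probs, 2)    # [2, 1]
```

The two uncertainty measures disagree on this data: entropy favors the sample whose probability is spread over all classes, while margin favors the sample torn between two classes, which is one reason to treat the choice of selection function as something to evaluate rather than fix in advance.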

Step S5: Seq2Pat association mining. Seq2Pat is a constraint-based sequential pattern mining algorithm whose sequence database representation uses a multi-valued decision diagram (MDD) to encode the item sequences and their attributes compactly by exploiting symmetry. The MDD representation is augmented with constraint-specific information to guarantee or enforce satisfaction of the constraints during mining. This has three benefits. First, constraint satisfaction is performed only once, rather than once per projected database as in prefix-projection algorithms. Second, several constraints can be considered simultaneously, in contrast to iterative approaches that consider each constraint individually and incur a large computational cost. Finally, imposing the constraints results in a smaller MDD and thus reduces the computational requirements of the mining algorithm.

If a constraint is prefix monotone or prefix anti-monotone, it can be imposed directly on the MDD: infeasible extensions under such constraints are prevented by not creating arcs between the corresponding nodes. For a sequence S, the algorithm starts from the node associated with the item at the last position, S[j] with j = |S|, and checks whether this item can be used to extend a pattern ending with any earlier item S[j'], j' < j, of the sequence. An arc (u, v) is created between the corresponding item nodes of the MDD whenever the extension is feasible. The algorithm then moves to the item at position j-1 and repeats the same process. With this construction, each node is connected to all nodes that represent a feasible extension with respect to the imposed constraints. The mining algorithm therefore only needs to search the child nodes of a node u ∈ U to extend any pattern that ends with u. If extending the item at S[j] to the item at S[j'] is infeasible, then extending it to any item S[k] with k ≥ j' is guaranteed to be infeasible as well.

If a constraint is non-monotone, its satisfaction has to be checked for all possible extensions, which is done only once all monotone and anti-monotone constraints are satisfied. The constraint types include: average, which restricts the average of an attribute over all events in the pattern; gap, which restricts the difference between the attribute values of every two consecutive events in the pattern; median, which restricts the median of an attribute over all events in the pattern; and span, which restricts the difference between the maximum and minimum of an attribute over all events in the pattern.
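To make the effect of a gap constraint on label transitions concrete, the following sketch counts two-item patterns (label a followed later by label b) by brute-force enumeration. It is deliberately not the MDD-based Seq2Pat algorithm, only a naive substitute that produces the same kind of output for length-two patterns; the label sequences, timestamps, and parameter names are all assumed for illustration.

```python
from collections import Counter


def mine_transition_pairs(sequences, times, min_frequency=2, max_gap=None):
    """Count ordered label pairs (a, b), with b after a in the same sequence.

    A pair is counted at most once per sequence. If max_gap is given, the
    attribute difference (here: time) between the two events must not
    exceed it, which mimics a gap constraint on a length-two pattern.
    """
    support = Counter()
    for labels, stamps in zip(sequences, times):
        seen = set()
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                if max_gap is not None and stamps[j] - stamps[i] > max_gap:
                    continue
                seen.add((labels[i], labels[j]))
        support.update(seen)
    return {pair: n for pair, n in support.items() if n >= min_frequency}


# Hypothetical labeled node-event sequences with one timestamp per event.
sequences = [
    ["attack", "response", "casualties", "mourning"],
    ["attack", "response", "arrest"],
    ["attack", "casualties", "mourning"],
]
times = [[0, 1, 3, 6], [0, 2, 5], [0, 4, 5]]

unconstrained = mine_transition_pairs(sequences, times)
constrained = mine_transition_pairs(sequences, times, max_gap=3)
# constrained == {("attack", "response"): 2, ("casualties", "mourning"): 2}
```

Without the constraint, four pairs reach the minimum frequency of 2; with a maximum gap of 3, only the two pairs whose events occur close together remain. The surviving pair counts are the kind of input from which transition probabilities between labels can then be estimated.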

Steps S6 and S7: the public opinion events in the constructed event chain graph are stored pairwise as node event pairs, where each event pair represents a possible evolution between nodes of the online public opinion event chain. The prior probability of each node event is calculated, and the evolution probability between the events of each pair is then calculated with the conditional probability formula. Adding the corresponding evolution probabilities between events to the constructed event chain graph yields the event content evolution graph. Alongside the evolution of event content, the evolution of strength over the time slices of an event's life cycle reveals how users' focus of attention gradually shifts. The strength evolution process is obtained by calculating the strength value of the event on each time slice and reflects how attention to the event moves from peak to trough or from trough to peak. The strength evolution graph makes it possible to examine how public attention to the same facet of an event changes over a period of time, as well as the development of the event. The strength of an event is mainly measured by the proportion of the event's occurrences in the corpus text set, which gives the strength of event t on time slice k. The Markov chain is an effective method for predicting the probability that an event occurs. In a Markov process, the past (the historical states before the current stage) is irrelevant for predicting the future (the states after the current stage); the probability distribution of the next state is determined by the current state alone.

For public opinion evolution, first, the development of public opinion can be regarded as a non-stationary time series, and both its time and its states can be divided into discrete steps; second, public opinion evolution is strongly influenced by the current state, i.e. the state at time t+1 depends only on the state at time t and not on earlier states; finally, the transition from one state to another is random. These features satisfy the memoryless (Markov) condition required for applying a Markov chain. Therefore, a Markov chain is used to predict the strength evolution trend of each derived public opinion event. A Markov state transition matrix P is constructed from the evolution probabilities, the state probabilities of future public opinion strength are predicted from the initial state vector and the state transition matrix, and the stationary distribution of the Markov chain is obtained with the n-step transition formula. This steady-state probability reflects the likelihood that the system is in a given state once it has stabilized. An evolution probability map of each node event is presented at the same time.
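The construction of the transition matrix P, the n-step prediction, and the stationary distribution can be sketched as follows. The three strength states and the observed state sequences are assumed toy data, and the handling of states with no observed outgoing transition (treated as staying in place) is an assumption of this sketch rather than something the description specifies.

```python
def transition_matrix(sequences, states):
    """Estimate P[i][j] = P(next = j | current = i) from observed sequences."""
    index = {s: i for i, s in enumerate(states)}
    counts = [[0] * len(states) for _ in states]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[index[a]][index[b]] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Assumption: a state with no observed exit stays where it is.
            matrix.append([1.0 if j == i else 0.0 for j in range(len(states))])
        else:
            matrix.append([c / total for c in row])
    return matrix


def step(dist, matrix):
    """One transition: row vector times matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def n_step(dist, matrix, n):
    """State distribution after n transitions from the initial distribution."""
    for _ in range(n):
        dist = step(dist, matrix)
    return dist


def stationary(matrix, tol=1e-12, max_iter=10_000):
    """Iterate the chain from a uniform start until the distribution settles."""
    dist = [1.0 / len(matrix)] * len(matrix)
    for _ in range(max_iter):
        nxt = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical strength states of one node event over successive time slices.
states = ["low", "medium", "high"]
observed = [
    ["low", "medium", "high", "medium", "low"],
    ["medium", "high", "high", "medium"],
    ["low", "low", "medium"],
]
P = transition_matrix(observed, states)
after_one = n_step([1.0, 0.0, 0.0], P, 1)  # start in "low"
steady = stationary(P)                     # approximately [0.2, 0.4, 0.4]
```

For this toy chain the estimated rows are [1/3, 2/3, 0], [1/3, 0, 2/3], and [0, 2/3, 1/3], and repeated transitions converge to roughly 20% low, 40% medium, and 40% high regardless of the starting state. Convergence of this simple iteration relies on the chain being irreducible and aperiodic, which holds here but is not guaranteed for arbitrary data.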

The number of devices and the scale of the processes described herein are intended to simplify the description of the invention, and applications, modifications and variations of the invention will be apparent to those skilled in the art.

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. An online public opinion correlation deduction prediction analysis method based on an event chain is characterized by comprising the following steps:

S5, finding the number of transition pairs among the labels through Seq2Pat;

2. The method for internet public opinion correlation deduction and prediction analysis based on event chain as claimed in claim 1, wherein the step S1 further comprises data preprocessing of the microblog text, comprising the steps of:

3. The method as claimed in claim 1, wherein the step S2 includes performing cosine similarity calculation on each vectorized blog article, and obtaining node events through One-Pass clustering.

4. The method for internet public opinion correlation deduction prediction analysis based on event chain as claimed in claim 1, wherein the step S3 includes the steps of:

5. The method as claimed in claim 1, wherein in the step S4, the node events are labeled through Active Learning with Pool-based Sampling.

6. The method for internet public opinion correlation deduction and prediction analysis based on event chain as claimed in claim 1, wherein in the step S5, the Seq2Pat constraint-based sequential pattern mining algorithm is used, whose sequence database representation based on a multi-valued decision diagram (MDD) compactly encodes the item sequences and their attributes by exploiting symmetry.

7. The method as claimed in claim 1, wherein in the step S6, a Markov chain model is calculated according to the prior probability and the transition probability of the node event, and an event chain state transition probability tree is generated.

8. The method as claimed in claim 1, wherein in the step S7, a steady state is calculated through the n-step Markov chain, and a probability map of event evolution of each child node is generated.