CN113449508B - Internet public opinion correlation deduction prediction analysis method based on event chain

Internet public opinion correlation deduction prediction analysis method based on event chain

Info

Publication number
CN113449508B
Authority
CN
China
Prior art keywords
event
node
events
model
public opinion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110799240.2A
Other languages
Chinese (zh)
Other versions
CN113449508A (en)
Inventor
李仁德
马皓添
曹春萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110799240.2A priority Critical patent/CN113449508B/en
Publication of CN113449508A publication Critical patent/CN113449508A/en
Application granted granted Critical
Publication of CN113449508B publication Critical patent/CN113449508B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an event chain-based online public opinion correlation deduction and prediction analysis method, which comprises the following steps: extracting related node events from microblog text data according to event development, and forming a topic evolution tree and a public opinion evolution probability map through clustering, association, and pattern-mining-based matching and prediction; pre-training a microblog training-set corpus with an ELMo model and vectorizing microblog posts; merging similar microblog data with a One-Pass clustering algorithm to obtain a node event set; learning from sparsely labeled data through Active Learning, with human-machine interaction improving label quality; and mining transition pairs between labels with Seq2Pat, building a Markov chain from them, forming a topic tree and predicting the evolution trend of node events. The invention addresses the extraction, association and prediction of sub-events within complex public opinion events and fully incorporates an expert-in-the-loop interactive labeling method, making it suitable for locating public opinion deduction analyses and for accurate prediction.

Description

Internet public opinion correlation deduction prediction analysis method based on event chain
Technical Field
The invention relates to the technical field of online public opinion analysis, and in particular to an online public opinion correlation deduction and prediction analysis method based on an event chain.
Background
Microblog public opinion is a new form of internet public opinion expression and has become an uncertain factor affecting national security and social stability. Studying how the microblog discussion content of a breaking public opinion event changes, and mining the evolution rules of public opinion during the event's development, has practical significance for assisting the guidance of online public opinion and for predicting the macroscopic development direction of similar events. However, how to effectively organize massive amounts of public opinion information has been a research focus in recent years. Traditional online public opinion analysis focuses only on the content evolution of the overall public opinion in social media and ignores the fine-grained strength evolution of an event, so users find it difficult to capture how the different facets of an event evolve. Moreover, the interpretation of public opinion evolution is limited in the face of increasingly huge volumes of data. Under these circumstances, how to discover and track the generation and evolution of events in massive data and mine their trends has become an important problem in public opinion analysis.
Invention patent CN201910120187.1 provides a method for predicting the evolution result of internet public opinion: it obtains the information of all individuals involved in spreading the public opinion and constructs a directed weighted network from that information; it extracts each individual's initial opinion and, from the initial opinions and the directed weighted network, computes a prediction of how the online public opinion evolves over time. Invention patent CN201910452142.4 provides a continuous-Markov-based two-layer network public opinion information propagation prediction method, which can predict the public opinion situation from the nodes of an online and an offline network layer. Invention patent CN201610096775.2 provides a method for analyzing and predicting online public opinion based on LDA topic models, obtaining from the training results the trend of each topic's intensity over time and thereby realizing dynamic analysis and prediction of online public opinion. Invention patent CN202010668147.3 discloses a method, device and equipment for predicting the stabilization of public opinion propagation, used to accurately judge when public opinion propagation stabilizes in a target social platform.
These patents do not analyze in depth the connections between topic contents and their development process, so a user cannot clearly grasp the main content and the evolution of an event. The basic idea should be to extract the content of the event at its different development stages and present it to the user in time order. The key questions, however, are which events to extract as the most important ones, whether the events have an evolutionary relationship, and which method can predict the situation trend of similar online public opinion; traditional evolution research has not answered these in depth.
Disclosure of Invention
In view of the deficiencies in the prior art, the invention aims to provide an event chain-based online public opinion correlation deduction and prediction analysis method, which addresses the extraction, association and prediction of sub-events within complex public opinion events, fully incorporates an expert-in-the-loop human-machine interactive labeling method, and is suitable for locating public opinion deduction analyses and for accurate prediction. To achieve the above and other advantages, and in accordance with the purpose of the invention, an event chain-based internet public opinion association deduction and prediction analysis method is provided, comprising:
S1, pre-training a microblog training-set corpus through an ELMo model to obtain a pre-training model, and then performing word vector processing on the test set;
S2, merging similar microblog data through a One-Pass clustering algorithm to obtain a node event set;
S3, vectorizing the keywords of the node event set of step S2 with the pre-training model again to obtain a vectorized representation of the node set;
S4, performing few-label data learning through Active Learning, and labeling each node;
S5, mining transition pairs between labels through Seq2Pat;
S6, establishing associations among node events through a Markov chain, and generating an association tree;
S7, predicting the public opinion development trend of each node event through the n-step Markov chain.
Preferably, step S1 further includes data preprocessing of the microblog text, comprising the following steps:
S11, extracting keywords from the segmented word sets of the blog posts through the TextRank algorithm to obtain a blog-post keyword vocabulary, and building it into the ELMo model;
S12, using a bidirectional LSTM language model, concatenating the ELMo representation directly, as a feature, to the word-vector input or to the top-layer representation of the task-specific model, and vectorizing the words in the test corpus.
Preferably, step S2 includes computing the cosine similarity between the vectorized blog posts and obtaining the node events through One-Pass clustering.
Preferably, step S3 includes the steps of:
S31, performing word vectorization again through the pre-training model obtained in step S1;
S32, performing a word sense disambiguation task: using the SemCor 3.0 corpus, computing vector representations of all words in the corpus with the biLMs;
S33, determining the most likely sense position of each word with a 1-nearest-neighbor method and vectorizing it accordingly.
Preferably, in step S4, the node events are labeled through pool-based sampling Active Learning.
Preferably, in step S5, the Seq2Pat constraint-based sequential pattern mining algorithm is used; it represents the sequence database as a multi-valued decision diagram (MDD) and compactly encodes the item sequences and their attributes by exploiting symmetry.
Preferably, in step S6, a Markov chain model is computed from the prior probabilities and transition probabilities of the node events, and an event-chain state transition probability tree is generated.
Preferably, in step S7, the steady state is computed through the n-step Markov chain, and an event evolution probability map is generated for each child node.
Compared with the prior art, the invention has the following beneficial effects. Words are vectorized with the ELMo model and similarity is computed between the vectorized blog posts through One-Pass clustering; the resulting node events express the core meaning of the derived public opinion events and can serve as summaries that describe which facets or analysis angles the event comprises, different facets corresponding to different points of public attention; the node events are vectorized, and the keyword set of each node event cluster, combined with the text content of the original data, gives a general summary of the public opinion event. For human-machine interactive labeling, node events are labeled through pool-based sampling Active Learning; public opinion events are stored in the constructed event chain graph as pairwise combinations of node events, each event pair representing a possible evolution between nodes of the online public opinion event chain; by computing the strength values of an event on different time slices, the strength evolution process reflects how the attention paid to the event changes from climax to low tide or from low tide to climax, and a Markov model is adopted to predict the strength evolution trend of each derived public opinion event. A Markov state transition matrix P is constructed from the evolution probabilities; the future state probabilities of public opinion strength are predicted from the initial state vector and the state transition matrix, and the stationary distribution of the Markov chain is obtained with the n-step transition formula. This steady-state probability reflects the likelihood that the system is in a given state once it has stabilized. An evolution probability chart is presented for each node event.
Drawings
Fig. 1 is a flowchart illustrating an event chain-based internet public opinion association deduction prediction analysis method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an online public opinion association deduction and prediction analysis method based on an event chain includes: S1, pre-training a microblog training-set corpus through an ELMo model to obtain a pre-training model, and then performing word vector processing on the test set;
S2, merging similar microblog data through a One-Pass clustering algorithm to obtain a node event set;
S3, vectorizing the keywords of the node event set of step S2 with the pre-training model again to obtain a vectorized representation of the node set;
S4, performing few-label data learning through Active Learning, and labeling each node;
S5, mining transition pairs between labels through Seq2Pat;
S6, establishing associations among node events through a Markov chain, and generating an association tree;
S7, predicting the public opinion development trend of each node event through the n-step Markov chain.
Further, step S1 also includes data preprocessing of the microblog text, comprising the following steps:
S11, extracting keywords from the segmented word sets of the microblog posts through the TextRank algorithm to obtain a microblog keyword vocabulary, and building it into the ELMo model;
S12, using a bidirectional LSTM language model, concatenating the ELMo representation directly, as a feature, to the word-vector input or to the top-layer representation of the task-specific model, and vectorizing the words in the test corpus.
Further, step S2 includes computing the cosine similarity between the vectorized blog posts and obtaining the node events through One-Pass clustering.
Further, step S3 includes the steps of:
S31, performing word vectorization again through the pre-training model obtained in step S1;
S32, performing a word sense disambiguation task: using the SemCor 3.0 corpus, computing vector representations of all words in the corpus with the biLMs;
S33, determining the most likely sense position of each word with a 1-nearest-neighbor method and vectorizing it accordingly.
Further, in step S4, the node events are labeled through pool-based sampling Active Learning.
Further, in step S5, the Seq2Pat constraint-based sequential pattern mining algorithm is used; it represents the sequence database as a multi-valued decision diagram (MDD) and compactly encodes the item sequences and their attributes by exploiting symmetry.
Further, in step S6, a Markov chain model is computed from the prior probabilities and transition probabilities of the node events, and an event-chain state transition probability tree is generated.
Further, in step S7, the steady state is computed through the n-step Markov chain, and an event evolution probability map is generated for each child node.
An exemplary embodiment is as follows:
step S1 ELMO model word vectorization. And (1) text preprocessing. The text is further processed on the basis of the data preprocessing. Which contains a number of special symbols, expressions, links, etc. that are not related to the event. Such as frequently appearing words "forward", "microblog", "" @ user ", and flagging symbols, etc., interference may be caused to the subsequent text analysis if not removed. The text content of the # topic # "visually expresses a topic content, and therefore needs to be preserved. And performing word segmentation on the processed microblog text content, wherein a jieba word segmentation tool is used. And removing stop words, analogous words and special symbols from the microblog text after word segmentation. Words such as 'earth' and 'o' and useless punctuation marks are obtained to obtain each blog word collection; and (2) extracting keywords. In order to quickly acquire the core content of the text, the theme of the text is highly condensed, and firstly, keywords are extracted from the text. Extracting keywords from the word collections of the blogs after word segmentation by a TextRank algorithm to obtain word collections of the blogs keywords; (3) A bidirectional LSTM language model is constructed in an EMLo model and consists of a forward language model and a backward language model, each layer of LSTM cell has 4096 cells and 512-dimensional mapping, and an objective function is the maximum likelihood of the two direction language models. After the language model is pre-trained, each intermediate layer of the bi-directional language model is summed, using the representation of the highest layer as ELMo. (4) When a supervised NLP task is carried out, ELMo is directly spliced to word vector input of a specific task model or the highest-layer representation of the model as a feature, and words in a test corpus are vectorized.
Step S2, One-Pass clustering. Similarity is computed between the vectorized blog posts. (1) At the start, a new blog post is read from the data set; (2) a new cluster is built from this post; (3) if the end of the data set has been reached, go to (6); otherwise a new post is read in, its distance to each existing cluster is computed, and the cluster closest to it is selected; (4) if the minimum distance exceeds a given threshold, go to (2); (5) otherwise the post is merged into that cluster, the cluster center is updated, and the procedure goes to (3); (6) end. In this algorithm the cosine distance formula is used to compute the distance between a post and a cluster center. An online public opinion event corresponds to several different stages and related derived public opinion events; these are recorded as node events, and the node events corresponding to the development stages of the online public opinion event are summarized from the extracted public opinion event clusters. For example, the September 11 terrorist attack includes development stages such as "terrorist attack occurs", "government emergency response", "casualty statistics", "mourning the victims" and "suspect arrested", and these are node events. A node event expresses the core meaning of a derived public opinion event and can serve as a summary of the event, helping to describe which facets or analysis angles the event comprises; different facets also correspond to different points of public attention.
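A minimal sketch of the single-pass (One-Pass) clustering step described above, using cosine similarity against running cluster centroids; the similarity threshold and the merge direction (similarity above a threshold rather than distance below one) are assumptions for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def one_pass_cluster(vectors, threshold=0.6):
    """Single-pass clustering: each post joins the most similar existing cluster
    if the similarity reaches the threshold, otherwise it starts a new cluster."""
    centroids = []   # running centroid per cluster
    members = []     # indices of the posts in each cluster
    for i, v in enumerate(vectors):
        v = np.asarray(v, dtype=float)
        if centroids:
            sims = [cosine_similarity(v, c) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                members[best].append(i)
                # update the cluster center as the mean of its member vectors
                centroids[best] = np.mean([np.asarray(vectors[j], dtype=float)
                                           for j in members[best]], axis=0)
                continue
        centroids.append(v)
        members.append([i])
    return members   # each inner list is one node-event cluster


posts = np.random.rand(20, 50)   # stand-in for ELMo post vectors
print(one_pass_cluster(posts, 0.9))
```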
Step S3, vectorizing the node events. According to the clustering of S2, each node event contains several blog posts; the posts within each cluster are collected and their union is taken to obtain a comprehensive representation of the node event. Text clustering reduces the massive set of public opinion events to a humanly interpretable order of magnitude while yielding mutually independent node events. Treating the representation of a node event as a simple bag of words is inconvenient for evolution analysis, so the keyword set of each node event cluster is combined with the text content of the original data to describe the summary of the public opinion event. For each node event, word vectorization is performed again with the pre-training model obtained in step S1. A word sense disambiguation task is then carried out using the SemCor 3.0 corpus, a sense-annotated corpus in which each word is mapped to a WordNet position. The computation over this corpus first uses the biLMs to compute the vector representations of all words in the corpus; the vectors of words located at the same WordNet position are then averaged. At test time, after the biLM produces a representation for a given target word in the target blog post, the most likely sense position of the word is determined with a 1-nearest-neighbor method against the per-position vectors obtained during training.
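The 1-nearest-neighbor sense assignment described above might look roughly like the following sketch; the data layout (triples of word, sense id and context vector) and the use of cosine similarity are assumptions, not the patent's exact procedure.

```python
import numpy as np
from collections import defaultdict


def build_sense_vectors(annotated_tokens):
    """annotated_tokens: iterable of (word, sense_id, context_vector) triples, e.g. biLM
    vectors computed over a sense-annotated corpus such as SemCor. Returns one mean
    vector per (word, sense) pair."""
    buckets = defaultdict(list)
    for word, sense_id, vec in annotated_tokens:
        buckets[(word, sense_id)].append(np.asarray(vec, dtype=float))
    return {key: np.mean(vectors, axis=0) for key, vectors in buckets.items()}


def assign_sense(word, context_vec, sense_vectors):
    """1-nearest-neighbour sense assignment: pick the sense whose mean vector is
    closest (by cosine similarity) to the target word's context vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    candidates = {sense: vec for (w, sense), vec in sense_vectors.items() if w == word}
    if not candidates:
        return None   # unseen word: no sense information available
    return max(candidates, key=lambda sense: cos(candidates[sense], context_vec))
```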
Step S4, human-machine interactive labeling. Node events are labeled through pool-based sampling Active Learning. (1) The data are divided into a pool and a test set; (2) k samples are selected from the pool for initial training and labeled, and the remaining data become the validation set; (3) all sets are normalized; (4) the model is trained with balanced class weights on the training set; (5) the trained model is applied to the validation set to obtain a probability for each sample; (6) the trained model is applied to the test set to obtain performance indicators; (7) according to the probability of each sample, the k most informative samples, i.e., those the model is most uncertain about, are selected; (8) these k samples are moved from the validation set to the training set and their labels are queried; (9) the normalization of all data sets is reversed; (10) the procedure stops when a stopping criterion is met, otherwise it goes to (3). During labeling, the following points need attention: (1) the fully supervised performance of the selected algorithm is usually an upper bound, and several algorithms such as Support Vector Machines (SVM) with linear kernels, Random Forests (RF) and logistic regression (LOG) should be tried; (2) after samples are removed from the validation set, the normalization of all sets must be reversed and applied again, because the sample distribution changes in both the new validation set and the new training set; (3) the sample selection function relies on sample probabilities produced by the trained model, so only algorithms that expose sample probabilities can be used; (4) k is a hyper-parameter. Known selection functions include: a. random selection, where k random samples are selected from the validation set; b. entropy selection, where the k samples with the highest entropy are selected; c. margin selection, where the k samples with the smallest difference between the two highest class probabilities are selected; samples the model confidently assigns to a single class have a large margin and are therefore not chosen.
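One round of the pool-based sampling loop described above could be sketched as follows; the choice of scikit-learn's logistic regression, the margin-based scoring and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def margin_scores(proba):
    """Margin = difference between the two highest class probabilities; a small
    margin means the model is uncertain about the sample."""
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]


def active_learning_round(X_train, y_train, X_pool, k=10):
    """One round of pool-based sampling: normalize, train with balanced class weights,
    score the pool and return the indices of the k least certain samples to label next."""
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    proba = model.predict_proba(scaler.transform(X_pool))
    return np.argsort(margin_scores(proba))[:k]   # smallest margins first
```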
Step S5, Seq2Pat association mining. Seq2Pat is a constraint-based sequential pattern mining algorithm that represents the sequence database as a multi-valued decision diagram (MDD), compactly encoding item sequences and their attributes by exploiting symmetry. The MDD representation is augmented with constraint-specific information to ensure or enforce that constraints are satisfied during mining. First, constraint satisfaction is checked only once rather than once per projected database, as in prefix-projection algorithms. Second, several constraints can be considered simultaneously, in contrast to iterative approaches that consider each constraint individually at a large computational cost. Finally, imposing constraints yields a smaller MDD and thus reduces the computational requirements of the mining algorithm. If a constraint is prefix monotone or prefix anti-monotone, it can be applied directly to the MDD: infeasible extensions are prevented by not creating arcs between the corresponding nodes. For a sequence S, the algorithm starts from the node associated with the item S[j] at position j = |S| and checks whether this item can extend a pattern that ends with any earlier item S[j'], j' < j, of the sequence. As long as the extension is feasible, an arc (u, v) is created between the corresponding item nodes of the MDD. The algorithm then moves to the item at position j-1 and repeats the same process. With this construction, each node is connected to all nodes that represent feasible extensions with respect to the imposed constraints, so the mining algorithm only needs to search the children of a node u to extend any pattern ending at u. If extending the item at S[j] to the item at S[j'] is not feasible, it is guaranteed that extending it to any item S[k], k ≥ j', is also not feasible. If a constraint is non-monotone, its satisfaction has to be checked against all possible extensions, which can only be done once all monotone and anti-monotone constraints are satisfied. The constraint types include: average, which constrains the average of an attribute over all events in the pattern; gap, which constrains the difference between the attribute values of every two consecutive events in the pattern; median, which constrains the median of an attribute over all events in the pattern; and span, which constrains the difference between the maximum and minimum of an attribute over all events in the pattern.
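The following sketch shows how the open-source Seq2Pat library is typically used for constraint-based sequential pattern mining of the kind described above; the event labels, attribute values and constraint bounds are hypothetical, and the API names follow the library's public documentation and should be verified against the installed version.

```python
from sequential.seq2pat import Seq2Pat, Attribute

# Each sequence is the ordered list of node-event labels of one public-opinion thread
# (labels here are invented for illustration).
sequences = [["outbreak", "gov_response", "casualty_stats", "mourning"],
             ["outbreak", "casualty_stats", "mourning"],
             ["outbreak", "gov_response", "suspect_arrest"]]

# One attribute value per item, e.g. the time-slice index at which each node event occurred.
time_slice = Attribute(values=[[1, 2, 3, 5],
                               [1, 3, 6],
                               [1, 2, 4]])

seq2pat = Seq2Pat(sequences=sequences)
seq2pat.add_constraint(0 <= time_slice.gap() <= 3)    # consecutive events at most 3 slices apart
seq2pat.add_constraint(0 <= time_slice.span() <= 6)   # whole pattern confined to 6 slices

# Patterns (with their frequencies) that occur in at least two sequences.
patterns = seq2pat.get_patterns(min_frequency=2)
print(patterns)
```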
Steps S6-S7. The public opinion events are stored in the constructed event chain graph as pairwise combinations of node events, each event pair representing a possible evolution between nodes of the online public opinion event chain. The prior probability of each node event is computed, and then the evolution probability between event pairs is computed with the conditional probability formula. Adding the corresponding evolution probabilities between events to the constructed event chain graph yields the event content evolution graph. Alongside the evolution of event content, the evolution of strength over the different time slices of the event's life cycle reveals how user attention gradually shifts. By computing the strength value of the event on different time slices, the strength evolution process reflects how the attention paid to the event changes from climax to low tide or from low tide to climax. From the strength evolution graph, the change of public attention to the same aspect of an event over a period of time and the development process of the event can be examined. The strength of event t on time slice k is measured mainly by the proportion of texts about the event in the corpus for that slice, and a Markov chain is an effective method for predicting the probability of the event's states. In this process the past (the historical states before the current date) is irrelevant to predicting the future (the states after the current date), i.e., the probability distribution of the next state is determined only by the current state. For public opinion evolution, first, the development of public opinion can be regarded as a non-stationary time series, and the division of time and states can be described as a discrete process; second, public opinion evolution is strongly influenced by the current state, i.e., the state at time t+1 is related only to the state at time t and is unrelated to past states; finally, the transition from one state to another is random. These properties satisfy exactly the memoryless (no-after-effect) condition required for applying a Markov model. Therefore, a Markov model is adopted to predict the strength evolution trend of each derived public opinion event. A Markov state transition matrix P is constructed from the evolution probabilities; the future state probabilities of public opinion strength are predicted from the initial state vector and the state transition matrix, and the stationary distribution of the Markov chain is obtained with the n-step transition formula. This steady-state probability reflects the likelihood that the system is in a given state once it has stabilized. An evolution probability chart is presented for each node event.
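As a worked illustration of steps S6-S7, the sketch below estimates a first-order transition matrix from node-event label sequences and applies the n-step transition formula; the state names and example sequences are hypothetical.

```python
import numpy as np


def transition_matrix(label_sequences, states):
    """Estimate first-order transition probabilities between node-event strength states
    from observed label sequences; states with no observed exits are made absorbing."""
    index = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for seq in label_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[index[a], index[b]] += 1
    for i in range(len(states)):
        if counts[i].sum() == 0:
            counts[i, i] = 1.0
    return counts / counts.sum(axis=1, keepdims=True)


def n_step_distribution(p0, P, n):
    """State distribution after n steps: p0 @ P^n (the n-step transition formula)."""
    return p0 @ np.linalg.matrix_power(P, n)


states = ["rising", "climax", "declining", "dormant"]        # illustrative strength states
observed = [["rising", "climax", "declining", "dormant"],
            ["rising", "climax", "climax", "declining"]]
P = transition_matrix(observed, states)
p0 = np.array([1.0, 0.0, 0.0, 0.0])                          # event currently in the "rising" state
print(n_step_distribution(p0, P, 50))                        # approaches the steady-state distribution
```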
The number of devices and the scale of the processes described herein are intended to simplify the description of the invention, and applications, modifications and variations of the invention will be apparent to those skilled in the art. While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims (8)

1. An online public opinion correlation deduction prediction analysis method based on an event chain, characterized by comprising the following steps:
S1, pre-training a microblog training-set corpus through an ELMo model to obtain a pre-training model, and then performing word vector processing on the test set;
S2, merging similar microblog data through a One-Pass clustering algorithm to obtain a node event set;
S3, vectorizing the keywords of the node event set of step S2 with the pre-training model again to obtain a vectorized representation of the node set;
S4, performing few-label data learning through Active Learning, and labeling each node;
S5, mining transition pairs between labels through Seq2Pat, wherein Seq2Pat is a constraint-based sequential pattern mining algorithm that represents the sequence database as a multi-valued decision diagram (MDD) and compactly encodes item sequences and their attributes by exploiting symmetry, the MDD representation being extended with constraint-specific information; if a constraint is prefix monotone or prefix anti-monotone, it can be applied directly to the MDD; for a sequence S, the algorithm starts from the node associated with the item S[j] at position j = |S| and checks whether this item can extend a pattern ending with any earlier item S[j'], j' < j, of the sequence; as long as the extension is feasible, an arc (u, v) is created between the corresponding item nodes of the MDD; the algorithm then moves to the item at position j-1 and repeats the same process; with this construction, each node is connected to all nodes representing feasible extensions with respect to the imposed constraints; if extending the item at S[j] to the item at S[j'] is not feasible, it is guaranteed that extending it to any item S[k], k ≥ j', is also not feasible;
S6, establishing associations among node events through a Markov chain, and generating an association tree;
S7, predicting the public opinion development trend of each node event through the n-step Markov chain.
2. The method for internet public opinion correlation deduction prediction analysis based on an event chain as claimed in claim 1, wherein step S1 further comprises performing data preprocessing on the microblog text, comprising the steps of:
S11, extracting keywords from the segmented word sets of the blog posts through the TextRank algorithm to obtain a blog-post keyword vocabulary, and building it into the ELMo model;
S12, using a bidirectional LSTM language model, concatenating the ELMo representation directly, as a feature, to the word-vector input or to the top-layer representation of the task-specific model, and vectorizing the words in the test corpus.
3. The method as claimed in claim 1, wherein the step S2 includes performing cosine similarity calculation on each vectorized blog article, and obtaining node events through One-Pass clustering.
4. The method for internet public opinion correlation deduction prediction analysis based on an event chain as claimed in claim 1, wherein step S3 comprises the steps of:
S31, performing word vectorization again through the pre-training model obtained in step S1;
S32, performing a word sense disambiguation task: using the SemCor 3.0 corpus, computing vector representations of all words in the corpus with the biLMs;
S33, determining the most likely sense position of each word with a 1-nearest-neighbor method and vectorizing it accordingly.
5. The method as claimed in claim 1, wherein in step S4 the node events are labeled through pool-based sampling Active Learning.
6. The method as claimed in claim 1, wherein in step S5 the Seq2Pat constraint-based sequential pattern mining algorithm is used, which represents the sequence database as a multi-valued decision diagram (MDD) and compactly encodes item sequences and their attributes by exploiting symmetry.
7. The method as claimed in claim 1, wherein in step S6 a Markov chain model is computed from the prior probabilities and transition probabilities of the node events, and an event-chain state transition probability tree is generated.
8. The method as claimed in claim 1, wherein in step S7 the steady state is computed through the n-step Markov chain, and an event evolution probability map is generated for each child node.
CN202110799240.2A 2021-07-15 2021-07-15 Internet public opinion correlation deduction prediction analysis method based on event chain Active CN113449508B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110799240.2A CN113449508B (en) 2021-07-15 2021-07-15 Internet public opinion correlation deduction prediction analysis method based on event chain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110799240.2A CN113449508B (en) 2021-07-15 2021-07-15 Internet public opinion correlation deduction prediction analysis method based on event chain

Publications (2)

Publication Number Publication Date
CN113449508A CN113449508A (en) 2021-09-28
CN113449508B true CN113449508B (en) 2023-01-17

Family

ID=77816212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110799240.2A Active CN113449508B (en) 2021-07-15 2021-07-15 Internet public opinion correlation deduction prediction analysis method based on event chain

Country Status (1)

Country Link
CN (1) CN113449508B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648025B (en) * 2022-05-18 2022-08-09 国网浙江省电力有限公司信息通信分公司 Power grid data processing method and system based on multi-dimensional evolution diagram in power field
CN115422948B (en) * 2022-11-04 2023-01-24 文灵科技(北京)有限公司 Event level network identification system and method based on semantic analysis

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038178B (en) * 2016-08-03 2020-07-21 平安科技(深圳)有限公司 Public opinion analysis method and device
CN106959944A (en) * 2017-02-14 2017-07-18 中国电子科技集团公司第二十八研究所 A kind of Event Distillation method and system based on Chinese syntax rule
CN107392392A (en) * 2017-08-17 2017-11-24 中国科学技术大学苏州研究院 Microblogging forwarding Forecasting Methodology based on deep learning
CN108228569B (en) * 2018-01-30 2020-04-10 武汉理工大学 Chinese microblog emotion analysis method based on collaborative learning under loose condition
CN109582785A (en) * 2018-10-31 2019-04-05 天津大学 Emergency event public sentiment evolution analysis method based on text vector and machine learning
CN109684646A (en) * 2019-01-15 2019-04-26 江苏大学 A kind of microblog topic sentiment analysis method based on topic influence
CN110929145B (en) * 2019-10-17 2023-07-21 平安科技(深圳)有限公司 Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN111160005B (en) * 2019-11-25 2022-06-24 国家计算机网络与信息安全管理中心 Event prediction method and device based on event evolution knowledge ontology and terminal equipment

Also Published As

Publication number Publication date
CN113449508A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
Dahouda et al. A deep-learned embedding technique for categorical features encoding
Huq et al. Sentiment analysis on Twitter data using KNN and SVM
Vijayaraghavan et al. Fake news detection with different models
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN113449508B (en) Internet public opinion correlation deduction prediction analysis method based on event chain
Li et al. A novel locality-sensitive hashing relational graph matching network for semantic textual similarity measurement
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
Balli et al. Sentimental analysis of Twitter users from Turkish content with natural language processing
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
Mitroi et al. Sentiment analysis using topic-document embeddings
CN114881173A (en) Resume classification method and device based on self-attention mechanism
Jayakody et al. Sentiment analysis on product reviews on twitter using Machine Learning Approaches
Başarslan et al. Sentiment analysis with ensemble and machine learning methods in multi-domain datasets
Dehghan et al. An improvement in the quality of expert finding in community question answering networks
Feifei et al. Bert-based Siamese network for semantic similarity
CN111552816B (en) Dynamic cognitive semantic matching method for big data text mining
Bakhti et al. A New Scheme for Citation Classification based on Convolutional Neural Networks.
Utami Sentiment Analysis of Hotel User Review using RNN Algorithm
Ganesh et al. Deep learning based long short term memory model for emotions with intensity level sentiment classification for twitter texts
Fan et al. News Recommendation Algorithm Based on Multiple Perspectives
Oshadi et al. AppGuider: Feature Comparison System using Neural Network with FastText and Aspect-based Sentiment Analysis on Play Store User Reviews
Suryachandra et al. Machine learning approach to classify the sentiment value of natural language processing in Telugu data
Haque et al. Sentiment analysis in low-resource bangla text using active learning
Shanthi et al. A satin optimized dynamic learning model (sodlm) for sentiment analysis using opinion mining
Riyanto et al. Plant-Disease Relation Model through BERT-BiLSTM-CRF Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant