
CN106844765B - Significant information detection method and device based on convolutional neural network

Info

Publication number: CN106844765B
Application number: CN201710098500.7A
Authority: CN
Inventors: 谭铁牛; 王亮; 吴书; 余峰; 刘强
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2017-02-22
Filing date: 2017-02-22
Publication date: 2019-12-20
Anticipated expiration: 2037-02-22
Also published as: CN106844765A

Abstract

The invention discloses a significant information detection method and device based on a convolutional neural network. The method comprises the following steps: for the crawled data set, determining the time distribution of each event development stage and determining time nodes; for each event, dividing all event information corresponding to the event sample into a plurality of parts according to the determined time node, splicing text contents of the event information in each time phase into a paragraph, and generating a paragraph data set; learning an unsupervised expression vector for each paragraph in the paragraph dataset according to a distributed expression algorithm for the paragraph; for an event, inputting the unsupervised expression vector of each paragraph into a deep convolutional neural network model, obtaining the expression from the low layer to the high layer of each stage of the event by utilizing multilayer convolutional operation, extracting the key features of each stage of the event through k maximum pooling operation, and finally classifying the input information through a full connection layer.

Description

Significant information detection method and device based on convolutional neural network

Technical Field

The invention relates to the technical field of computer processing, in particular to a significant information detection method and device based on a convolutional neural network.

Background

Social media networks have developed rapidly, are widely used, and are easy to access. On the one hand, they bring great convenience to users' lives and enrich their experience; on the other hand, the spread of unreal information on social media can disturb people's normal lives, mislead public opinion, and endanger public safety and social stability. Identifying unreal information among the vast amount of social media content is therefore becoming increasingly important and urgent, and early detection of unreal information is of particular practical value.

Existing methods for identifying unreal information rely mainly on feature engineering. The hand-crafted features are derived from the following aspects: user credibility, microblog-level content, event-level content, and aggregation from the microblog level to the event level. They can be roughly divided into several categories, such as conflicting viewpoints within microblogs, the change in the number of microblog forwards over time, microblog replies, and signal microblogs that express doubt. However, methods based on hand-crafted features generalize poorly to new situations; social media is dynamic, variable, and complex, so new situations constantly arise for which hand-crafted features are difficult to design.

The CSID model can detect significant information from user-generated content and its generation time on social media, including but not limited to the identification and early detection of rumor information. Generally, a microblog event includes thousands of related microblogs, and microblog popularity varies greatly. First, the temporal characteristics of unreal information and real information are counted on a data set; these characteristics refer to the power-law distribution of microblogs over time. The model then groups the microblogs related to an event according to the corresponding temporal characteristics. For the different groups of microblog texts, the model introduces a representation learning method and learns the expression of each group using a distributed paragraph expression learning algorithm (paragraph vector). Finally, a deep convolutional neural network models the high-order interactions among the groups of microblogs, learns from low-order features to high-order features, learns the latent representations of each stage of the event, and extracts important factors. Based on these latent representations, the model forms the final representation of the event, which contributes to both the identification and the early detection of unreal information.

Disclosure of Invention

In view of the technical defects of the traditional artificial feature-based method, the invention provides a significant information detection method and device based on a convolutional neural network in order to better detect the information reliability.

According to an aspect of the present invention, a significant information detection method based on a convolutional neural network is provided, which includes the following steps:

step S1, for the crawled data set comprising a plurality of event information, determining the time distribution of each stage of the development of each event corresponding to the event information in the data set, and determining the time nodes corresponding to each time period; the event information in the data set comprises unreal event information and real event information, the event information corresponds to a plurality of events, and each event corresponds to a plurality of unreal event information or a plurality of real event information;

step S2, for each event, dividing all event information corresponding to the event sample into a plurality of parts according to the determined time node, splicing the text content of the event information in each time phase into a paragraph, and generating a paragraph data set;

step S3, learning an unsupervised expression vector of each paragraph in the paragraph data set according to a distributed expression algorithm of the paragraphs;

step S4, for an event, inputting the unsupervised expression vector of each paragraph into a deep convolutional neural network model, obtaining the expression from the bottom layer to the top layer of each stage of the event by utilizing multilayer convolutional operation, fully extracting the key features of each stage of the event through the k-max pooling operation, and finally classifying the input information through a full connection layer; after the deep convolutional neural network model is trained in the step S4 by using all events, a significant information detection model is obtained;

and step S5, classifying and detecting the information to be detected by using the significant information detection model.

Step S1 includes:

determining time stamps of all event information corresponding to the events;

for each event, sequencing the timestamps according to the time sequence;

equally dividing the time corresponding to the earliest time stamp and the latest time stamp into a plurality of time periods;

and determining time nodes corresponding to the multiple time periods.

Step S2 includes:

for each event, dividing the event information corresponding to the event into different time periods according to the time periods determined in step S1 and the time stamp of the event information corresponding to the event;

and splicing the text contents of the event information in each time period into a paragraph to obtain a plurality of paragraphs corresponding to the time periods to form a paragraph data set.

Step S3 includes:

treating the paragraph data set as a corpus, and learning an unsupervised expression vector of each paragraph by using unsupervised distributed expression learning algorithms for words and paragraphs at the word level and the paragraph level, respectively.

Step S4 includes:

for each event, splicing the unsupervised vector expressions of all paragraphs into a matrix;

and inputting the matrix into a deep convolution neural network model for training.
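The splicing step can be sketched as follows. This is a minimal illustration, assuming each paragraph vector is a one-dimensional NumPy array of the same length d and that vectors are placed column by column; the function name `stack_paragraph_vectors` and the sample values are hypothetical.

```python
import numpy as np

def stack_paragraph_vectors(paragraph_vectors):
    """Place each d-dimensional paragraph vector in one column of a d x n matrix."""
    return np.column_stack(paragraph_vectors)

# Two illustrative 3-dimensional paragraph vectors (made-up values).
vectors = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]
P = stack_paragraph_vectors(vectors)  # shape (d, n) = (3, 2)
```

With this layout, column j of the matrix holds the vector of the j-th time period, so a convolution that slides over columns slides over time.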

According to a second aspect of the present invention, there is provided a significant information detection apparatus based on a convolutional neural network, comprising:

the time node determining module is configured to determine time distribution of each stage of development of each event corresponding to the event information in the data set and determine time nodes corresponding to each time period for the crawled data set comprising a plurality of event information; the event information in the data set comprises unreal event information and real event information, the event information corresponds to a plurality of events, and each event corresponds to a plurality of unreal event information or a plurality of real event information;

the paragraph generation module is configured to divide all event information corresponding to the event sample into a plurality of parts according to the determined time node for each event, and splice the text content of the event information in each time phase into a paragraph to generate a paragraph data set;

a vector generation module configured to learn an unsupervised expression vector for each paragraph in the paragraph dataset according to a distributed expression algorithm for the paragraph;

the model training module is configured to input the unsupervised expression vector of each paragraph into a deep convolutional neural network model for an event, obtain the expression from the bottom layer to the high layer of each stage of the event by utilizing multilayer convolutional operation, fully extract the key features of each stage of the event through the k-max pooling operation, and finally classify the input information through a full connection layer; after the deep convolutional neural network model is trained by using all events, a significant information detection model is obtained;

and the detection module is configured to utilize the significant information detection model to classify and detect the information to be detected.

The time node determination module includes:

the first determining submodule is configured to determine time stamps of all event information corresponding to the events;

a sorting submodule configured to sort the timestamps in chronological order for each event;

an equal-division submodule configured to equally divide the time corresponding to the earliest time stamp and the latest time stamp into a plurality of time periods;

a second determining submodule configured to determine time nodes corresponding to the plurality of time periods.

The paragraph generation module includes:

the time period dividing submodule is configured to divide the event information corresponding to each event into different time periods according to the multiple determined time periods and the timestamp of the event information corresponding to the event;

and the paragraph generation submodule is configured to splice the text content of the event information in each time period into a paragraph, obtain a plurality of paragraphs corresponding to the plurality of time periods, and form a paragraph data set.

The vector generation module comprises:

and the unsupervised learning sub-module is configured to regard the paragraph data set as a corpus, and learn to obtain an unsupervised expression vector of each paragraph by using a distributed expression learning algorithm of unsupervised words and paragraphs on a word level and a paragraph level respectively.

The model training module comprises:

a splicing submodule configured to splice the unsupervised vector expressions of all paragraphs into one matrix for each event;

a training submodule configured to input the matrix to a deep convolutional neural network model for training.

Drawings

FIG. 1 is a schematic diagram of a significant information detection model CSID based on a convolutional neural network in the present invention;

FIG. 2 is a power-law distribution diagram of unreal information and real information on a microblog data set in the invention;

FIG. 3 is a schematic diagram illustrating comparison of early detection effects on a microblog data set by different comparison methods.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

The invention discloses a training method for a Convolutional neural network-based Significant Information Detection model (CSID), which can be used for unreal information identification and early detection tasks in a social media network. The model can learn a holistic representation of events that contain numbers of microblogs differing by orders of magnitude. CSID also models each stage of event development according to the temporal characteristics of that development, builds semantic expressions from the low layers to the high layers, selects key features through a flexible k-max pooling operation, and passes them to a final fully connected layer for classification of social media network information. In the model, all microblogs contained in each event are divided into a plurality of groups according to the time stage of event development; an expression is learned for each group of microblogs and sent to a deep convolutional neural network, which finally outputs the probability that the event belongs to unreal information.

The CSID model is established as follows: 1) for a large crawled data set of unreal information and real information, studying the time distribution of each stage of event development as a whole, and determining the time nodes corresponding to each time period; 2) for each event sample, dividing all microblogs into a plurality of parts according to the determined time nodes, and splicing the text contents of the microblogs in each time stage into a paragraph; 3) forming a paragraph data set from the whole data set, and learning an unsupervised expression vector of each paragraph according to a distributed expression algorithm for paragraphs; 4) for an event sample, inputting the expression vector of each stage into a deep convolutional neural network model, obtaining the expression of each stage of the event from the low layers to the high layers through multilayer convolution operations, extracting the key features of each stage through a flexible k-max pooling operation, and finally classifying the input information through a fully connected layer; 5) on the test set, carrying out visualization experiments on the convolution kernels and on the gradients obtained by gradient back-propagation, and analyzing and demonstrating in depth the significant information learned by the model. In experiments on the Sina microblog data set and the Twitter data set, the model achieves more accurate predictions than other existing models.

As shown in fig. 1, an embodiment of the present invention provides a significant information detection method based on a convolutional neural network, where the method includes:

receiving information to be classified;

inputting the information to be classified into a pre-trained significant information detection model;

and the significant information detection model outputs the result that the information to be classified is real information or unreal information.

In an embodiment, the significant information detection model is first trained on existing data. Once the trained model is obtained, newly appearing information is processed in the same way and input into the model, which outputs a probability value representing the probability that the input information is unreal; the larger the output value, the more likely the input information is unreal.

The following describes in detail various problems involved in the technical solutions of the present invention with reference to the accompanying drawings. It should be noted that the described embodiments are only intended to facilitate understanding and do not have any limiting effect on the invention.

To better illustrate the role of the CSID model in unreal information detection and to verify the effect of the present invention, an experiment on the Sina microblog data set is taken as an example. The experimental data set was divided into a 60% training set, a 30% test set, and a 10% validation set.
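A 60/30/10 split of this kind can be sketched as follows. The source does not say how the split was drawn, so the shuffling, the fixed seed, the event-level granularity, and the function name `split_events` are all assumptions made for illustration.

```python
import random

def split_events(event_ids, seed=0):
    """Shuffle event ids and split them 60% / 30% / 10% (train / test / validation).

    Assumption: the split is made at the event level and at random;
    the source only gives the proportions.
    """
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding of the cut points.
    n_train = len(ids) * 6 // 10
    n_test = len(ids) * 3 // 10
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]
    return train, test, validation

train, test, validation = split_events(range(100))
```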

The experiment uses four evaluation indices: Accuracy, Precision, Recall, and F1-score. Precision and Recall are calculated separately for unreal information and for real information, to show the model's ability to detect each of the two kinds of information. The larger the values of the four indices, the better the model's performance in detecting unreal information.
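The four indices can be computed per class as in the sketch below. It uses the standard definitions of these metrics; the function name `classification_metrics`, the label strings "M" (unreal) and "T" (real), and the sample predictions are illustrative assumptions, not data from the experiment.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: "M" = unreal information, "T" = real information.
y_true = ["M", "M", "M", "T", "T", "T"]
y_pred = ["M", "M", "T", "T", "T", "M"]
metrics_unreal = classification_metrics(y_true, y_pred, positive="M")
metrics_real = classification_metrics(y_true, y_pred, positive="T")
```

Calling the function once per class mirrors the separate reporting of Precision and Recall for unreal and real information.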

As shown in fig. 1, the specific experimental steps on the Sina microblog data set are as follows:

In step S1, the crawled data set of unreal information and real information contains a plurality of events $E = \{E_i\}$. (An event may be described by multiple pieces of information; for example, a significant event may be described by many microblogs or news items.) The time distribution of each stage of event development is studied as a whole. First, the timestamps (i.e., the time points at which the information was published) of all microblogs corresponding to all events are collected (microblogs are used here as an example; other kinds of information may also be collected), and the timestamps are sorted in chronological order. The period between the earliest and latest timestamps is then equally divided into M time periods (for example, M = 20), and the time node corresponding to each time period is determined accordingly:

$T_i = [t_{i-1}, t_i), \quad i = 1, 2, \ldots, 20,$

where $T_i$ denotes the i-th time period, and $t_{i-1}$ and $t_i$ denote the start timestamp and the end timestamp of the i-th time period, respectively. In addition, each time node is normalized so that its timestamp falls in the interval 0-1.
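The determination of time nodes described in step S1 can be sketched as follows. This is a minimal illustration assuming numeric timestamps (for example, seconds since some epoch) and min-max normalization to the 0-1 interval; the function name `time_nodes` and the sample timestamps are hypothetical.

```python
def time_nodes(timestamps, m=20):
    """Return (raw_nodes, normalized_nodes) for m equal time periods.

    The nodes t_0 .. t_m bound the periods T_i = [t_{i-1}, t_i).
    """
    ordered = sorted(timestamps)          # chronological order
    t_min, t_max = ordered[0], ordered[-1]
    span = t_max - t_min
    if span == 0:
        raise ValueError("all timestamps are identical; cannot divide the period")
    step = span / m
    raw = [t_min + i * step for i in range(m + 1)]
    normalized = [(t - t_min) / span for t in raw]  # min-max scaling (assumed)
    return raw, normalized

raw, normalized = time_nodes([100, 300, 200, 500], m=4)
```

Under equal division, the normalized nodes are simply i/M; the raw nodes are kept as well so that un-normalized timestamps can also be assigned to a period.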

Step S2: each event sample $E_i$ contains multiple microblogs. First, the timestamps $t_j$ of all microblogs included in the event are normalized to the interval 0-1. All microblogs are then divided into a plurality of parts according to the time nodes determined in S1, and the text contents of the microblogs in each time stage are spliced into a paragraph; that is, the contents of all microblogs whose timestamps fall in the i-th time stage $T_i$ are spliced into one paragraph.
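The grouping in step S2 can be sketched as follows, assuming each microblog is a `(timestamp, text)` pair, that the stages are M equal sub-intervals of the normalized 0-1 range, and that texts are joined with a single space. Because each stage is half-open, the microblog with the latest timestamp is placed in the last stage here; that choice, like the function name `build_paragraphs`, is an assumption.

```python
def build_paragraphs(microblogs, m=20):
    """Group (timestamp, text) pairs into m time stages and join each stage's texts."""
    times = [t for t, _ in microblogs]
    t_min, t_max = min(times), max(times)
    span = t_max - t_min
    buckets = [[] for _ in range(m)]
    for t, text in sorted(microblogs, key=lambda item: item[0]):
        u = (t - t_min) / span if span else 0.0   # normalized timestamp in [0, 1]
        i = min(int(u * m), m - 1)                # clamp u == 1.0 into the last stage
        buckets[i].append(text)
    return [" ".join(bucket) for bucket in buckets]

event = [(0, "a"), (10, "b"), (55, "c"), (100, "d"), (5, "e")]
paragraphs = build_paragraphs(event, m=4)
```

Stages that contain no microblogs yield empty paragraphs in this sketch; how such stages are handled in practice is not specified in the text.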

Step S3: all microblog text content from step S2 is treated as a corpus. The unsupervised distributed expression learning algorithms word2vec and para2vec are applied at the word level and the paragraph level, respectively, to learn an expression vector for each word and each paragraph, forming the matrices W and D. Each column of W and D is the expression vector of a word or of a paragraph, respectively.

Here N denotes the number of words in a paragraph, and the context window width is 2k; that is, the k words before and the k words after the current word are selected as its context. Using the context words and the memory information in the paragraph expression vector, the algorithm maximizes the joint conditional probability p of all words in the paragraph, where p is computed by softmax. $y_i$ denotes the output response for the i-th word, and y is obtained from

$y = b + U^{\top} h(p_j, w_{n-k}, \ldots, w_{n+k}; D, W)$

where $p_j$ is the vector expression of a paragraph and $w_n$ is the vector expression of the n-th word in the paragraph; $p_j$ and $w_n$ are columns of the matrices D and W, respectively. b and U are the softmax parameters, and h is an averaging or concatenation operation.
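The output response and the softmax probability can be sketched numerically as follows. This is a toy illustration with randomly initialized matrices, not a trained model: it assumes h is the averaging variant, that the target word itself is left out of the context, and that U has one column per vocabulary word. The function name `word_probabilities` and all sizes are hypothetical.

```python
import numpy as np

def word_probabilities(p_j, context, U, b):
    """Compute softmax(y) with y = b + U^T h, where h averages p_j and the context vectors."""
    h = np.mean(np.vstack([p_j] + list(context)), axis=0)  # shape (d,)
    y = b + U.T @ h                                         # shape (V,), one response per word
    y = y - y.max()                                         # shift for numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
d, V, k = 8, 50, 2                  # vector size, vocabulary size, half window width
D = rng.normal(size=(d, 3))         # paragraph vectors as columns
W = rng.normal(size=(d, V))         # word vectors as columns
U = rng.normal(size=(d, V))         # softmax weights
b = np.zeros(V)                     # softmax bias

word_ids = [4, 17, 9, 30, 2]        # a 5-word window; the word at position n is the target
n = 2
context = [W[:, word_ids[i]] for i in range(n - k, n + k + 1) if i != n]
probs = word_probabilities(D[:, 0], context, U, b)
```

Training would adjust D, W, U, and b so that `probs[word_ids[n]]` becomes large for observed windows; only the forward computation is shown here.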

Step S4: for an event sample, the paragraph expression vectors $p_j$ from S3 are spliced into a matrix $P \in \mathbb{R}^{d \times n}$, where d and n denote the dimensions of P, and P is input into the deep convolutional neural network model. Multilayer convolution operations yield the expression of each stage of the event from the low layers to the high layers. The output of a layer in the deep neural network model is called a feature map; the output of a low layer is called a low-order feature map, and the output of a high layer is called a high-order feature map. One element of a feature map is obtained by the following convolution operation:

$f[i] = \tanh\left(\langle P[:, i:i+\omega-1],\ C \rangle_F\right)$

where $P[:, i:i+\omega-1]$ denotes the i-th to (i+ω-1)-th columns of the matrix P, ω denotes the width of the convolution kernel, and C denotes the convolution weight matrix. $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product, i.e., the trace taken after matrix multiplication:

$\langle X, Y \rangle_F = \mathrm{Tr}(X Y^{\top})$
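The convolution above can be sketched in NumPy as follows. The sketch assumes a single kernel C of shape d × ω, no padding, and stride 1, so a matrix with n columns yields n − ω + 1 feature-map elements; these choices and the function names are assumptions. Note that Python slices are 0-based and exclude the end index, so `P[:, i:i + omega]` covers the same ω columns as $P[:, i:i+\omega-1]$ in the formula.

```python
import numpy as np

def frobenius_inner(X, Y):
    """Frobenius inner product <X, Y>_F = Tr(X Y^T)."""
    return np.trace(X @ Y.T)

def feature_map(P, C):
    """f[i] = tanh(<P[:, i:i+omega], C>_F) for every window of omega columns."""
    n = P.shape[1]
    omega = C.shape[1]
    return np.array([
        np.tanh(frobenius_inner(P[:, i:i + omega], C))
        for i in range(n - omega + 1)
    ])

P = np.arange(12, dtype=float).reshape(3, 4) / 10.0  # d = 3, n = 4 (made-up values)
C = np.ones((3, 2))                                   # kernel width omega = 2
f = feature_map(P, C)
```

Since Tr(XYᵀ) equals the sum of the element-wise products of X and Y, the same value can be computed as `np.sum(X * Y)`, which avoids forming the full matrix product.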

The key features of each stage of the event are fully extracted through a flexible k-max pooling operation, i.e., the k largest elements of a feature map are extracted as a new feature map. Finally, the input information is classified by a fully connected layer.
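A k-max pooling step can be sketched as follows for a one-dimensional feature map. The text only states that the k largest elements are kept; keeping them in their original order, as is common for k-max pooling, is an assumption here, as is the function name `k_max_pooling`.

```python
import numpy as np

def k_max_pooling(f, k):
    """Keep the k largest elements of f, preserving their original order."""
    f = np.asarray(f)
    if k >= f.size:
        return f.copy()
    top = np.argsort(f, kind="stable")[-k:]  # indices of the k largest values
    return f[np.sort(top)]                   # restore positional order

pooled = k_max_pooling([0.1, 0.9, -0.3, 0.5, 0.7], k=3)
```

Preserving order keeps the temporal arrangement of the selected stages, which matters if further convolution layers follow the pooling step.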

The deep convolutional neural network model can be initialized randomly and then trained continuously in S4 to update the parameters of the model.

Step S5: on the test set, the gradient of the output label with respect to the input matrix is obtained through gradient back-propagation, and saliency analysis of the input matrix identifies the microblog content that plays a significant role in the corresponding input. In addition, the convolution kernels of the first convolutional layer are visualized and analyzed in depth to obtain the distribution characteristics of the microblog content within an event.
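Gradient-based saliency can be illustrated on a deliberately small stand-in model. The sketch below is not the network of the invention: it assumes one convolution kernel C, a linear output weight vector v, and a sigmoid output, and it derives the gradient with respect to the input matrix P by hand. All names, sizes, and values are hypothetical.

```python
import numpy as np

def forward(P, C, v):
    """Toy model: s = sigmoid(v . f), with f[i] = tanh(sum(P[:, i:i+omega] * C))."""
    omega = C.shape[1]
    n_out = P.shape[1] - omega + 1
    f = np.array([np.tanh(np.sum(P[:, i:i + omega] * C)) for i in range(n_out)])
    s = 1.0 / (1.0 + np.exp(-float(v @ f)))
    return s, f

def input_gradient(P, C, v):
    """Back-propagate ds/dP through the sigmoid, the linear layer and tanh."""
    s, f = forward(P, C, v)
    omega = C.shape[1]
    grad = np.zeros_like(P)
    for i in range(len(f)):
        # Chain rule: ds/dz = s(1 - s), dz/df_i = v_i, df_i/dP_window = (1 - f_i^2) * C
        grad[:, i:i + omega] += s * (1.0 - s) * v[i] * (1.0 - f[i] ** 2) * C
    return grad

rng = np.random.default_rng(0)
P = rng.normal(size=(4, 6))        # d = 4, n = 6 time stages (random stand-in input)
C = rng.normal(size=(4, 3))        # kernel width 3
v = rng.normal(size=4)             # one weight per feature-map element (6 - 3 + 1 = 4)

grad = input_gradient(P, C, v)
saliency = np.abs(grad).sum(axis=0)            # one score per time stage (column)
most_salient_stage = int(np.argmax(saliency))
```

Columns of P with large gradient magnitude correspond to time stages whose paragraph vectors most affect the output, which is the idea behind tracing the prediction back to influential microblog content.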

FIG. 2 shows the power-law distribution of unreal information and real information on the microblog data set: for both real and unreal information, the power-law distribution of the number of microblogs over time is reflected in how the proportion of microblogs in each stage changes over time. FIG. 3 shows the experimental results of early detection of unreal information.

Table 1 shows the attribute statistics in the Twitter and Weibo datasets

Table 2: identification of unrealistic information (M: unrealistic information, T: real information)

Table 2 shows the experimental results of the proposed CSID method compared to other methods available

The model provided by the invention reveals the power-law distribution over time of the number of microblogs contained in events in a social media network. Based on this regularity, the time nodes of each stage are determined by equal division over the whole data set, and each event is then segmented according to these time stages, which ensures that each time interval contains the same number of microblogs and that all events share one time scale. The model can thus learn a more faithful expression of events and can fully mine and exploit the temporal regularity of information distribution. Multilayer convolution operations yield the expression of each stage of the event from the low layers to the high layers, so that the high-order interactions and deep semantic expression of each stage can be fully modeled. The flexible k-max pooling operation fully extracts the key features of each stage, which makes the model better suited to dynamic and complex social media scenes.

The invention addresses the task of significant information detection based on a convolutional neural network and specifically targets real social media conditions, such as large volumes of information, markedly different time spans, complex semantic scenes, and dynamic, variable user behavior, so that significant information detection achieves a more accurate result.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A significant information detection method based on a convolutional neural network comprises the following steps:

step S1, for the crawled data set comprising a plurality of event information, determining the time distribution of each stage of the development of each event corresponding to the event information in the data set, and determining the time nodes corresponding to each time period; the event information in the data set comprises unreal event information and real event information, the data set corresponds to a plurality of events, and each event corresponds to at least one unreal event information and/or at least one real event information;

step S3, learning an unsupervised expression vector of each paragraph in the paragraph data set according to a distributed expression learning algorithm of the paragraph;

step S4, for an event, inputting the unsupervised expression vector of each paragraph into a deep convolutional neural network model, obtaining the expression from the bottom layer to the top layer of each stage of the event by utilizing multilayer convolutional operation, extracting the key features of each stage of the event through the k-max pooling operation, and finally classifying the input information through a full connection layer; after the deep convolutional neural network model is trained in the step S4 by using all events, a significant information detection model is obtained;

2. The method according to claim 1, wherein step S1 includes:

determining time stamps of all event information corresponding to the events;

for each event, sequencing the timestamps according to the time sequence;

and determining time nodes corresponding to the multiple time periods.

3. The method according to claim 1, wherein step S2 includes:

4. The method according to claim 1, wherein step S3 includes:

5. The method according to claim 1, wherein step S4 includes:

6. A significant information detection device based on a convolutional neural network comprises:

a vector generation module configured to learn an unsupervised expression vector for each paragraph in the paragraph dataset according to a distributed expression learning algorithm for the paragraph;

the model training module is configured to input the unsupervised expression vector of each paragraph into a deep convolutional neural network model for an event, obtain the expression from the bottom layer to the high layer of each stage of the event by utilizing multilayer convolutional operation, fully extract the key features of each stage of the event through the k-max pooling operation, and finally classify the input information through a full connection layer; after the deep convolutional neural network model is trained by using all events, a significant information detection model is obtained;

7. The apparatus of claim 6, wherein the time node determination module comprises:

8. The apparatus of claim 6, wherein the paragraph generation module comprises:

9. The apparatus of claim 6, wherein the vector generation module comprises:

10. The apparatus of claim 6, wherein the model training module comprises: