CN115563573A - Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction - Google Patents

Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Info

Publication number
CN115563573A
CN115563573A
Authority
CN
China
Prior art keywords
feature
modal
text
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210974704.3A
Other languages
Chinese (zh)
Inventor
李淑真
叶周盛
王雪岭
袁成武
徐莼
冯星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202210974704.3A
Publication of CN115563573A
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, which comprises the following steps: a multi-modal feature extractor extracts a text feature r_t, an image feature r_v and a user feature r_u; a cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u; a multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, and obtains a multi-modal fusion feature a^N through dynamic allocation by a dynamic routing mechanism; a classifier receives the multi-modal fusion feature a^N and outputs the prediction result. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction achieves higher-precision rumor detection by constructing cross-modal relationships and dynamic feature fusion.

Description

Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction
Technical Field
The invention relates to an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction.
Background
Currently, rumor detection methods include machine learning methods and deep learning methods. Traditional machine learning methods model microblogs with a propagation tree and a propagation tree kernel to detect rumors, or perform binary rumor classification with supervised learning based on n-grams and bag-of-words models. Deep learning methods employ a recurrent neural network (RNN) to capture changes in contextual information to distinguish rumors, or obtain higher-performance rumor detection models through mutual confrontation between a text generator and a discriminator. Compared with machine learning models, existing deep learning models have a stronger feature extraction capability and therefore achieve better performance. However, for rumors that combine diverse forms such as pictures and text, existing deep learning methods still require further exploration.
Disclosure of Invention
The invention provides an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, which solves the above-mentioned technical problems and specifically adopts the following technical scheme:
an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction comprises the following steps:
a multi-modal feature extractor receives information to be detected containing text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively;
a cross-modal relationship extractor receives the text feature r_t, the image feature r_v and the user feature r_u, establishes the associations between the modalities, and updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u;
a multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains a multi-modal fusion feature a^N after multiple iterations;
a classifier receives the multi-modal fusion feature a^N and outputs the prediction result.
Further, the multi-modal feature extractor includes a text feature extractor, an image feature extractor, and a user feature extractor.
Further, the text feature extractor comprises a BERT model and a fully connected layer;
the specific method by which the text feature extractor extracts the text feature r_t from the text information is:
performing one-of-N encoding and padding on the text information, expanding the sentence length to 512 to obtain an encoding vector E, and inputting the encoding vector E into the BERT model to obtain an output matrix B = [b_[CLS], b_1, ..., b_n, b_|text|, ..., b_510]^T, where b_[CLS] represents all semantic information in the text information;
inputting b_[CLS] into the fully connected layer of the text feature extractor and obtaining the text feature r_t by the following calculation:
r_t = W_tf · b_[CLS]
where W_tf represents the weight matrix of the fully connected layer of the text feature extractor.
Further, the image feature extractor comprises a VGG19 network and a fully connected layer;
the specific method by which the image feature extractor extracts the image feature r_v from the image information is:
inputting the image information into the VGG19 network to obtain an image feature representation r_VGG;
inputting the image feature representation r_VGG into the fully connected layer of the image feature extractor and obtaining the image feature r_v by the following calculation:
r_v = W_vf · r_VGG
where W_vf represents the weight matrix of the fully connected layer of the image feature extractor.
Further, the user feature extractor extracts the user features of the user information by a method combining manual features with a deep learning model.
Further, the user feature extractor comprises a fully connected layer;
the specific method by which the user feature extractor extracts the user feature r_u from the user information is:
encoding the user information through manual features to obtain a manual feature vector r_raw;
inputting the encoded manual feature vector r_raw into the fully connected layer of the user feature extractor and obtaining the user feature r_u by the following calculation:
r_u = W_uf · r_raw
where W_uf is the weight matrix of the fully connected layer of the user feature extractor.
Further, the cross-modal relationship extractor comprises three fully connected layers and a cross-modal function module;
the specific method by which the cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u to obtain the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u is:
combining the text feature r_t, the image feature r_v and the user feature r_u into a multi-modal feature R = [r_t, r_v, r_u]^T;
inputting the multi-modal feature R into the three fully connected layers of the cross-modal relationship extractor, and generating a key feature matrix K_R, a query feature matrix Q_R and a value feature matrix V_R respectively by the following calculation:
K_R = R·W_K,  Q_R = R·W_Q,  V_R = R·W_V
where W_K, W_Q and W_V are the parameter matrices of the three fully connected layers respectively;
and calculating the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u by the following formula:
U = [u_t, u_v, u_u]^T = softmax(Q_R·K_R^T / √d_m)·V_R
Further, the specific method by which the multi-modal feature fuser calculates the multi-modal fusion feature a^N is:
inputting the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u into six fully connected layers respectively, and obtaining six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u by the following calculation:
r̂_t = W_1·r_t,  r̂_v = W_2·r_v,  r̂_u = W_3·r_u,  û_t = W_4·u_t,  û_v = W_5·u_v,  û_u = W_6·u_u
where W_1, W_2, W_3, W_4, W_5 and W_6 are the parameter matrices of the six fully connected layers respectively;
the multi-modal feature fuser receives the six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and finally obtains the multi-modal fusion feature a^N after multiple iterations.
Further, the specific method by which the classifier receives the multi-modal fusion feature a^N and outputs the prediction result is:
the classifier obtains the prediction probability ŷ by the following calculation:
ŷ = sigmoid( W_p2·LeakyReLU( W_p1·a^N + b_p1 ) + b_p2 )
where W_p1 and W_p2 are learnable parameter matrices, b_p1 and b_p2 are bias terms, and sigmoid and LeakyReLU are activation functions.
Further, the model is optimized by minimizing the cross entropy; the loss function is defined as:
L(Θ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
where Θ represents all learnable parameters of the whole neural network, and y represents the label of the information to be detected, a rumor being 1 and a non-rumor being 0.
The method has the advantages that the modal dynamic feature fusion and cross-modal relationship extraction-based information detection method realizes rumor detection with higher precision by constructing the cross-modal relationship and the dynamic feature fusion.
Drawings
FIG. 1 is a schematic diagram of a prediction model DFCM of the present invention;
FIG. 2 is a schematic diagram of an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to the present invention;
FIG. 3 is a schematic diagram of a text feature extractor of the present invention;
FIG. 4 is a schematic diagram of an image feature extractor of the present invention;
FIG. 5 is a schematic diagram of a cross-modal relationship extractor of the present invention;
FIG. 6 is a schematic diagram of the multi-modal feature fuser of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
FIG. 1 shows the prediction model DFCM disclosed in the present application, which comprises a multi-modal feature extractor, a cross-modal relationship extractor, a multi-modal feature fuser and a classifier. FIG. 2 shows an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, which is implemented based on the prediction model DFCM. Specifically, the information detection method based on modal dynamic feature fusion and cross-modal relationship extraction comprises the following steps. S1: the multi-modal feature extractor receives information to be detected including text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively. S2: the cross-modal relationship extractor receives the text feature r_t, the image feature r_v and the user feature r_u, establishes the associations between the modalities, and updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u. S3: the multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains the multi-modal fusion feature a^N after multiple iterations. S4: the classifier receives the multi-modal fusion feature a^N and outputs the prediction result. Through the above steps, the information detection method based on modal dynamic feature fusion and cross-modal relationship extraction achieves higher-precision rumor detection by constructing cross-modal relationships and dynamic feature fusion. The above steps are described in detail below.
For step S1: the multi-modal feature extractor receives information to be detected including text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively.
The prediction model DFCM is mainly used for predicting whether a post d in a forum or a microblog is a rumor. Therefore, the information to be detected is a post d, and the post d specifically includes text information d.text, image information d.image and user information d.user. First, the multimodal feature extractor extracts feature information from the post d.
In particular, the multimodal feature extractor includes a text feature extractor, an image feature extractor, and a user feature extractor.
As shown in fig. 3, the text feature extractor includes a BERT model and a fully connected layer.
The specific method by which the text feature extractor extracts the text feature r_t from the text information is as follows.
The text information d.text is one-of-N encoded and padded, and the sentence length is expanded to 512; this length is the input length limit of the BERT model. The result is the encoding vector E = [e_[CLS], e_1, ..., e_|text|, e_[SEP], ..., e_510], where e represents a word, |text| represents the length of the input text, [CLS] is a sentence start identifier, [SEP] is a sentence end identifier, and the positions after [SEP] form the padded portion.
To effectively extract the text feature r_t of the text information d.text, the present application employs a pre-trained BERT model. The BERT model is a multi-layer bidirectional Transformer encoder. Inputting the encoding vector E into the BERT model yields the output matrix B = [b_[CLS], b_1, ..., b_n, b_|text|, ..., b_510]^T, B ∈ R^(512×d_B), where b_[CLS] represents all semantic information in the text information and d_B is the output dimension of the bidirectional Transformer encoder.
Then b_[CLS] is input into the fully connected layer of the text feature extractor, and the text feature r_t is obtained by the following calculation:
r_t = W_tf · b_[CLS]
where W_tf ∈ R^(d_m×d_B) represents the weight matrix of the fully connected layer of the text feature extractor, d_m is the hidden layer dimension, and the resulting feature r_t ∈ R^(d_m).
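By way of illustration only, the text branch described above can be sketched in PyTorch as follows. The class name TextFeatureExtractor, the bert-base-chinese checkpoint and the hidden dimension d_m = 256 are assumptions made for the sketch and are not fixed by this disclosure.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextFeatureExtractor(nn.Module):
    """BERT encoder followed by one fully connected layer: r_t = W_tf · b_[CLS]."""
    def __init__(self, d_m=256, bert_name="bert-base-chinese"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        d_B = self.bert.config.hidden_size          # output dimension d_B of the BERT encoder
        self.fc = nn.Linear(d_B, d_m, bias=False)   # weight matrix W_tf

    def forward(self, texts):
        # Encode and pad/truncate the sentence to the BERT input length limit of 512.
        enc = self.tokenizer(texts, padding="max_length", truncation=True,
                             max_length=512, return_tensors="pt")
        out = self.bert(**enc)
        b_cls = out.last_hidden_state[:, 0]          # b_[CLS]: sentence-level semantics
        return self.fc(b_cls)                        # r_t, shape (batch, d_m)

# Hypothetical usage: r_t = TextFeatureExtractor()(["example post text d.text"])
```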
As shown in FIG. 4, the image feature extractor contains a VGG19 network and a fully connected layer. To effectively extract the image feature r_v of the image information d.image, the present application first extracts the object information in the picture with a pre-trained VGG19 network, and adds a fully connected layer (visual-fc) after the last layer of the VGG19 network. On one hand, this adjusts the size of the image feature so that the image feature dimension is unified with the text feature dimension, in preparation for multi-modal feature fusion; on the other hand, since the VGG19 network is not retrained during training, the fully connected layer can further extract the features in the picture that are relevant to rumor detection.
The specific method by which the image feature extractor extracts the image feature r_v from the image information is as follows.
The image information d.image is input into the VGG19 network to obtain the image feature representation r_VGG ∈ R^(d_v), where d_v is the output dimension of the VGG19 network.
The image feature representation r_VGG is then input into the fully connected layer of the image feature extractor, and the image feature r_v is obtained by the following calculation:
r_v = W_vf · r_VGG
where W_vf ∈ R^(d_m×d_v) represents the weight matrix of the fully connected layer of the image feature extractor, d_m is the hidden layer dimension, and r_v ∈ R^(d_m).
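A corresponding minimal sketch of the image branch is given below. It assumes the 1000-dimensional output of the VGG19 classifier head is used as r_VGG; the disclosure does not specify which VGG19 layer provides d_v, so this choice, together with the names ImageFeatureExtractor and visual_fc, is illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    """Frozen VGG19 backbone plus one fully connected layer: r_v = W_vf · r_VGG."""
    def __init__(self, d_m=256):
        super().__init__()
        self.vgg = models.vgg19(weights="IMAGENET1K_V1")
        for p in self.vgg.parameters():               # VGG19 is not retrained; only visual_fc learns
            p.requires_grad = False
        d_v = 1000                                    # assumed: output dimension of the VGG19 classifier head
        self.visual_fc = nn.Linear(d_v, d_m, bias=False)   # weight matrix W_vf

    def forward(self, images):                        # images: (batch, 3, 224, 224), ImageNet-normalized
        with torch.no_grad():
            r_vgg = self.vgg(images)                  # r_VGG, shape (batch, d_v)
        return self.visual_fc(r_vgg)                  # r_v, shape (batch, d_m)
```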
In the application, the user feature extractor extracts the user features of the user information by adopting a method of combining manual features and a deep learning model. The manual characteristics of the user information d.user are shown in table 1.
The user feature extractor includes a fully connected layer.
The specific method by which the user feature extractor extracts the user feature r_u from the user information is as follows.
The user information d.user is encoded through the manual features to obtain the manual feature vector r_raw ∈ R^(d_u), where d_u is the manual feature dimension.
The encoded manual feature vector r_raw is input into the fully connected layer of the user feature extractor, and the user feature r_u is obtained by the following calculation:
r_u = W_uf · r_raw
where W_uf ∈ R^(d_m×d_u) is the weight matrix of the fully connected layer of the user feature extractor, and r_u ∈ R^(d_m).
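The user branch reduces to a single linear projection of the hand-crafted feature vector. The sketch below assumes d_u = 10 illustrative profile statistics, since Table 1 is not reproduced in this text; the example feature names in the comment are hypothetical.

```python
import torch
import torch.nn as nn

class UserFeatureExtractor(nn.Module):
    """One fully connected layer over hand-crafted user statistics: r_u = W_uf · r_raw."""
    def __init__(self, d_u=10, d_m=256):              # d_u: number of hand-crafted user features
        super().__init__()
        self.fc = nn.Linear(d_u, d_m, bias=False)     # weight matrix W_uf

    def forward(self, r_raw):                         # r_raw: (batch, d_u) encoded user profile d.user
        return self.fc(r_raw)                         # r_u, shape (batch, d_m)

# Hypothetical hand-crafted features: follower count, friend count, verified flag,
# account age, historical post count, etc.
r_u = UserFeatureExtractor()(torch.randn(2, 10))
```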
for step S2: the cross-modal relationship extractor receives a text feature r t Image feature r v And user characteristics r u Establishing the association among the modalities, and aligning the text characteristics r according to the association among the modalities t Image feature r v And user characteristics r u Updating to obtain enhanced text features u t Enhancing image features u v And enhanced user features u u
As shown in fig. 5, the cross-modal relationship extractor of the present application comprises three fully connected layers and one crossstacking function module.
The specific method by which the cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u to obtain the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u is as follows.
The text feature r_t, the image feature r_v and the user feature r_u are combined into the multi-modal feature R = [r_t, r_v, r_u]^T.
The multi-modal feature R is input into the three fully connected layers of the cross-modal relationship extractor, and the key feature matrix K_R, the query feature matrix Q_R and the value feature matrix V_R are generated respectively by the following calculation:
K_R = R·W_K,  Q_R = R·W_Q,  V_R = R·W_V
where W_K, W_Q and W_V are the parameter matrices of the three fully connected layers respectively. The K_R matrix is used to match the features of the other modalities, the Q_R matrix waits to be matched by the features of the other modalities, and the V_R matrix serves as the values to be summed.
Then the cross-modal relationships between the modalities are established: the similarity matrix between different modalities is calculated from the key feature matrix K_R and the query feature matrix Q_R by a scaled dot product, and the calculation process is as follows:
U = softmax(Q_R·K_R^T / √d_m)·V_R
where d_m is the dimension of W_K and plays a scaling role. For simplicity of derivation, if the softmax and scaling functions in the above equation are omitted, the equation can be expanded as follows:
u_t = (q_t·k_t)·v_t + (q_t·k_v)·v_v + (q_t·k_u)·v_u
Taking the text feature r_t as an example: the text feature r_t first computes, through its query vector q_t, the similarity between the text feature r_t and the features of all modalities; the similarities are then used as weights for a weighted sum; and finally the updated enhanced text feature u_t is obtained. The updating process of the other modalities is similar; in this process, the cross-modal relationships are established.
Finally, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u are obtained, and the result can be expressed as U = [u_t, u_v, u_u]^T, where u_t, u_v, u_u ∈ R^(d_m).
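Read this way, the cross-modal relationship extractor amounts to one scaled dot-product attention layer over the three modality vectors. A minimal sketch under that reading is given below; the class name and the dimension d_m = 256 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalRelationExtractor(nn.Module):
    """Scaled dot-product attention across the three modality vectors:
    U = softmax(Q_R · K_R^T / sqrt(d_m)) · V_R."""
    def __init__(self, d_m=256):
        super().__init__()
        self.W_K = nn.Linear(d_m, d_m, bias=False)    # parameter matrix W_K
        self.W_Q = nn.Linear(d_m, d_m, bias=False)    # parameter matrix W_Q
        self.W_V = nn.Linear(d_m, d_m, bias=False)    # parameter matrix W_V
        self.d_m = d_m

    def forward(self, r_t, r_v, r_u):
        R = torch.stack([r_t, r_v, r_u], dim=1)       # multi-modal feature R, (batch, 3, d_m)
        K, Q, V = self.W_K(R), self.W_Q(R), self.W_V(R)
        scores = Q @ K.transpose(-2, -1) / self.d_m ** 0.5   # modality-to-modality similarity
        U = F.softmax(scores, dim=-1) @ V             # enhanced features, (batch, 3, d_m)
        u_t, u_v, u_u = U[:, 0], U[:, 1], U[:, 2]
        return u_t, u_v, u_u
```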
For step S3: the multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains the multi-modal fusion feature a^N after multiple iterations.
As shown in FIG. 6, the specific method by which the multi-modal feature fuser calculates the multi-modal fusion feature a^N is as follows.
The text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u are input into six fully connected layers respectively, and six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u are obtained by the following calculation:
r̂_t = W_1·r_t,  r̂_v = W_2·r_v,  r̂_u = W_3·r_u,  û_t = W_4·u_t,  û_v = W_5·u_v,  û_u = W_6·u_u
where W_1, W_2, W_3, W_4, W_5 and W_6 are the parameter matrices of the six fully connected layers respectively. The multi-modal feature fuser receives the six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and finally obtains the multi-modal fusion feature a^N after multiple iterations.
The dynamic routing mechanism of the multi-modal feature fuser of the present application adopts the dynamic routing method proposed by Sabour et al.
Dynamic routing is a mechanism by which capsules output vectors; a powerful dynamic routing mechanism can ensure that the output of a capsule is sent to the appropriate parent capsule in the layer above.
In a fully connected neural network, a neuron can be calculated by the following formula:
y_j = f( Σ_i W_ij·x_i )
where the parameter matrix W_ij is trained with a back-propagation algorithm through a global loss function. Iterative dynamic routing provides an alternative that decides how a capsule is activated by using the properties of local features. Such a method allows the inputs to be combined into a parse tree in a better and simpler way with lower risk. In dynamic routing, the output is routed to all possible parent nodes, but is scaled down by coupling coefficients that sum to 1. For each possible parent node, a "prediction vector" is computed in each iteration round by multiplying by a weight matrix. If the prediction vector has a large scalar product with the output of a possible parent node, there is top-down feedback that increases the coupling coefficient of that node and decreases the coupling coefficients of the other nodes. This kind of "routing by agreement" can be far more effective than the primitive form of routing implemented by max-pooling.
Specifically, one iteration of the dynamic routing of the present application proceeds as follows:
c_i = softmax(b_i),  a^r = squash( Σ_i c_i·j_i ),  b_i ← b_i + j_i·a^r
where N is the number of iteration rounds, b_i is the routing logit of the input feature vector j_i, and the "squash" function shrinks short vectors to a length close to zero and long vectors to a length slightly less than 1 without changing the vector direction. In each iteration round of the dynamic routing, the prediction vector a^r is obtained by the weighted sum and the "squash" function; if an input feature vector j_i has a larger dot product with the prediction vector a^r (i.e., is more similar to it), the next iteration round increases the coupling coefficient c_i of that feature vector and decreases the coupling coefficients of the other feature vectors. Eventually, the output of the dynamic routing tends toward the most prominent modal features while the other modal features are also fused in. The result after the N rounds of dynamic routing iterations is the feature fusion result a^N ∈ R^(d_m).
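A minimal sketch of this routing-by-agreement fusion is given below. It assumes the six projected feature vectors have already been stacked into a tensor J of shape (batch, 6, d_m); the function names squash and dynamic_fusion and the default of three iterations are illustrative, not prescribed by the disclosure.

```python
import torch
import torch.nn.functional as F

def squash(v, eps=1e-8):
    """Shrink short vectors toward zero length and long vectors toward unit length."""
    n2 = (v ** 2).sum(dim=-1, keepdim=True)
    return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

def dynamic_fusion(J, n_iters=3):
    """Fuse the six projected modality vectors J: (batch, 6, d_m) by routing-by-agreement.
    Returns the multi-modal fusion feature a^N of shape (batch, d_m)."""
    b = torch.zeros(J.shape[:2], device=J.device)      # routing logits b_i, one per input vector
    for _ in range(n_iters):                           # N iteration rounds
        c = F.softmax(b, dim=1)                        # coupling coefficients c_i (sum to 1)
        a = squash((c.unsqueeze(-1) * J).sum(dim=1))   # prediction vector a^r
        b = b + (J * a.unsqueeze(1)).sum(dim=-1)       # agreement: raise c_i of similar inputs
    return a

# Hypothetical usage: a_N = dynamic_fusion(torch.randn(2, 6, 256), n_iters=3)
```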
For step S4: the classifier receives the multi-modal fusion feature a^N and outputs the prediction result.
Further, the fusion feature a^N is input into the classifier to obtain the prediction result of the post d. The classifier outputs the prediction probability ŷ that the post is a rumor by the following calculation:
ŷ = sigmoid( W_p2·LeakyReLU( W_p1·a^N + b_p1 ) + b_p2 )
where W_p1 and W_p2 are learnable parameter matrices, d_p is the dimension of the classifier hidden layer, b_p1 and b_p2 are bias terms, and sigmoid and LeakyReLU are activation functions.
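The classifier itself is a two-layer perceptron with a LeakyReLU hidden activation and a sigmoid output; a minimal sketch follows, with the dimension d_p = 64 chosen only for illustration.

```python
import torch
import torch.nn as nn

class RumorClassifier(nn.Module):
    """Two fully connected layers: y_hat = sigmoid(W_p2 · LeakyReLU(W_p1 · a^N + b_p1) + b_p2)."""
    def __init__(self, d_m=256, d_p=64):
        super().__init__()
        self.fc1 = nn.Linear(d_m, d_p)    # W_p1, b_p1
        self.fc2 = nn.Linear(d_p, 1)      # W_p2, b_p2
        self.act = nn.LeakyReLU()

    def forward(self, a_N):               # a_N: (batch, d_m) multi-modal fusion feature
        return torch.sigmoid(self.fc2(self.act(self.fc1(a_N)))).squeeze(-1)
```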
Further, the model is optimized by minimizing the cross entropy; the loss function is defined as:
L(Θ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
where Θ represents all learnable parameters of the whole neural network, and y represents the label of the information to be detected, a rumor being 1 and a non-rumor being 0.
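Training then reduces to minimizing this binary cross entropy over labelled posts. The sketch below shows one optimization step; the use of the Adam optimizer is an assumption, as the disclosure does not name an optimizer.

```python
import torch
import torch.nn.functional as F

def training_step(y_hat, y, optimizer):
    """One optimization step minimizing L(Theta) = -[ y·log(y_hat) + (1 - y)·log(1 - y_hat) ]."""
    loss = F.binary_cross_entropy(y_hat, y.float())
    optimizer.zero_grad()
    loss.backward()                       # backpropagation through all learnable parameters Theta
    optimizer.step()
    return loss.item()

# Hypothetical usage, given a combined DFCM model object:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = training_step(model(post_batch), labels, optimizer)
```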
The microblog data sets were tested by different prediction models, and the results are shown in table 2. The accuracy, precision, recall and F1 score of the model DFCM proposed herein are 86.42%, 84.26%, 90.61% and 87.32% respectively, which are superior to other models.
Table 2 experimental results of different models on microblog data sets
(Table 2 is provided as an image in the original publication.)
The twitter data sets were subjected to experiments with different predictive models and the results are shown in table 3. The accuracy, precision, recall and F1 score of the model DFCM proposed herein are 88.64%, 91.93%, 91.01% and 91.47%, respectively, which are also superior to other models.
Table 3 experimental results of different models on twitter data sets
(Table 3 is provided as an image in the original publication.)
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalents or equivalent changes fall within the protection scope of the present invention.

Claims (10)

1. An information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, characterized by comprising the following steps:
a multi-modal feature extractor receives information to be detected comprising text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively;
a cross-modal relationship extractor receives the text feature r_t, the image feature r_v and the user feature r_u, establishes the associations between the modalities, and updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u;
a multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains a multi-modal fusion feature a^N after multiple iterations;
a classifier receives the multi-modal fusion feature a^N and outputs the prediction result.
2. The method for detecting information based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 1,
the multi-modal feature extractor includes a text feature extractor, an image feature extractor, and a user feature extractor.
3. The method for detecting information based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 2,
the text feature extractor comprises a BERT model and a fully connected layer;
the specific method by which the text feature extractor extracts the text feature r_t from the text information is:
performing one-of-N encoding and padding on the text information, expanding the sentence length to 512 to obtain an encoding vector E, and inputting the encoding vector E into the BERT model to obtain an output matrix B = [b_[CLS], b_1, ..., b_n, b_|text|, ..., b_510]^T, where b_[CLS] represents all semantic information in the text information;
inputting b_[CLS] into the fully connected layer of the text feature extractor and obtaining the text feature r_t by the following calculation:
r_t = W_tf·b_[CLS]
where W_tf represents the weight matrix of the fully connected layer of the text feature extractor.
4. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 3, characterized in that
the image feature extractor comprises a VGG19 network and a fully connected layer;
the specific method by which the image feature extractor extracts the image feature r_v from the image information is:
inputting the image information into the VGG19 network to obtain an image feature representation r_VGG;
inputting the image feature representation r_VGG into the fully connected layer of the image feature extractor and obtaining the image feature r_v by the following calculation:
r_v = W_vf·r_VGG
where W_vf represents the weight matrix of the fully connected layer of the image feature extractor.
5. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 4, characterized in that
the user feature extractor extracts the user features of the user information by adopting a method of combining manual features and a deep learning model.
6. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 5, characterized in that
the user feature extractor comprises a fully connected layer;
the specific method by which the user feature extractor extracts the user feature r_u from the user information is:
encoding the user information through the manual features to obtain a manual feature vector r_raw;
inputting the encoded manual feature vector r_raw into the fully connected layer of the user feature extractor and obtaining the user feature r_u by the following calculation:
r_u = W_uf·r_raw
where W_uf is the weight matrix of the fully connected layer of the user feature extractor.
7. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 6, characterized in that
the cross-modal relationship extractor comprises three fully connected layers and a cross-modal function module;
the specific method by which the cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u to obtain the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u is:
combining the text feature r_t, the image feature r_v and the user feature r_u into a multi-modal feature R = [r_t, r_v, r_u]^T;
inputting the multi-modal feature R into the three fully connected layers of the cross-modal relationship extractor, and generating a key feature matrix K_R, a query feature matrix Q_R and a value feature matrix V_R respectively by the following calculation:
K_R = R·W_K,  Q_R = R·W_Q,  V_R = R·W_V
where W_K, W_Q and W_V are the parameter matrices of the three fully connected layers respectively;
and calculating the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u by the following formula:
U = [u_t, u_v, u_u]^T = softmax(Q_R·K_R^T / √d_m)·V_R
8. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 7, characterized in that
the specific method by which the multi-modal feature fuser calculates the multi-modal fusion feature a^N is:
inputting the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u into six fully connected layers respectively, and obtaining six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u by the following calculation:
r̂_t = W_1·r_t,  r̂_v = W_2·r_v,  r̂_u = W_3·r_u,  û_t = W_4·u_t,  û_v = W_5·u_v,  û_u = W_6·u_u
where W_1, W_2, W_3, W_4, W_5 and W_6 are the parameter matrices of the six fully connected layers respectively;
the multi-modal feature fuser receives the six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and finally obtains the multi-modal fusion feature a^N after multiple iterations.
9. The method for information detection based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 8,
the specific method by which the classifier receives the multi-modal fusion feature a^N and outputs the prediction result is:
the classifier obtains the prediction probability ŷ by the following calculation:
ŷ = sigmoid( W_p2·LeakyReLU( W_p1·a^N + b_p1 ) + b_p2 )
where W_p1 and W_p2 are learnable parameter matrices, b_p1 and b_p2 are bias terms, and sigmoid and LeakyReLU are activation functions.
10. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 9, characterized in that
the model is optimized by minimizing the cross entropy, and the loss function is defined as:
L(Θ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
where Θ represents all learnable parameters of the whole neural network, and y represents the label of the information to be detected, a rumor being 1 and a non-rumor being 0.
CN202210974704.3A 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction Pending CN115563573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210974704.3A CN115563573A (en) 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210974704.3A CN115563573A (en) 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Publications (1)

Publication Number Publication Date
CN115563573A true CN115563573A (en) 2023-01-03

Family

ID=84739538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210974704.3A Pending CN115563573A (en) 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Country Status (1)

Country Link
CN (1) CN115563573A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876941A (en) * 2024-03-08 2024-04-12 杭州阿里云飞天信息技术有限公司 Target multi-mode model system, construction method, video processing model training method and video processing method


Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN111985245B (en) Relationship extraction method and system based on attention cycle gating graph convolution network
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
CN110222140B (en) Cross-modal retrieval method based on counterstudy and asymmetric hash
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN111897913B (en) Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
CN111309971B (en) Multi-level coding-based text-to-video cross-modal retrieval method
WO2020107878A1 (en) Method and apparatus for generating text summary, computer device and storage medium
CN111274398B (en) Method and system for analyzing comment emotion of aspect-level user product
WO2023280065A1 (en) Image reconstruction method and apparatus for cross-modal communication system
CN111061843A (en) Knowledge graph guided false news detection method
CN111274375B (en) Multi-turn dialogue method and system based on bidirectional GRU network
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN111444367A (en) Image title generation method based on global and local attention mechanism
CN108256968A (en) A kind of electric business platform commodity comment of experts generation method
CN113806609A (en) Multi-modal emotion analysis method based on MIT and FSM
Mao et al. Chinese sign language recognition with sequence to sequence learning
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN112487200A (en) Improved deep recommendation method containing multi-side information and multi-task learning
CN114942998B (en) Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data
CN115563573A (en) Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction
JP7181999B2 (en) SEARCH METHOD AND SEARCH DEVICE, STORAGE MEDIUM
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination