CN115563573A - Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction - Google Patents

Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Info

Publication number
CN115563573A
CN115563573A
Authority
CN
China
Prior art keywords
feature
modal
text
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210974704.3A
Other languages
Chinese (zh)
Inventor
李淑真
叶周盛
王雪岭
袁成武
徐莼
冯星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202210974704.3A
Publication of CN115563573A
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, which comprises the following steps: a multi-modal feature extractor extracts a text feature r_t, an image feature r_v and a user feature r_u; a cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u; a multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, and obtains a multi-modal fusion feature a^N through dynamic allocation by a dynamic routing mechanism; a classifier receives the multi-modal fusion feature a^N and outputs the prediction result. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction achieves higher-precision rumor detection by constructing cross-modal relationships and dynamic feature fusion.

Description

Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction
Technical Field
The invention relates to an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction.
Background
Currently, rumor detection methods include machine learning methods and deep learning methods. Traditional machine learning methods model microblogs with a propagation tree and a propagation tree kernel to detect rumors, or perform binary rumor classification with supervised learning based on n-grams and bag-of-words models. Deep learning methods employ a recurrent neural network (RNN) to capture changes in contextual information to distinguish rumors, or obtain higher-performance rumor detection models through mutual confrontation between a text generator and a discriminator. Compared with machine learning models, existing deep learning models have a stronger feature extraction capability and therefore achieve better performance. However, for rumors that combine diverse forms such as pictures and text, existing deep learning methods still require further exploration.
Disclosure of Invention
The invention provides an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, which solves the above-mentioned technical problems and specifically adopts the following technical scheme:
an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction comprises the following steps:
a multi-modal feature extractor receives information to be detected containing text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively;
a cross-modal relationship extractor receives the text feature r_t, the image feature r_v and the user feature r_u, establishes the associations between the modalities, and updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u;
a multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains a multi-modal fusion feature a^N after multiple iterations;
a classifier receives the multi-modal fusion feature a^N and outputs the prediction result.
Further, the multi-modal feature extractor includes a text feature extractor, an image feature extractor, and a user feature extractor.
Further, the text feature extractor comprises a BERT model and a fully connected layer;
the specific method by which the text feature extractor extracts the text feature r_t from the text information is:
performing one-of-N encoding and padding on the text information, expanding the sentence length to 512 to obtain an encoding vector E, and inputting the encoding vector E into the BERT model to obtain an output matrix B = [b_[CLS], b_1, ..., b_n, b_|text|, ..., b_510]^T, where b_[CLS] represents all semantic information in the text information;
inputting b_[CLS] into the fully connected layer of the text feature extractor and obtaining the text feature r_t by the following calculation:
r_t = W_tf · b_[CLS]
where W_tf represents the weight matrix of the fully connected layer of the text feature extractor.
Further, the image feature extractor comprises a VGG19 network and a fully connected layer;
the specific method by which the image feature extractor extracts the image feature r_v from the image information is:
inputting the image information into the VGG19 network to obtain an image feature representation r_VGG;
inputting the image feature representation r_VGG into the fully connected layer of the image feature extractor and obtaining the image feature r_v by the following calculation:
r_v = W_vf · r_VGG
where W_vf represents the weight matrix of the fully connected layer of the image feature extractor.
Further, the user feature extractor extracts the user features of the user information by a method combining manual features with a deep learning model.
Further, the user feature extractor comprises a fully connected layer;
the specific method by which the user feature extractor extracts the user feature r_u from the user information is:
encoding the user information through manual features to obtain a manual feature vector r_raw;
inputting the encoded manual feature vector r_raw into the fully connected layer of the user feature extractor and obtaining the user feature r_u by the following calculation:
r_u = W_uf · r_raw
where W_uf is the weight matrix of the fully connected layer of the user feature extractor.
Further, the cross-modal relationship extractor comprises three fully connected layers and a cross-modal function module;
the specific method by which the cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u to obtain the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u is:
combining the text feature r_t, the image feature r_v and the user feature r_u into a multi-modal feature R = [r_t, r_v, r_u]^T;
inputting the multi-modal feature R into the three fully connected layers of the cross-modal relationship extractor, and generating a key feature matrix K_R, a query feature matrix Q_R and a value feature matrix V_R respectively by the following calculation:
K_R = R·W_K,  Q_R = R·W_Q,  V_R = R·W_V
where W_K, W_Q and W_V are the parameter matrices of the three fully connected layers respectively;
and calculating the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u by the following formula:
U = [u_t, u_v, u_u]^T = softmax(Q_R·K_R^T / √d_m)·V_R
Further, the specific method by which the multi-modal feature fuser calculates the multi-modal fusion feature a^N is:
inputting the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u into six fully connected layers respectively, and obtaining six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u by the following calculation:
r̂_t = W_1·r_t,  r̂_v = W_2·r_v,  r̂_u = W_3·r_u,  û_t = W_4·u_t,  û_v = W_5·u_v,  û_u = W_6·u_u
where W_1, W_2, W_3, W_4, W_5 and W_6 are the parameter matrices of the six fully connected layers respectively;
the multi-modal feature fuser receives the six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and finally obtains the multi-modal fusion feature a^N after multiple iterations.
Further, the specific method by which the classifier receives the multi-modal fusion feature a^N and outputs the prediction result is:
the classifier obtains the prediction probability ŷ by the following calculation:
ŷ = sigmoid( W_p2·LeakyReLU( W_p1·a^N + b_p1 ) + b_p2 )
where W_p1 and W_p2 are learnable parameter matrices, b_p1 and b_p2 are bias terms, and sigmoid and LeakyReLU are activation functions.
Further, the model is optimized by minimizing the cross entropy; the loss function is defined as:
L(Θ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
where Θ represents all learnable parameters of the whole neural network, and y represents the label of the information to be detected, a rumor being 1 and a non-rumor being 0.
The method has the advantages that the modal dynamic feature fusion and cross-modal relationship extraction-based information detection method realizes rumor detection with higher precision by constructing the cross-modal relationship and the dynamic feature fusion.
Drawings
FIG. 1 is a schematic diagram of a prediction model DFCM of the present invention;
FIG. 2 is a schematic diagram of an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to the present invention;
FIG. 3 is a schematic diagram of a text feature extractor of the present invention;
FIG. 4 is a schematic diagram of an image feature extractor of the present invention;
FIG. 5 is a schematic diagram of a cross-modal relationship extractor of the present invention;
FIG. 6 is a schematic diagram of the multi-modal feature fuser of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
FIG. 1 shows the prediction model DFCM disclosed in the present application, which comprises a multi-modal feature extractor, a cross-modal relationship extractor, a multi-modal feature fuser and a classifier. FIG. 2 shows an information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, which is implemented based on the prediction model DFCM. Specifically, the information detection method based on modal dynamic feature fusion and cross-modal relationship extraction comprises the following steps. S1: the multi-modal feature extractor receives information to be detected including text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively. S2: the cross-modal relationship extractor receives the text feature r_t, the image feature r_v and the user feature r_u, establishes the associations between the modalities, and updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u. S3: the multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains the multi-modal fusion feature a^N after multiple iterations. S4: the classifier receives the multi-modal fusion feature a^N and outputs the prediction result. Through the above steps, the information detection method based on modal dynamic feature fusion and cross-modal relationship extraction achieves higher-precision rumor detection by constructing cross-modal relationships and dynamic feature fusion. The above steps are described in detail below.
For step S1: the multi-modal feature extractor receives information to be detected including text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively.
The prediction model DFCM is mainly used for predicting whether a post d in a forum or a microblog is a rumor. Therefore, the information to be detected is a post d, and the post d specifically includes text information d.text, image information d.image and user information d.user. First, the multimodal feature extractor extracts feature information from the post d.
In particular, the multimodal feature extractor includes a text feature extractor, an image feature extractor, and a user feature extractor.
As shown in fig. 3, the text feature extractor includes a BERT model and a fully connected layer.
The specific method by which the text feature extractor extracts the text feature r_t from the text information is as follows.
The text information d.text is one-of-N encoded and padded, and the sentence length is expanded to 512; this length is the input length limit of the BERT model. The result is the encoding vector E = [e_[CLS], e_1, ..., e_|text|, e_[SEP], ..., e_510], where e represents a word, |text| represents the length of the input text, [CLS] is a sentence start identifier, [SEP] is a sentence end identifier, and the positions after [SEP] form the padded portion.
To effectively extract the text feature r_t of the text information d.text, the present application employs a pre-trained BERT model. The BERT model is a multi-layer bidirectional Transformer encoder. Inputting the encoding vector E into the BERT model yields the output matrix B = [b_[CLS], b_1, ..., b_n, b_|text|, ..., b_510]^T, B ∈ R^(512×d_B), where b_[CLS] represents all semantic information in the text information and d_B is the output dimension of the bidirectional Transformer encoder.
Then b_[CLS] is input into the fully connected layer of the text feature extractor, and the text feature r_t is obtained by the following calculation:
r_t = W_tf · b_[CLS]
where W_tf ∈ R^(d_m×d_B) represents the weight matrix of the fully connected layer of the text feature extractor, d_m is the hidden layer dimension, and the resulting feature r_t ∈ R^(d_m).
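By way of illustration only, the text branch described above can be sketched in PyTorch as follows. The class name TextFeatureExtractor, the bert-base-chinese checkpoint and the hidden dimension d_m = 256 are assumptions made for the sketch and are not fixed by this disclosure.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextFeatureExtractor(nn.Module):
    """BERT encoder followed by one fully connected layer: r_t = W_tf · b_[CLS]."""
    def __init__(self, d_m=256, bert_name="bert-base-chinese"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        d_B = self.bert.config.hidden_size          # output dimension d_B of the BERT encoder
        self.fc = nn.Linear(d_B, d_m, bias=False)   # weight matrix W_tf

    def forward(self, texts):
        # Encode and pad/truncate the sentence to the BERT input length limit of 512.
        enc = self.tokenizer(texts, padding="max_length", truncation=True,
                             max_length=512, return_tensors="pt")
        out = self.bert(**enc)
        b_cls = out.last_hidden_state[:, 0]          # b_[CLS]: sentence-level semantics
        return self.fc(b_cls)                        # r_t, shape (batch, d_m)

# Hypothetical usage: r_t = TextFeatureExtractor()(["example post text d.text"])
```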
As shown in FIG. 4, the image feature extractor contains a VGG19 network and a fully connected layer. To effectively extract the image feature r_v of the image information d.image, the present application first extracts the object information in the picture with a pre-trained VGG19 network, and adds a fully connected layer (visual-fc) after the last layer of the VGG19 network. On one hand, this adjusts the size of the image feature so that the image feature dimension is unified with the text feature dimension, in preparation for multi-modal feature fusion; on the other hand, since the VGG19 network is not retrained during training, the fully connected layer can further extract the features in the picture that are relevant to rumor detection.
The specific method by which the image feature extractor extracts the image feature r_v from the image information is as follows.
The image information d.image is input into the VGG19 network to obtain the image feature representation r_VGG ∈ R^(d_v), where d_v is the output dimension of the VGG19 network.
The image feature representation r_VGG is then input into the fully connected layer of the image feature extractor, and the image feature r_v is obtained by the following calculation:
r_v = W_vf · r_VGG
where W_vf ∈ R^(d_m×d_v) represents the weight matrix of the fully connected layer of the image feature extractor, d_m is the hidden layer dimension, and r_v ∈ R^(d_m).
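A corresponding minimal sketch of the image branch is given below. It assumes the 1000-dimensional output of the VGG19 classifier head is used as r_VGG; the disclosure does not specify which VGG19 layer provides d_v, so this choice, together with the names ImageFeatureExtractor and visual_fc, is illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    """Frozen VGG19 backbone plus one fully connected layer: r_v = W_vf · r_VGG."""
    def __init__(self, d_m=256):
        super().__init__()
        self.vgg = models.vgg19(weights="IMAGENET1K_V1")
        for p in self.vgg.parameters():               # VGG19 is not retrained; only visual_fc learns
            p.requires_grad = False
        d_v = 1000                                    # assumed: output dimension of the VGG19 classifier head
        self.visual_fc = nn.Linear(d_v, d_m, bias=False)   # weight matrix W_vf

    def forward(self, images):                        # images: (batch, 3, 224, 224), ImageNet-normalized
        with torch.no_grad():
            r_vgg = self.vgg(images)                  # r_VGG, shape (batch, d_v)
        return self.visual_fc(r_vgg)                  # r_v, shape (batch, d_m)
```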
In the application, the user feature extractor extracts the user features of the user information by adopting a method of combining manual features and a deep learning model. The manual characteristics of the user information d.user are shown in table 1.
The user feature extractor includes a fully connected layer.
The specific method by which the user feature extractor extracts the user feature r_u from the user information is as follows.
The user information d.user is encoded through the manual features to obtain the manual feature vector r_raw ∈ R^(d_u), where d_u is the manual feature dimension.
The encoded manual feature vector r_raw is input into the fully connected layer of the user feature extractor, and the user feature r_u is obtained by the following calculation:
r_u = W_uf · r_raw
where W_uf ∈ R^(d_m×d_u) is the weight matrix of the fully connected layer of the user feature extractor, and r_u ∈ R^(d_m).
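The user branch reduces to a single linear projection of the hand-crafted feature vector. The sketch below assumes d_u = 10 illustrative profile statistics, since Table 1 is not reproduced in this text; the example feature names in the comment are hypothetical.

```python
import torch
import torch.nn as nn

class UserFeatureExtractor(nn.Module):
    """One fully connected layer over hand-crafted user statistics: r_u = W_uf · r_raw."""
    def __init__(self, d_u=10, d_m=256):              # d_u: number of hand-crafted user features
        super().__init__()
        self.fc = nn.Linear(d_u, d_m, bias=False)     # weight matrix W_uf

    def forward(self, r_raw):                         # r_raw: (batch, d_u) encoded user profile d.user
        return self.fc(r_raw)                         # r_u, shape (batch, d_m)

# Hypothetical hand-crafted features: follower count, friend count, verified flag,
# account age, historical post count, etc.
r_u = UserFeatureExtractor()(torch.randn(2, 10))
```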
for step S2: the cross-modal relationship extractor receives a text feature r t Image feature r v And user characteristics r u Establishing the association among the modalities, and aligning the text characteristics r according to the association among the modalities t Image feature r v And user characteristics r u Updating to obtain enhanced text features u t Enhancing image features u v And enhanced user features u u
As shown in fig. 5, the cross-modal relationship extractor of the present application comprises three fully connected layers and one crossstacking function module.
The specific method by which the cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u to obtain the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u is as follows.
The text feature r_t, the image feature r_v and the user feature r_u are combined into the multi-modal feature R = [r_t, r_v, r_u]^T.
The multi-modal feature R is input into the three fully connected layers of the cross-modal relationship extractor, and the key feature matrix K_R, the query feature matrix Q_R and the value feature matrix V_R are generated respectively by the following calculation:
K_R = R·W_K,  Q_R = R·W_Q,  V_R = R·W_V
where W_K, W_Q and W_V are the parameter matrices of the three fully connected layers respectively. The K_R matrix is used to match the features of the other modalities, the Q_R matrix waits to be matched by the features of the other modalities, and the V_R matrix serves as the values to be summed.
Then the cross-modal relationships between the modalities are established: the similarity matrix between different modalities is calculated from the key feature matrix K_R and the query feature matrix Q_R by a scaled dot product, and the calculation process is as follows:
U = softmax(Q_R·K_R^T / √d_m)·V_R
where d_m is the dimension of W_K and plays a scaling role. For simplicity of derivation, if the softmax and scaling functions in the above equation are omitted, the equation can be expanded as follows:
u_t = (q_t·k_t)·v_t + (q_t·k_v)·v_v + (q_t·k_u)·v_u
Taking the text feature r_t as an example: the text feature r_t first computes, through its query vector q_t, the similarity between the text feature r_t and the features of all modalities; the similarities are then used as weights for a weighted sum; and finally the updated enhanced text feature u_t is obtained. The updating process of the other modalities is similar; in this process, the cross-modal relationships are established.
Finally, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u are obtained, and the result can be expressed as U = [u_t, u_v, u_u]^T, where u_t, u_v, u_u ∈ R^(d_m).
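Read this way, the cross-modal relationship extractor amounts to one scaled dot-product attention layer over the three modality vectors. A minimal sketch under that reading is given below; the class name and the dimension d_m = 256 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalRelationExtractor(nn.Module):
    """Scaled dot-product attention across the three modality vectors:
    U = softmax(Q_R · K_R^T / sqrt(d_m)) · V_R."""
    def __init__(self, d_m=256):
        super().__init__()
        self.W_K = nn.Linear(d_m, d_m, bias=False)    # parameter matrix W_K
        self.W_Q = nn.Linear(d_m, d_m, bias=False)    # parameter matrix W_Q
        self.W_V = nn.Linear(d_m, d_m, bias=False)    # parameter matrix W_V
        self.d_m = d_m

    def forward(self, r_t, r_v, r_u):
        R = torch.stack([r_t, r_v, r_u], dim=1)       # multi-modal feature R, (batch, 3, d_m)
        K, Q, V = self.W_K(R), self.W_Q(R), self.W_V(R)
        scores = Q @ K.transpose(-2, -1) / self.d_m ** 0.5   # modality-to-modality similarity
        U = F.softmax(scores, dim=-1) @ V             # enhanced features, (batch, 3, d_m)
        u_t, u_v, u_u = U[:, 0], U[:, 1], U[:, 2]
        return u_t, u_v, u_u
```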
For step S3: the multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains the multi-modal fusion feature a^N after multiple iterations.
As shown in FIG. 6, the specific method by which the multi-modal feature fuser calculates the multi-modal fusion feature a^N is as follows.
The text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u are input into six fully connected layers respectively, and six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u are obtained by the following calculation:
r̂_t = W_1·r_t,  r̂_v = W_2·r_v,  r̂_u = W_3·r_u,  û_t = W_4·u_t,  û_v = W_5·u_v,  û_u = W_6·u_u
where W_1, W_2, W_3, W_4, W_5 and W_6 are the parameter matrices of the six fully connected layers respectively. The multi-modal feature fuser receives the six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and finally obtains the multi-modal fusion feature a^N after multiple iterations.
The dynamic routing mechanism of the multi-modal feature fuser of the present application adopts the dynamic routing method proposed by Sabour et al.
Dynamic routing is a mechanism by which capsules output vectors; a powerful dynamic routing mechanism can ensure that the output of a capsule is sent to the appropriate parent capsule in the layer above.
In a fully connected neural network, a neuron can be calculated by the following formula:
y_j = f( Σ_i W_ij·x_i )
where the parameter matrix W_ij is trained with a back-propagation algorithm through a global loss function. Iterative dynamic routing provides an alternative that decides how a capsule is activated by using the properties of local features. Such a method allows the inputs to be combined into a parse tree in a better and simpler way with lower risk. In dynamic routing, the output is routed to all possible parent nodes, but is scaled down by coupling coefficients that sum to 1. For each possible parent node, a "prediction vector" is computed in each iteration round by multiplying by a weight matrix. If the prediction vector has a large scalar product with the output of a possible parent node, there is top-down feedback that increases the coupling coefficient of that node and decreases the coupling coefficients of the other nodes. This kind of "routing by agreement" can be far more effective than the primitive form of routing implemented by max-pooling.
Specifically, one iteration of the dynamic routing of the present application proceeds as follows:
c_i = softmax(b_i),  a^r = squash( Σ_i c_i·j_i ),  b_i ← b_i + j_i·a^r
where N is the number of iteration rounds, b_i is the routing logit of the input feature vector j_i, and the "squash" function shrinks short vectors to a length close to zero and long vectors to a length slightly less than 1 without changing the vector direction. In each iteration round of the dynamic routing, the prediction vector a^r is obtained by the weighted sum and the "squash" function; if an input feature vector j_i has a larger dot product with the prediction vector a^r (i.e., is more similar to it), the next iteration round increases the coupling coefficient c_i of that feature vector and decreases the coupling coefficients of the other feature vectors. Eventually, the output of the dynamic routing tends toward the most prominent modal features while the other modal features are also fused in. The result after the N rounds of dynamic routing iterations is the feature fusion result a^N ∈ R^(d_m).
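A minimal sketch of this routing-by-agreement fusion is given below. It assumes the six projected feature vectors have already been stacked into a tensor J of shape (batch, 6, d_m); the function names squash and dynamic_fusion and the default of three iterations are illustrative, not prescribed by the disclosure.

```python
import torch
import torch.nn.functional as F

def squash(v, eps=1e-8):
    """Shrink short vectors toward zero length and long vectors toward unit length."""
    n2 = (v ** 2).sum(dim=-1, keepdim=True)
    return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

def dynamic_fusion(J, n_iters=3):
    """Fuse the six projected modality vectors J: (batch, 6, d_m) by routing-by-agreement.
    Returns the multi-modal fusion feature a^N of shape (batch, d_m)."""
    b = torch.zeros(J.shape[:2], device=J.device)      # routing logits b_i, one per input vector
    for _ in range(n_iters):                           # N iteration rounds
        c = F.softmax(b, dim=1)                        # coupling coefficients c_i (sum to 1)
        a = squash((c.unsqueeze(-1) * J).sum(dim=1))   # prediction vector a^r
        b = b + (J * a.unsqueeze(1)).sum(dim=-1)       # agreement: raise c_i of similar inputs
    return a

# Hypothetical usage: a_N = dynamic_fusion(torch.randn(2, 6, 256), n_iters=3)
```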
For step S4: the classifier receives the multi-modal fusion feature a^N and outputs the prediction result.
Further, the fusion feature a^N is input into the classifier to obtain the prediction result of the post d. The classifier outputs the prediction probability ŷ that the post is a rumor by the following calculation:
ŷ = sigmoid( W_p2·LeakyReLU( W_p1·a^N + b_p1 ) + b_p2 )
where W_p1 and W_p2 are learnable parameter matrices, d_p is the dimension of the classifier hidden layer, b_p1 and b_p2 are bias terms, and sigmoid and LeakyReLU are activation functions.
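The classifier itself is a two-layer perceptron with a LeakyReLU hidden activation and a sigmoid output; a minimal sketch follows, with the dimension d_p = 64 chosen only for illustration.

```python
import torch
import torch.nn as nn

class RumorClassifier(nn.Module):
    """Two fully connected layers: y_hat = sigmoid(W_p2 · LeakyReLU(W_p1 · a^N + b_p1) + b_p2)."""
    def __init__(self, d_m=256, d_p=64):
        super().__init__()
        self.fc1 = nn.Linear(d_m, d_p)    # W_p1, b_p1
        self.fc2 = nn.Linear(d_p, 1)      # W_p2, b_p2
        self.act = nn.LeakyReLU()

    def forward(self, a_N):               # a_N: (batch, d_m) multi-modal fusion feature
        return torch.sigmoid(self.fc2(self.act(self.fc1(a_N)))).squeeze(-1)
```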
Further, the model is optimized by minimizing the cross entropy; the loss function is defined as:
L(Θ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
where Θ represents all learnable parameters of the whole neural network, and y represents the label of the information to be detected, a rumor being 1 and a non-rumor being 0.
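Training then reduces to minimizing this binary cross entropy over labelled posts. The sketch below shows one optimization step; the use of the Adam optimizer is an assumption, as the disclosure does not name an optimizer.

```python
import torch
import torch.nn.functional as F

def training_step(y_hat, y, optimizer):
    """One optimization step minimizing L(Theta) = -[ y·log(y_hat) + (1 - y)·log(1 - y_hat) ]."""
    loss = F.binary_cross_entropy(y_hat, y.float())
    optimizer.zero_grad()
    loss.backward()                       # backpropagation through all learnable parameters Theta
    optimizer.step()
    return loss.item()

# Hypothetical usage, given a combined DFCM model object:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = training_step(model(post_batch), labels, optimizer)
```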
The microblog data sets were tested by different prediction models, and the results are shown in table 2. The accuracy, precision, recall and F1 score of the model DFCM proposed herein are 86.42%, 84.26%, 90.61% and 87.32% respectively, which are superior to other models.
Table 2 experimental results of different models on microblog data sets
(Table 2 is provided as an image in the original publication.)
The twitter data sets were subjected to experiments with different predictive models and the results are shown in table 3. The accuracy, precision, recall and F1 score of the model DFCM proposed herein are 88.64%, 91.93%, 91.01% and 91.47%, respectively, which are also superior to other models.
Table 3 experimental results of different models on twitter data sets
(Table 3 is provided as an image in the original publication.)
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalents or equivalent changes fall within the protection scope of the present invention.

Claims (10)

1. An information detection method based on modal dynamic feature fusion and cross-modal relationship extraction, characterized by comprising the following steps:
a multi-modal feature extractor receives information to be detected comprising text information, image information and user information, and extracts a text feature r_t, an image feature r_v and a user feature r_u from the text information, the image information and the user information respectively;
a cross-modal relationship extractor receives the text feature r_t, the image feature r_v and the user feature r_u, establishes the associations between the modalities, and updates the text feature r_t, the image feature r_v and the user feature r_u according to the associations between the modalities to obtain an enhanced text feature u_t, an enhanced image feature u_v and an enhanced user feature u_u;
a multi-modal feature fuser receives the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and obtains a multi-modal fusion feature a^N after multiple iterations;
a classifier receives the multi-modal fusion feature a^N and outputs the prediction result.
2. The method for detecting information based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 1,
the multi-modal feature extractor includes a text feature extractor, an image feature extractor, and a user feature extractor.
3. The method for detecting information based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 2,
the text feature extractor comprises a BERT model and a fully connected layer;
the specific method by which the text feature extractor extracts the text feature r_t from the text information is:
performing one-of-N encoding and padding on the text information, expanding the sentence length to 512 to obtain an encoding vector E, and inputting the encoding vector E into the BERT model to obtain an output matrix B = [b_[CLS], b_1, ..., b_n, b_|text|, ..., b_510]^T, where b_[CLS] represents all semantic information in the text information;
inputting b_[CLS] into the fully connected layer of the text feature extractor and obtaining the text feature r_t by the following calculation:
r_t = W_tf·b_[CLS]
where W_tf represents the weight matrix of the fully connected layer of the text feature extractor.
4. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 3, characterized in that
the image feature extractor comprises a VGG19 network and a fully connected layer;
the specific method by which the image feature extractor extracts the image feature r_v from the image information is:
inputting the image information into the VGG19 network to obtain an image feature representation r_VGG;
inputting the image feature representation r_VGG into the fully connected layer of the image feature extractor and obtaining the image feature r_v by the following calculation:
r_v = W_vf·r_VGG
where W_vf represents the weight matrix of the fully connected layer of the image feature extractor.
5. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 4, characterized in that
the user feature extractor extracts the user features of the user information by adopting a method of combining manual features and a deep learning model.
6. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 5, characterized in that
the user feature extractor comprises a fully connected layer;
the specific method by which the user feature extractor extracts the user feature r_u from the user information is:
encoding the user information through the manual features to obtain a manual feature vector r_raw;
inputting the encoded manual feature vector r_raw into the fully connected layer of the user feature extractor and obtaining the user feature r_u by the following calculation:
r_u = W_uf·r_raw
where W_uf is the weight matrix of the fully connected layer of the user feature extractor.
7. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 6, characterized in that
the cross-modal relationship extractor comprises three fully connected layers and a cross-modal function module;
the specific method by which the cross-modal relationship extractor updates the text feature r_t, the image feature r_v and the user feature r_u to obtain the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u is:
combining the text feature r_t, the image feature r_v and the user feature r_u into a multi-modal feature R = [r_t, r_v, r_u]^T;
inputting the multi-modal feature R into the three fully connected layers of the cross-modal relationship extractor, and generating a key feature matrix K_R, a query feature matrix Q_R and a value feature matrix V_R respectively by the following calculation:
K_R = R·W_K,  Q_R = R·W_Q,  V_R = R·W_V
where W_K, W_Q and W_V are the parameter matrices of the three fully connected layers respectively;
and calculating the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u by the following formula:
U = [u_t, u_v, u_u]^T = softmax(Q_R·K_R^T / √d_m)·V_R
8. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 7, characterized in that
the specific method by which the multi-modal feature fuser calculates the multi-modal fusion feature a^N is:
inputting the text feature r_t, the image feature r_v, the user feature r_u, the enhanced text feature u_t, the enhanced image feature u_v and the enhanced user feature u_u into six fully connected layers respectively, and obtaining six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u by the following calculation:
r̂_t = W_1·r_t,  r̂_v = W_2·r_v,  r̂_u = W_3·r_u,  û_t = W_4·u_t,  û_v = W_5·u_v,  û_u = W_6·u_u
where W_1, W_2, W_3, W_4, W_5 and W_6 are the parameter matrices of the six fully connected layers respectively;
the multi-modal feature fuser receives the six feature vectors r̂_t, r̂_v, r̂_u, û_t, û_v and û_u, dynamically allocates the weight coefficient of each modal feature through a dynamic routing mechanism, and finally obtains the multi-modal fusion feature a^N after multiple iterations.
9. The method for information detection based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 8,
the specific method by which the classifier receives the multi-modal fusion feature a^N and outputs the prediction result is:
the classifier obtains the prediction probability ŷ by the following calculation:
ŷ = sigmoid( W_p2·LeakyReLU( W_p1·a^N + b_p1 ) + b_p2 )
where W_p1 and W_p2 are learnable parameter matrices, b_p1 and b_p2 are bias terms, and sigmoid and LeakyReLU are activation functions.
10. The information detection method based on modal dynamic feature fusion and cross-modal relationship extraction according to claim 9, characterized in that
the model is optimized by minimizing the cross entropy, and the loss function is defined as:
L(Θ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
where Θ represents all learnable parameters of the whole neural network, and y represents the label of the information to be detected, a rumor being 1 and a non-rumor being 0.
CN202210974704.3A 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction Pending CN115563573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210974704.3A CN115563573A (en) 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210974704.3A CN115563573A (en) 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Publications (1)

Publication Number Publication Date
CN115563573A true CN115563573A (en) 2023-01-03

Family

ID=84739538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210974704.3A Pending CN115563573A (en) 2022-08-15 2022-08-15 Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction

Country Status (1)

Country Link
CN (1) CN115563573A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876941A (en) * 2024-03-08 2024-04-12 杭州阿里云飞天信息技术有限公司 Target multi-mode model system, construction method, video processing model training method and video processing method


Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN111985245B (en) Relationship extraction method and system based on attention cycle gating graph convolution network
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
CN110222140B (en) Cross-modal retrieval method based on counterstudy and asymmetric hash
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN111897913B (en) Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
CN111309971B (en) Multi-level coding-based text-to-video cross-modal retrieval method
WO2020107878A1 (en) Method and apparatus for generating text summary, computer device and storage medium
CN111274398B (en) Method and system for analyzing comment emotion of aspect-level user product
WO2023280065A1 (en) Image reconstruction method and apparatus for cross-modal communication system
CN111061843A (en) Knowledge graph guided false news detection method
CN111274375B (en) Multi-turn dialogue method and system based on bidirectional GRU network
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN111444367A (en) Image title generation method based on global and local attention mechanism
CN108256968A (en) A kind of electric business platform commodity comment of experts generation method
CN113806609A (en) Multi-modal emotion analysis method based on MIT and FSM
Mao et al. Chinese sign language recognition with sequence to sequence learning
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN112487200A (en) Improved deep recommendation method containing multi-side information and multi-task learning
CN114942998B (en) Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data
CN115563573A (en) Information detection method based on modal dynamic feature fusion and cross-modal relationship extraction
JP7181999B2 (en) SEARCH METHOD AND SEARCH DEVICE, STORAGE MEDIUM
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination