CN117521012A - False information detection method based on multi-mode context hierarchical step alignment - Google Patents

False information detection method based on multi-mode context hierarchical step alignment

Info

Publication number
CN117521012A
CN117521012A (Application CN202311569509.3A)
Authority
CN
China
Prior art keywords
text
image
feature
vocabulary
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311569509.3A
Other languages
Chinese (zh)
Inventor
潘祯祥
毛莺池
熊力
陈秉睿
曹一凡
戚荣志
禹跃美
贾璐瑶
祖立辉
吴波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Huaneng Lancang River Hydropower Co Ltd
Original Assignee
Hohai University HHU
Huaneng Lancang River Hydropower Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU, Huaneng Lancang River Hydropower Co Ltd filed Critical Hohai University HHU
Priority to CN202311569509.3A priority Critical patent/CN117521012A/en
Publication of CN117521012A publication Critical patent/CN117521012A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a false information detection method and system based on multi-modal context hierarchical step-by-step alignment. The method acquires the images and text in a social media blog; establishes a local alignment module with bidirectional cross-modal attention to infer segment-level matching relations; establishes a global context fusion entity feature extraction module that uses a multi-head self-attention mechanism to help local objects learn more contextual semantics, enhancing the contextual representations of image and text features; designs an adaptive weight filtering module to align image-text entity features, integrating all weighted features according to the similarity of the two modalities' enhanced features to realize global matching and suppress overall semantic deviation; and inputs the overall similarity of the fused entity features at the two different levels into a binary classifier to obtain the false information detection result. By integrating context information, the invention preserves the complete semantic information of text and images and improves the accuracy of false information detection.

Description

False information detection method based on multi-mode context hierarchical step alignment
Technical Field
The invention relates to a false information detection method and system based on multi-modal context hierarchical step-by-step alignment, in particular to a method and system for detecting the matching of image and text information in social media blogs, and belongs to the technical field of false news detection.
Background
With the growing influence of social media, many bloggers resort to exaggerated wording to increase the spread of their blogs. However, such blogs are often filled with misleading information. The prior art detects false information in social media using various techniques, such as keyword detection, user behavior analysis, and machine learning models, but these methods have certain drawbacks. First, keyword detection may produce false alarms or missed detections because contexts differ; user behavior analysis is limited by data acquisition and constrained by privacy protection policies; and although machine learning models have proven effective in recent years, traditional machine learning methods tend not to handle fine inconsistencies between images and text well, which limits their ability to detect fine-grained deceptive information.
For example, one sample from the PHEME dataset shows a social media blog that distorts factual information. In this example, the image shows a scene of "a woman holding a child", while the accompanying text reads "this woman has borne 14 children". There is a clear inconsistency between the two forms of information in how they present the same subject, "child", and from the exaggerated textual description it can be inferred that the blog contains false information. Performing fine image-text context matching is therefore critical: it ensures proper alignment of information and thus improves the accuracy of false information identification.
Disclosure of Invention
The invention aims to: address two problems of existing fine-grained matching methods, namely that certain information words in the text cannot strictly correspond to a certain region in the image, and that by ignoring context information the matching network over-emphasizes fine-grained alignment, is misled by its overall understanding of the image, and neglects the alignment of contexts. The invention therefore provides a false information detection method and system based on multi-modal context hierarchical step-by-step alignment. First, the image and text features in a social media blog are extracted independently: image region features by a Faster R-CNN network based on ResNet-101, and text vocabulary features by a Bi-GRU network. Second, a bidirectional cross-attention mechanism computes the similarities between image regions and text words so that they correspond strictly, yielding a local segment-level matching result between visual regions and text vocabulary. Then, to prevent the matching network from over-emphasizing fine granularity and ignoring context alignment, the basic image and text features are enhanced with a multi-head self-attention mechanism that incorporates context information, the enhanced image and text features are obtained, and a global context matching result between image and text is produced by an adaptive weight filtering module. Finally, a fully connected layer with a softmax function sums the result vectors of the local and global stages and classifies them into one of two results: real information or false information.
The technical scheme is as follows: a false information detection method based on multi-mode context hierarchical step-by-step alignment comprises the following steps:
(1) Extracting basic features of images and texts, and respectively extracting features of images and texts in social media blogs by using a target detection model and a language characterization model to respectively obtain basic feature vectors of visual areas and text words;
(2) Extracting image and text weighted features: for the image and text basic feature vectors extracted in step (1), corresponding segments in different modalities are aligned using a bidirectional cross-attention mechanism; then, taking a text vocabulary as a reference, the weighted sum of the image region basic feature vectors describing that vocabulary is obtained as the vocabulary's weighted feature vector; similarly, taking an image region as a reference, the weighted sum of the text vocabulary basic feature vectors describing that region is obtained as the region's weighted feature vector;
(3) Image-text feature local matching, namely matching the image and text basic feature vector extracted in the step (1) with the image and text weighted feature vector extracted in the step (2) by utilizing similarity to obtain a local fragment level matching result;
(4) Extracting image and text enhancement features, namely exploring the context relation among features by utilizing a multi-head self-attention mechanism for the basic feature vectors of the image and the text extracted in the step (1) to respectively obtain the image and the text enhancement feature vectors fused with the context information;
(5) Matching the image-text enhancement features, and obtaining a global correlation matching result by utilizing a self-adaptive weight filtering module for the image and text enhancement features extracted in the step (4);
(6) And (3) detecting social media blogs, namely inputting the local segment level matching result obtained in the step (3) and the global relevance matching result obtained in the step (5) into a binary classifier, projecting the sum vector of two levels of result vectors to two target spaces of real information and false information by using a full-connection layer with a softmax function in the binary classifier, and obtaining the detection result of the social media blogs.
In step (1), an image X and a text Y are defined; first, the basic features of image regions are extracted using a Faster R-CNN based on ResNet-101, and then the basic features of the text vocabulary are extracted using a Bi-GRU network.
Further, in the step (1), the specific steps of extracting the image region feature and the text vocabulary feature are as follows:
(1.1) Extraction of basic features of image regions: for an image X, Faster R-CNN is selected to extract image region features, and the top m regions are selected to represent X according to their region scores. A feature vector f_i, i∈[1,m], is then extracted for each image region using a pre-trained ResNet-101. Finally, a fully connected layer converts each feature vector into a 1024-dimensional feature vector, finally expressed as a set of region features {x_1, x_2, ..., x_m}, X ∈ R^(d×m), as the basic features of the image regions. As shown in formula (1):
x_i = W_x f_i + b_x (1)
(1.2) Extraction of basic features of the text vocabulary: for a sentence of text Y, each vocabulary is first encoded into a one-hot vector g_t, t∈[1,n], and embedded into a 300-dimensional vector o_t through a parameter matrix W_g and an offset vector b_g, as shown in formula (2):
o_t = W_g g_t + b_g (2)
next, bi-GRU is used to integrate context information in sentences captured from both front and back directions into text embedding, and then the final word feature y is obtained by averaging t As basic characteristics of text vocabulary, the following formula (3) is shown:
in the step (2), the basic feature vector of the image area extracted in the step (1.1) and the basic feature vector of the text vocabulary extracted in the step (1.2) are utilized to obtain multi-mode weighted feature vectors of the image area and the text vocabulary by a bidirectional cross attention mechanism.
Further, in the step (2), the specific steps of extracting the weighted feature vectors of the image area and the text vocabulary are as follows:
(2.1) Calculating the similarity matrix of image regions and text vocabulary, as shown in formula (4):
A = (W_x X)(W_y Y)^T (4)
where W_x and W_y are learnable parameters, X = {x_1, x_2, ..., x_m} and Y = {y_1, y_2, ..., y_n} are the image region basic features and the text vocabulary basic features extracted in step (1.1) and step (1.2) respectively, A ∈ R^(m×n) is the image region-text vocabulary similarity matrix, and A_it denotes the semantic similarity of the i-th region and the t-th vocabulary.
(2.2) Image region weighted feature extraction: taking a given image region as a reference, weights are assigned to the n vocabularies, and the weighted combination of the n vocabularies then yields the text vocabulary feature y_i* corresponding to image region x_i, as shown in formula (5):
y_i* = Σ_{t=1}^{n} α_it y_t, with α_it = exp(λA_it) / Σ_{k=1}^{n} exp(λA_ik) (5)
where λ is the temperature parameter of the softmax function.
(2.3) Text vocabulary weighted feature extraction: similarly, taking a given text vocabulary as a reference, the image region feature x_t* corresponding to text vocabulary y_t is obtained, as shown in formula (6):
x_t* = Σ_{i=1}^{m} α_ti' x_i, with α_ti' = exp(λA_it) / Σ_{k=1}^{m} exp(λA_kt) (6)
in the step (3), the image and text basic feature vectors extracted in the step (1) and the image and text weighted feature vectors extracted in the step (2) are subjected to segment level matching, and the segment level matching scores of the complete image and the complete text are obtained on the basis of the segment level matching scores.
Further, in the step (3), the specific steps of calculating the fragment level similarity are as follows:
(3.1) Calculating the region-related segment-level matching score: for image region x_i and the corresponding text vocabulary feature y_i* extracted in (2.2), the similarity S_x(x_i, y_i*) is computed as shown in formula (7):
S_x(x_i, y_i*) = x_i^T y_i* / (||x_i|| ||y_i*||) (7)
(3.2) Calculating the vocabulary-related segment-level matching score: similarly, for text vocabulary y_t and the corresponding image region feature x_t* extracted in (2.3), the similarity S_y(y_t, x_t*) is computed as shown in formula (8):
S_y(y_t, x_t*) = y_t^T x_t* / (||y_t|| ||x_t*||) (8)
(3.3) Calculating the segment-level matching score S_segment(X, Y) of the complete image and the complete text, as shown in formula (9):
S_segment(X, Y) = η · (1/m) Σ_{i=1}^{m} S_x(x_i, y_i*) + (1-η) · (1/n) Σ_{t=1}^{n} S_y(y_t, x_t*) (9)
where η is a hyperparameter for balancing the contributions of the region-related segment-level matching scores S_x(x_i, y_i*) and the vocabulary-related segment-level matching scores S_y(y_t, x_t*).
In step (4), a multi-head self-attention mechanism is used to explore the contextual relations among the basic feature vectors of the image and the text extracted in step (1), and the image and text enhanced feature vectors fused with context information are obtained. The specific steps are as follows:
(4.1) Image enhancement feature extraction: for the set of region features X = {x_1, x_2, ..., x_m} extracted from an image X in step (1.1), a multi-head self-attention mechanism is used to explore the contextual relations between region features. Specifically, the query Q_X, key K_X and value Val_X are first defined, as shown in formula (10):
Q_X = X W^Q, K_X = X W^K, Val_X = X W^Val (10)
where W^Q, W^K, W^Val are learnable feature mapping matrices. A weighted sum of the values Val_X is then computed by formula (11):
Attention(Q_X, K_X, Val_X) = softmax(Q_X K_X^T / √d_k) Val_X (11)
where d_k is a scale factor. The weight matrix Q_X K_X^T encodes, through inner products, the relation of each visual feature to all other features. The relations between the different objects are then encoded with formula (12):
R_j = Attention(Q_X W_j^(Q_X), K_X W_j^(K_X), Val_X W_j^(Val_X)) (12)
where j denotes the j-th head and each W in (12) is a learnable feature mapping matrix: the superscript Q_X indicates that the matrix serves the image query (Q for query, X for the image), and the subscript j indicates the j-th head. Thus W_j^(Q_X) is the query weight matrix dedicated to processing the image input in the j-th head of the multi-head attention mechanism; similarly, W_j^(K_X) and W_j^(Val_X) are the key weight matrix and the value weight matrix dedicated to processing the image input in the j-th head.
Finally, the i-th original visual region feature x_i is converted into the enhanced feature x_ip by combining the global objects with the locally guided objects, as shown in formula (13):
x_ip = Concat(R_1, R_2, ..., R_h) W_p + x_i (13)
where W_p is a learnable weight matrix and h is the number of heads. For an image, the enhancement of the set of region features X = {x_1, x_2, ..., x_m} can then be represented as X_reinforcement = {x_1p, x_2p, ..., x_mp}.
(4.2) Text enhancement feature extraction: similarly, for the text features Y = {y_1, y_2, ..., y_n} extracted from a piece of text Y in step (1.2), the same procedure as in step (4.1) is performed to obtain the text enhancement features Y_reinforcement = {y_1p, y_2p, ..., y_np}.
(4.3) Weighted enhancement feature extraction: similarly, for the weighted features extracted in step (2), X* = {x_1*, x_2*, ..., x_n*} and Y* = {y_1*, y_2*, ..., y_m*}, the same procedure as in step (4.1) is performed to obtain the enhanced weighted features, denoted X_reinforcement* = {x_1p*, x_2p*, ..., x_np*} and Y_reinforcement* = {y_1p*, y_2p*, ..., y_mp*} respectively.
In the step (5), the image and text enhancement features extracted in the step (4) are filtered by an adaptive weight filtering module to obtain a global correlation matching result, and the specific steps are as follows:
(5.1) Calculating the image-to-text global matching score: the enhanced basic feature pair (X_reinforcement, Y_reinforcement) and the enhanced weighted feature pair (Y_reinforcement*, X_reinforcement*) obtained in step (4) are taken as input; global semantics are generated by weighting and fusing the region features, and the global semantic similarity between the two modalities is then computed to obtain the image-to-text global matching score s_x→y, as shown in formula (14):
s_x→y = cos( Σ_{l=1}^{m} w_x x_lp , Σ_{l=1}^{m} w_x y_lp* ) (14)
where w_x = cos(x_lp, y_lp*) is the adaptive weight, representing the importance of the enhanced basic feature x_lp and the enhanced weighted feature y_lp*.
(5.2) Calculating the text-to-image global matching score: similarly, with the enhanced basic feature pair (X_reinforcement, Y_reinforcement) and the enhanced weighted feature pair (Y_reinforcement*, X_reinforcement*) as input, global semantics are generated by weighting and fusing the features, and the global semantic similarity between the two modalities is computed to obtain the text-to-image global matching score s_y→x, as shown in formula (15):
s_y→x = cos( Σ_{l=1}^{n} w_y y_lp , Σ_{l=1}^{n} w_y x_lp* ) (15)
where, similar to (5.1), w_y = cos(y_lp, x_lp*) is the adaptive weight, representing the importance of the enhanced basic feature y_lp and the enhanced weighted feature x_lp*.
(5.3) Calculating the global matching score S_global(X, Y) of the complete image and the complete text, as shown in formula (16):
S_global(X, Y) = s_x→y + s_y→x (16)
in the step (6), the local segment level matching result obtained in the step (3) and the global matching result obtained in the step (5) are input into a binary classifier, and the sum vector of the result vectors of two levels is projected to two target spaces of real information and false information by using a full-connection layer with a softmax function in the binary classifier, so that a detection result of the social media blog is obtained, wherein the specific steps are as follows:
(6.1) After obtaining the local region-vocabulary matching result and the global matching result, a fully connected layer with a softmax function is used to project S_segment(X, Y) and S_global(X, Y) into a target space of only two categories (real or fake) and obtain the probability distribution, as shown in formula (17):
p = softmax( W (S_segment(X, Y) + S_global(X, Y)) + b ) (17)
where p = [p_0, p_1] is the predicted probability vector, p_0 and p_1 are the probabilities that the prediction is 0 (real) and 1 (false) respectively, W is the weight matrix, and b is the bias term.
(6.2) For each blog, the goal is to minimize a binary cross-entropy loss function to distinguish whether the blog is real information or false information. The loss function is shown in formula (18):
L_p = -[(1-y) ln p_0 + y ln p_1] (18)
where y ∈ {0,1} is the authenticity label of each blog: y = 0 indicates that the blog is authentic, and y = 1 indicates that the blog contains false information.
A false information detection system based on multi-mode context layering step-by-step alignment comprises an image region feature extraction module, a text vocabulary feature extraction module, a local alignment module, a global context alignment module and a social media blog detection module;
the image region feature extraction module firstly detects a significant region by using Faster R-CNN, then extracts a feature vector for each image region by using pretrained ResNet-101, and finally converts the feature vector into a feature vector by using a full connection layer to obtain an image region feature vector;
the text vocabulary feature extraction module firstly codes each vocabulary into one-hot vectors, and embeds the one-hot vectors into the vector o through parameter matrixes and offset vectors t And then the Bi-GRU is used for capturing context information in sentences from front and back directions and integrating the context information into text embedding, so that text vocabulary feature vectors are obtained.
The local alignment module firstly obtains a segment level cross-modal weighted feature vector through a bidirectional cross-attention mechanism, then calculates to obtain a segment level matching score related to a region and a segment level matching score related to a vocabulary, and finally averages all segment level matching scores based on the region and segment level matching scores based on the vocabulary to obtain a final segment level matching score of a complete image and a text.
The global context alignment module firstly acquires the context relation between fragment level features by using a multi-head self-attention mechanism, acquires an enhanced feature vector fused with context information, then calculates global matching scores of two directions from image to text and text to image respectively, and finally takes the sum of the global matching scores of the two directions as the global matching score of the complete image and the text.
The social media blog detection module inputs the segment-level similarity and the global similarity between the images and the texts into a full-connection layer with a softmax function to obtain a detection result that the social media blog is real information or false information.
The implementation process of the system is the same as that of the method, and is not repeated.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method of false information detection based on multi-modal context hierarchical step-wise alignment as described above when executing the computer program.
A computer readable storage medium storing a computer program for performing a method of false information detection based on multi-modal context hierarchical step-wise alignment as described above.
Beneficial effects: in false information that exaggerates or manipulates imaginary events, existing models neglect the importance of context information when modeling multi-modal fine-grained correspondences and cannot capture the alignment of different modal contexts. The invention adopts a two-stage strategy to detect false information. The first stage establishes a local alignment module with bidirectional cross-modal attention and infers segment-level matching relations by summing region-vocabulary similarities in different directions. The second stage establishes a global context alignment module that uses a multi-head self-attention mechanism to help local objects learn more contextual semantics, enhancing the visual and textual context representations; an adaptive weight filtering module then realizes global matching by integrating all weighted features according to the similarity of the enhanced features of the two modalities, suppressing overall semantic deviation. Finally, the result vectors of the two different levels are fed into a fully connected classifier with a softmax function, which classifies the blog into one of two results: real information or false information. The false information detection model obtained by the method can effectively produce accurate detection results.
Drawings
FIG. 1 is a block diagram of a method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a bi-directional cross-attention mechanism according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are merely illustrative of the invention and do not limit its scope; after reading the invention, modifications in various equivalent forms by those skilled in the art fall within the scope defined by the claims appended hereto.
As shown in fig. 1, the method for detecting false information based on multi-mode context hierarchical step alignment disclosed in the embodiment of the invention specifically comprises the following steps:
(1) Extracting basic features of images and text: define an image X and a text Y; first extract the basic features of image regions using a Faster R-CNN based on ResNet-101, and then extract the basic features of the text vocabulary using a Bi-GRU network. The specific steps for extracting image region features and text vocabulary features are as follows:
(1.1) Extraction of basic features of image regions: for an image X, Faster R-CNN is selected to extract image region features, and the top m regions are selected to represent X according to their scores. A feature vector f_i, i∈[1,m], is then extracted for each image region using a pre-trained ResNet-101. Finally, a fully connected layer converts each region feature into a 1024-dimensional feature vector, finally expressed as a set of region features {x_1, x_2, ..., x_m}, X ∈ R^(d×m). As shown in formula (1):
x_i = W_x f_i + b_x (1)
(1.2) Extraction of basic features of the text vocabulary: for a sentence of text Y, each vocabulary is first encoded into a one-hot vector g_t, t∈[1,n], and embedded into a 300-dimensional vector o_t through a parameter matrix W_g and an offset vector b_g, as shown in formula (2):
o_t = W_g g_t + b_g (2)
next, bi-GRU is used to integrate context information in sentences captured from both front and back directions into text embedding, and then the final word feature y is obtained by averaging t As shown in formula (3):
(2) Extracting weighted features of images and text: using the image region basic feature vectors extracted in step (1.1) and the text vocabulary basic feature vectors extracted in step (1.2), the multi-modal weighted feature vectors of image regions and text vocabulary are obtained through a bidirectional cross-attention mechanism, which is shown in FIG. 2. The specific steps for extracting the weighted feature vectors of image regions and text vocabulary are as follows:
(2.1) Calculating the similarity matrix of image regions and text vocabulary, as shown in formula (4):
A = (W_x X)(W_y Y)^T (4)
where W_x and W_y are learnable parameters, A ∈ R^(m×n) is the image region-text vocabulary similarity matrix, and A_it denotes the semantic similarity of the i-th region and the t-th vocabulary.
(2.2) Image region weighted feature extraction: taking a given image region as a reference, weights are assigned to the n vocabularies, and the weighted combination of the n vocabularies then yields the text vocabulary feature y_i* corresponding to image region x_i, as shown in formula (5):
y_i* = Σ_{t=1}^{n} α_it y_t, with α_it = exp(λA_it) / Σ_{k=1}^{n} exp(λA_ik) (5)
where λ is the temperature parameter of the softmax function.
(2.3) Text vocabulary weighted feature extraction: similarly, taking a given text vocabulary as a reference, the image region feature x_t* corresponding to text vocabulary y_t is obtained, as shown in formula (6):
x_t* = Σ_{i=1}^{m} α_ti' x_i, with α_ti' = exp(λA_it) / Σ_{k=1}^{m} exp(λA_kt) (6)
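Step (2) can be sketched in a few lines under the same assumptions; the temperature value lam=9.0 and the softmax directions below are illustrative choices, not values fixed by the patent.

```python
# A sketch of the bidirectional cross attention of step (2), formulas (4)-(6).
import torch
import torch.nn.functional as F

def cross_attention(X, Y, W_x, W_y, lam=9.0):
    # X: (m, d) region features; Y: (n, d) word features
    A = (X @ W_x.T) @ (Y @ W_y.T).T            # formula (4): (m, n) similarities
    # formula (5): per-region attended text features y_i*
    Y_star = F.softmax(lam * A, dim=1) @ Y     # (m, d)
    # formula (6): per-word attended image features x_t*
    X_star = F.softmax(lam * A, dim=0).T @ X   # (n, d)
    return Y_star, X_star
```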
(3) Image-text feature local matching: the image and text basic feature vectors extracted in step (1) and the image and text weighted feature vectors extracted in step (2) are subjected to segment-level matching, and on this basis the segment-level matching score of the complete image and the complete text is obtained. The specific steps for calculating the segment-level similarity are as follows:
(3.1) Calculating the region-related segment-level matching score: for image region x_i and the corresponding text vocabulary feature y_i* extracted in (2.2), the similarity S_x(x_i, y_i*) is computed as shown in formula (7):
S_x(x_i, y_i*) = x_i^T y_i* / (||x_i|| ||y_i*||) (7)
(3.2) Calculating the vocabulary-related segment-level matching score: similarly, for text vocabulary y_t and the corresponding image region feature x_t* extracted in (2.3), the similarity S_y(y_t, x_t*) is computed as shown in formula (8):
S_y(y_t, x_t*) = y_t^T x_t* / (||y_t|| ||x_t*||) (8)
(3.3) Calculating the segment-level matching score S_segment(X, Y) of the complete image and the complete text, as shown in formula (9):
S_segment(X, Y) = η · (1/m) Σ_{i=1}^{m} S_x(x_i, y_i*) + (1-η) · (1/n) Σ_{t=1}^{n} S_y(y_t, x_t*) (9)
where η is a hyperparameter for balancing the contributions of the region-related segment-level matching scores S_x(x_i, y_i*) and the vocabulary-related segment-level matching scores S_y(y_t, x_t*).
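The local matching of step (3) then reduces to per-segment cosine similarities and an η-balanced average; eta=0.5 below is an assumed default, not a value fixed by the patent.

```python
# The local matching of step (3), formulas (7)-(9).
import torch.nn.functional as F

def segment_score(X, Y_star, Y, X_star, eta=0.5):
    S_x = F.cosine_similarity(X, Y_star, dim=-1)       # formula (7), (m,)
    S_y = F.cosine_similarity(Y, X_star, dim=-1)       # formula (8), (n,)
    return eta * S_x.mean() + (1 - eta) * S_y.mean()   # formula (9)
```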
(4) Extracting image and text enhancement features: a multi-head self-attention mechanism is used to explore the contextual relations among the basic feature vectors of the image and the text extracted in step (1), and the image and text enhanced feature vectors fused with context information are obtained. The specific steps are as follows:
(4.1) Image enhancement feature extraction: for the set of region features X = {x_1, x_2, ..., x_m} extracted from an image X in step (1.1), a multi-head self-attention mechanism is used to explore the contextual relations between region features. Specifically, the query Q_X, key K_X and value Val_X are first defined, as shown in formula (10):
Q_X = X W^Q, K_X = X W^K, Val_X = X W^Val (10)
where W^Q, W^K, W^Val are learnable feature mapping matrices. A weighted sum of the values Val_X is then computed by formula (11):
Attention(Q_X, K_X, Val_X) = softmax(Q_X K_X^T / √d_k) Val_X (11)
where d_k is a scale factor. The weight matrix Q_X K_X^T encodes, through inner products, the relation of each visual feature to all other features. The relations between the different objects are then encoded with formula (12):
R_j = Attention(Q_X W_j^(Q_X), K_X W_j^(K_X), Val_X W_j^(Val_X)) (12)
where j denotes the j-th head. Finally, the i-th original visual region feature x_i is converted into the enhanced feature x_ip by combining the global objects with the locally guided objects, as shown in formula (13):
x_ip = Concat(R_1, R_2, ..., R_h) W_p + x_i (13)
where W_p is a learnable weight matrix and h is the number of heads. For an image, the enhancement of the set of region features X = {x_1, x_2, ..., x_m} can then be represented as X_reinforcement = {x_1p, x_2p, ..., x_mp}.
(4.2) Text enhancement feature extraction: similarly, for the text features Y = {y_1, y_2, ..., y_n} extracted from a piece of text Y in step (1.2), the same procedure as in step (4.1) is performed to obtain the text enhancement features Y_reinforcement = {y_1p, y_2p, ..., y_np}.
(4.3) Weighted enhancement feature extraction: similarly, for the weighted features extracted in step (2), X* = {x_1*, x_2*, ..., x_n*} and Y* = {y_1*, y_2*, ..., y_m*}, the same procedure as in step (4.1) is performed to obtain the enhanced weighted features, denoted X_reinforcement* = {x_1p*, x_2p*, ..., x_np*} and Y_reinforcement* = {y_1p*, y_2p*, ..., y_mp*} respectively.
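For step (4), a sketch built on torch.nn.MultiheadAttention, whose internal projections stand in for the per-head matrices of formula (12); the head count heads=8 and the extra output projection are assumptions consistent with formula (13).

```python
# Step (4): multi-head self-attention enhancement with residual, formula (13).
import torch
import torch.nn as nn

class ContextEnhancer(nn.Module):
    def __init__(self, d=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.W_p = nn.Linear(d, d, bias=False)   # W_p of formula (13)

    def forward(self, feats):                    # feats: (1, m, d)
        ctx, _ = self.attn(feats, feats, feats)  # formulas (10)-(12)
        return self.W_p(ctx) + feats             # formula (13): enhanced + residual
```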
(5) Image-text enhancement feature matching: a global correlation matching result is obtained from the image and text enhancement features extracted in step (4) using the adaptive weight filtering module. The specific steps are as follows:
(5.1) Calculating the image-to-text global matching score: the enhanced basic feature pair (X_reinforcement, Y_reinforcement) and the enhanced weighted feature pair (Y_reinforcement*, X_reinforcement*) obtained in step (4) are taken as input; global semantics are generated by weighting and fusing the region features, and the global semantic similarity between the two modalities is then computed to obtain the image-to-text global matching score s_x→y, as shown in formula (14):
s_x→y = cos( Σ_{l=1}^{m} w_x x_lp , Σ_{l=1}^{m} w_x y_lp* ) (14)
where w_x = cos(x_lp, y_lp*) is the adaptive weight, representing the importance of the enhanced basic feature x_lp and the enhanced weighted feature y_lp*.
(5.2) Calculating the text-to-image global matching score: similarly, with the enhanced basic feature pair (X_reinforcement, Y_reinforcement) and the enhanced weighted feature pair (Y_reinforcement*, X_reinforcement*) as input, global semantics are generated by weighting and fusing the features, and the global semantic similarity between the two modalities is computed to obtain the text-to-image global matching score s_y→x, as shown in formula (15):
s_y→x = cos( Σ_{l=1}^{n} w_y y_lp , Σ_{l=1}^{n} w_y x_lp* ) (15)
where w_y = cos(y_lp, x_lp*) is the adaptive weight.
(5.3) Calculating the global matching score S_global(X, Y) of the complete image and the complete text, as shown in formula (16):
S_global(X, Y) = s_x→y + s_y→x (16)
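For step (5), a sketch of the adaptive weight filtering; since formulas (14)-(15) are reconstructed from the prose, the cosine-weighted fusion and the clamping of negative weights (read here as the "filtering") are one interpretation rather than the patent's definitive form.

```python
# Step (5): adaptive-weight global matching, formulas (14)-(16) as interpreted.
import torch
import torch.nn.functional as F

def global_score(X_r, Y_r, Y_star_r, X_star_r):
    # image -> text: weight each region by its agreement with its text feature
    w_x = F.cosine_similarity(X_r, Y_star_r, dim=-1).clamp(min=0)      # (m,)
    s_x2y = F.cosine_similarity((w_x.unsqueeze(-1) * X_r).sum(0),
                                (w_x.unsqueeze(-1) * Y_star_r).sum(0), dim=0)
    # text -> image, symmetrically
    w_y = F.cosine_similarity(Y_r, X_star_r, dim=-1).clamp(min=0)      # (n,)
    s_y2x = F.cosine_similarity((w_y.unsqueeze(-1) * Y_r).sum(0),
                                (w_y.unsqueeze(-1) * X_star_r).sum(0), dim=0)
    return s_x2y + s_y2x                                               # formula (16)
```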
(6) Social media blog detection: the local segment-level matching result obtained in step (3) and the global matching result obtained in step (5) are input into a binary classifier; the fully connected layer with a softmax function in the binary classifier projects the sum vector of the two levels' result vectors to the target space of real information and false information, obtaining the detection result of the social media blog. The specific steps are as follows:
(6.1) After obtaining the local region-vocabulary matching result and the global matching result, a fully connected layer with a softmax function is used to project S_segment(X, Y) and S_global(X, Y) into a target space of only two categories (real or fake) and obtain the probability distribution, as shown in formula (17):
p = softmax( W (S_segment(X, Y) + S_global(X, Y)) + b ) (17)
where p = [p_0, p_1] is the predicted probability vector, p_0 and p_1 are the probabilities that the prediction is 0 (real) and 1 (false) respectively, W is the weight matrix, and b is the bias term.
(6.2) For each blog, the goal is to minimize the binary cross-entropy loss function, as shown in formula (18):
L_p = -[(1-y) ln p_0 + y ln p_1] (18)
where y ∈ {0,1} is the authenticity label: y = 0 indicates the blog is real, and y = 1 that it contains false information.
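For step (6), a minimal classifier and loss; treating W of formula (17) as a one-input, two-output linear layer over the summed scores is one reading of "projecting the sum vector", and the loss pairs p_0 with the real label y = 0.

```python
# Step (6): binary classifier over the two-level scores, formulas (17)-(18).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Detector(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 2)                 # W and b of formula (17)

    def forward(self, s_segment, s_global):
        z = (s_segment + s_global).view(1, 1)     # sum of the two-level results
        return F.softmax(self.fc(z), dim=-1)      # p = [p_0, p_1]

def loss_fn(p, y):
    # formula (18): L_p = -[(1-y) ln p_0 + y ln p_1], y=0 real / y=1 false
    return -((1 - y) * torch.log(p[0, 0]) + y * torch.log(p[0, 1]))
```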
A false information detection system based on multi-mode context layering step-by-step alignment comprises an image region feature extraction module, a text vocabulary feature extraction module, a local alignment module, a global context alignment module and a social media blog detection module;
the image region feature extraction module firstly detects a significant region by using a fast R-CNN, then extracts a feature vector for each image region by using a pretrained ResNet-101, and finally converts the feature vector into 1024-dimensional feature vectors by using a full connection layer to obtain the feature vector of the image region;
and the text vocabulary feature extraction module is used for firstly encoding each vocabulary into one-hot vectors, embedding the one-hot vectors into 300-dimensional vectors through parameter matrixes and offset vectors, and integrating context information in sentences captured in front and back directions into text embedding by using Bi-GRU to obtain text vocabulary feature vectors.
The local alignment module firstly obtains a segment level cross-modal weighted feature vector through a bidirectional cross-attention mechanism, then calculates to obtain a segment level matching score related to a region and a segment level matching score related to a vocabulary, and finally averages all segment level matching scores based on the region and segment level matching scores based on the vocabulary to obtain a final segment level matching score of a complete image and a text.
The global context alignment module firstly uses a multi-head self-attention mechanism to acquire the context relation between fragment level features, acquires an enhanced feature vector fused with context information, then calculates global matching scores of two directions from image to text and text to image respectively, and finally takes the sum of the global matching scores of the two directions as the global matching score of the complete image and the text.
And the social media blog detection module is used for inputting the segment level similarity and the global similarity between the images and the texts into a full-connection layer with a softmax function to obtain a detection result that the social media blog is real information or false information.
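The five modules of the system compose end-to-end as in the following sketch, which reuses the illustrative classes and functions from the sketches above; the shapes, vocabulary size, and the sharing of one enhancer across modalities are all assumptions, not details fixed by the patent.

```python
# End-to-end composition of the five modules using the sketches above.
import torch

m, n, d = 36, 20, 1024
region_feats = torch.randn(m, 2048)           # stand-in detector features
token_ids = torch.randint(0, 30000, (n,))     # stand-in tokenized blog text
W_x, W_y = torch.randn(d, d), torch.randn(d, d)

extractor = BasicFeatureExtractor(vocab_size=30000)
enhance = ContextEnhancer()
detector = Detector()

X, Y = extractor(region_feats, token_ids)            # step (1) basic features
Y_star, X_star = cross_attention(X, Y, W_x, W_y)     # step (2) weighted features
s_seg = segment_score(X, Y_star, Y, X_star)          # step (3) local matching
X_r = enhance(X.unsqueeze(0))[0]                     # step (4) enhanced features
Y_r = enhance(Y.unsqueeze(0))[0]
Y_star_r = enhance(Y_star.unsqueeze(0))[0]
X_star_r = enhance(X_star.unsqueeze(0))[0]
s_glob = global_score(X_r, Y_r, Y_star_r, X_star_r)  # step (5) global matching
p = detector(s_seg, s_glob)                          # step (6) [p_real, p_fake]
```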
It will be apparent to those skilled in the art that the steps of the false information detection method based on multi-modal context hierarchical alignment, or the modules of the corresponding system, of the embodiments of the present invention may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of computing devices; alternatively, they may be implemented in program code executable by a computing device, so that they may be stored in a memory device for execution by the computing device. In some cases, the steps shown or described may be performed in a different order than shown or described herein, or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Parameter settings and experimental evaluation criteria are as follows:
A. Parameter settings:
the following are several parameters affecting the present model: iteration round (Epoch), batch Size (Batch Size), net Learning rate (Learning rate), learnable linear projection function hidden layer parameter η.
Table 1 model training parameter settings
B. Evaluation metrics:
the related evaluation indexes are index combinations uniformly used by the existing method, and the method comprises the following steps: accuracy, precision, recall and F1-Score.
According to the above embodiments, for false information that uses exaggeration to fabricate imaginary events, the invention addresses the neglect of context information when modeling multi-modal fine-grained correspondences and the problem that models cannot capture the alignment of different modal contexts, and presents a two-stage detection strategy. The invention uses the local alignment module to learn the complementary information among multiple modalities, which can solve the problem that certain information in the text cannot strictly correspond to an image region. The invention uses the global context alignment module to realize global matching of image and text, which can solve the problem that the matching network is misled by image semantics and ignores context alignment, and improves the accuracy of false information detection.

Claims (10)

1. The false information detection method based on multi-mode context hierarchical step-by-step alignment is characterized by comprising the following steps:
(1) Extracting basic features of images and texts, and respectively extracting features of images and texts in social media blogs by using a target detection model and a language characterization model to respectively obtain basic feature vectors of visual areas and text words;
(2) Extracting image and text weighted features: for the image and text basic feature vectors extracted in step (1), corresponding segments in different modalities are aligned using a bidirectional cross-attention mechanism; then, taking a text vocabulary as a reference, the weighted sum of the image region basic feature vectors describing that vocabulary is obtained as the vocabulary's weighted feature vector; similarly, taking an image region as a reference, the weighted sum of the text vocabulary basic feature vectors describing that region is obtained as the region's weighted feature vector;
(3) Image-text feature local matching, namely matching the image and text basic feature vector extracted in the step (1) with the image and text weighted feature vector extracted in the step (2) by utilizing similarity to obtain a local fragment level matching result;
(4) Extracting image and text enhancement features, namely exploring the context relation among features by utilizing a multi-head self-attention mechanism for the basic feature vectors of the image and the text extracted in the step (1) to respectively obtain the image and the text enhancement feature vectors fused with the context information;
(5) Matching the image-text enhancement features, and obtaining a global correlation matching result by utilizing a self-adaptive weight filtering module for the image and text enhancement features extracted in the step (4);
(6) And (3) detecting social media blogs, namely inputting the local segment level matching result obtained in the step (3) and the global relevance matching result obtained in the step (5) into a binary classifier, projecting the sum vector of two levels of result vectors to two target spaces of real information and false information by using a full-connection layer with a softmax function in the binary classifier, and obtaining the detection result of the social media blogs.
2. The method for detecting false information based on multi-modal context hierarchical step-by-step alignment according to claim 1, wherein in step (1), an image X and a text Y are defined; basic features of image regions are first extracted using a Faster R-CNN based on ResNet-101, and basic features of the text vocabulary are then extracted using a Bi-GRU network;
the specific steps of extracting the image area characteristics and the text vocabulary characteristics are as follows:
(1.1) extraction of basic features of image regions: for an image X, Faster R-CNN is selected to extract image region features, and the top m regions are selected to represent X according to their scores; a feature vector f_i, i∈[1,m], is then extracted for each image region using a pre-trained ResNet-101; finally, a fully connected layer converts each feature vector into a 1024-dimensional feature vector, finally expressed as a set of region features {x_1, x_2, ..., x_m}, X ∈ R^(d×m), as the basic features of the image regions, as shown in formula (1):
x_i = W_x f_i + b_x (1)
(1.2) extraction of basic features of the text vocabulary: for a sentence of text Y, each vocabulary is first encoded into a one-hot vector g_t, t∈[1,n], and embedded into a 300-dimensional vector o_t through a parameter matrix W_g and an offset vector b_g, as shown in formula (2):
o_t = W_g g_t + b_g (2)
next, bi-GRU is used to integrate context information in sentences captured from both front and back directions into text embedding, and then the final word feature y is obtained by averaging t As basic characteristics of text vocabulary, the following formula (3) is shown:
3. the method for detecting false information based on hierarchical alignment of multi-modal contexts according to claim 2, wherein in (2), the multi-modal weighted feature vectors of the image region and the text vocabulary are obtained by using a bi-directional cross-attention mechanism from the image region basic feature vector and the text vocabulary basic feature vector;
the specific steps of extracting the weighted feature vectors of the image area and the text vocabulary are as follows:
(2.1) calculating the similarity matrix of image regions and text vocabulary, as shown in formula (4):
A = (W_x X)(W_y Y)^T (4)
where W_x and W_y are learnable parameters, X = {x_1, x_2, ..., x_m} and Y = {y_1, y_2, ..., y_n} are the image region basic features and the text vocabulary basic features extracted in step (1.1) and step (1.2) respectively, A ∈ R^(m×n) is the image region-text vocabulary similarity matrix, and A_it denotes the semantic similarity of the i-th region and the t-th vocabulary;
(2.2) image region weighted feature extraction: taking a given image region as a reference, weights are assigned to the n vocabularies, and the weighted combination of the n vocabularies yields the text vocabulary feature y_i* corresponding to image region x_i, as shown in formula (5):
y_i* = Σ_{t=1}^{n} α_it y_t, with α_it = exp(λA_it) / Σ_{k=1}^{n} exp(λA_ik) (5)
where λ is the temperature parameter of the softmax function;
(2.3) text vocabulary weighted feature extraction: similarly, taking a given text vocabulary as a reference, the image region feature x_t* corresponding to text vocabulary y_t is obtained, as shown in formula (6):
x_t* = Σ_{i=1}^{m} α_ti' x_i, with α_ti' = exp(λA_it) / Σ_{k=1}^{m} exp(λA_kt) (6)
4. the method for detecting false information based on multi-modal context hierarchical step-by-step alignment according to claim 1, wherein in the step (3), the image and text basic feature vectors extracted in the step (1) and the image and text weighted feature vectors extracted in the step (2) are subjected to segment level matching, and the segment level matching scores of the complete image and the complete text are obtained on the basis of the segment level matching;
the specific steps for calculating the fragment level similarity are as follows:
(3.1) calculating the region-related segment-level matching score: for image region x_i and the corresponding text vocabulary feature y_i*, the similarity S_x(x_i, y_i*) is computed as shown in formula (7):
S_x(x_i, y_i*) = x_i^T y_i* / (||x_i|| ||y_i*||) (7)
(3.2) calculating the vocabulary-related segment-level matching score: for text vocabulary y_t and the corresponding image region feature x_t*, the similarity S_y(y_t, x_t*) is computed as shown in formula (8):
S_y(y_t, x_t*) = y_t^T x_t* / (||y_t|| ||x_t*||) (8)
(3.3) calculating the segment-level matching score S_segment(X, Y) of the complete image and the complete text, as shown in formula (9):
S_segment(X, Y) = η · (1/m) Σ_{i=1}^{m} S_x(x_i, y_i*) + (1-η) · (1/n) Σ_{t=1}^{n} S_y(y_t, x_t*) (9)
where η is a hyperparameter for balancing the contributions of the region-related segment-level matching scores S_x(x_i, y_i*) and the vocabulary-related segment-level matching scores S_y(y_t, x_t*).
5. The method for detecting false information based on multi-modal context hierarchical step-by-step alignment according to claim 1, wherein in step (4), a multi-head self-attention mechanism is used to explore the contextual relations among the image and text basic feature vectors extracted in step (1), and the image and text enhancement feature vectors fused with context information are obtained, with the following specific steps:
(4.1) image enhancement feature extraction: for the set of region features X = {x_1, x_2, ..., x_m} of an image X, a multi-head self-attention mechanism is used to explore the contextual relations between region features; the query Q_X, key K_X and value Val_X are first defined, as shown in formula (10):
Q_X = X W^Q, K_X = X W^K, Val_X = X W^Val (10)
where W^Q, W^K, W^Val are learnable feature mapping matrices; a weighted sum of the values Val_X is computed by formula (11):
Attention(Q_X, K_X, Val_X) = softmax(Q_X K_X^T / √d_k) Val_X (11)
where d_k is a scale factor, and the weight matrix Q_X K_X^T encodes, through inner products, the relation of each visual feature to all other features; the relations between the different objects are then encoded with formula (12):
R_j = Attention(Q_X W_j^(Q_X), K_X W_j^(K_X), Val_X W_j^(Val_X)) (12)
where j denotes the j-th head and each W in (12) is a learnable feature mapping matrix: the superscript Q_X indicates that the matrix serves the image query, Q denoting the query and X the image, and the subscript j indicates the j-th head; thus W_j^(Q_X) is the query weight matrix dedicated to processing the image input in the j-th head of the multi-head attention mechanism, and similarly W_j^(K_X) and W_j^(Val_X) are the key weight matrix and the value weight matrix dedicated to processing the image input in the j-th head;
finally, the i-th original visual region feature x_i is converted into the enhanced feature x_ip by combining the global objects with the locally guided objects, as shown in formula (13):
x_ip = Concat(R_1, R_2, ..., R_h) W_p + x_i (13)
where W_p is a learnable weight matrix and h is the number of heads; for an image, the enhancement of the set of region features X = {x_1, x_2, ..., x_m} can be represented as X_reinforcement = {x_1p, x_2p, ..., x_mp};
(4.2) text enhancement feature extraction: for the text features Y = {y_1, y_2, ..., y_n} of a piece of text Y, the same procedure as in step (4.1) is performed to obtain the text enhancement features Y_reinforcement = {y_1p, y_2p, ..., y_np};
(4.3) weighted enhancement feature extraction: for the weighted features extracted in step (2), X* = {x_1*, x_2*, ..., x_n*} and Y* = {y_1*, y_2*, ..., y_m*}, the same procedure as in step (4.1) is performed to obtain the enhanced weighted features, denoted X_reinforcement* = {x_1p*, x_2p*, ..., x_np*} and Y_reinforcement* = {y_1p*, y_2p*, ..., y_mp*} respectively.
6. The method for detecting false information based on multi-mode context hierarchical step alignment according to claim 1, wherein in the step (5), the image and text enhancement features extracted in the step (4) are filtered by using an adaptive weight filtering module to obtain a global relevance matching result, and the specific steps are as follows:
(5.1) calculating the image-to-text global matching score: the enhanced basic feature pair (X_reinforcement, Y_reinforcement) and the enhanced weighted feature pair (Y_reinforcement*, X_reinforcement*) obtained in step (4) are taken as input; global semantics are generated by weighting and fusing the region features, and the global semantic similarity between the two modalities is computed to obtain the image-to-text global matching score s_x→y, as shown in formula (14):
s_x→y = cos( Σ_{l=1}^{m} w_x x_lp , Σ_{l=1}^{m} w_x y_lp* ) (14)
where w_x = cos(x_lp, y_lp*) is the adaptive weight, representing the importance of the enhanced basic feature x_lp and the enhanced weighted feature y_lp*;
(5.2) calculating the text-to-image global matching score: the enhanced basic feature pair (X_reinforcement, Y_reinforcement) and the enhanced weighted feature pair (Y_reinforcement*, X_reinforcement*) are taken as input; global semantics are generated by weighting and fusing the features, and the global semantic similarity between the two modalities is computed to obtain the text-to-image global matching score s_y→x, as shown in formula (15):
s_y→x = cos( Σ_{l=1}^{n} w_y y_lp , Σ_{l=1}^{n} w_y x_lp* ) (15)
where, similar to (5.1), w_y = cos(y_lp, x_lp*) is the adaptive weight, representing the importance of the enhanced basic feature y_lp and the enhanced weighted feature x_lp*;
(5.3) calculating the global matching score S_global(X, Y) of the complete image and the complete text, as shown in formula (16):
S_global(X, Y) = s_x→y + s_y→x (16).
7. the method for detecting false information based on hierarchical alignment of multi-modal context according to claim 1, wherein in the step (6), the local segment level matching result obtained in the step (3) and the global matching result obtained in the step (5) are input into a binary classifier, and the sum vector of the result vectors of two levels is projected to the target space of both real information and false information by using the full connection layer with softmax function in the binary classifier, and the detection result of the social media blog is obtained, which comprises the following specific steps:
(6.1) after obtaining the local region-vocabulary matching result and the global matching result, a fully connected layer with a softmax function is used to project S_segment(X, Y) and S_global(X, Y) into a target space of only two categories, real or fake, and obtain the probability distribution, as shown in formula (17):
p = softmax( W (S_segment(X, Y) + S_global(X, Y)) + b ) (17)
where p = [p_0, p_1] represents the predicted probability vector, p_0 and p_1 respectively represent the probabilities of prediction results 0 and 1, 0 denoting real and 1 denoting fake; W is the weight matrix and b is the bias term.
(6.2) for each blog, the goal is to minimize a binary cross-entropy loss function to distinguish whether the blog is real information or false information; the loss function is shown in formula (18):
L_p = -[(1-y) ln p_0 + y ln p_1] (18)
where y ∈ {0,1} is the authenticity label of each blog, y = 0 indicating that the blog is authentic and y = 1 indicating that the blog contains false information.
8. The false information detection system based on multi-mode context hierarchical step-by-step alignment is characterized by comprising an image region feature extraction module, a text vocabulary feature extraction module, a local alignment module, a global context alignment module and a social media blog detection module;
the image region feature extraction module firstly detects a significant region by using Faster R-CNN, then extracts a feature vector for each image region by using pretrained ResNet-101, and finally converts the feature vector into a feature vector by using a full connection layer to obtain an image region feature vector;
the text vocabulary feature extraction module firstly codes each vocabulary into one-hot vectors, and embeds the one-hot vectors into the vector o through parameter matrixes and offset vectors t Then, capturing context information in sentences from front and back directions by using Bi-GRU, and integrating the context information into text embedding to obtain text vocabulary feature vectors;
the local alignment module firstly acquires a segment level cross-modal weighted feature vector through a bidirectional cross-attention mechanism, then calculates to obtain a segment level matching score related to a region and a segment level matching score related to a vocabulary, and finally averages all segment level matching scores based on the region and segment level matching scores based on the vocabulary to obtain a final segment level matching score of a complete image and a text;
the global context alignment module first uses a multi-head self-attention mechanism to capture the contextual relations between segment-level features and obtain enhanced feature vectors fused with context information, then calculates the image-to-text and text-to-image global matching scores respectively, and finally takes the sum of the global matching scores of the two directions as the global matching score of the complete image and text;
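The context-fusion step could be sketched with PyTorch's built-in multi-head attention; the residual connection and layer normalization are assumptions:

import torch.nn as nn

class ContextEnhancer(nn.Module):
    """Fuses context between segment-level features via multi-head self-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                    # (batch, L, dim) segment-level features
        ctx, _ = self.attn(feats, feats, feats)  # every feature attends to all the others
        return self.norm(feats + ctx)            # enhanced features fused with context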
the social media blog detection module inputs the segment-level similarity and the global similarity between the image and the text into a fully connected layer with a softmax function to obtain the detection result of whether the social media blog is real information or false information.
9. A computer device, characterized in that the computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for false information detection based on multi-modal context hierarchical step-wise alignment according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for performing the method for false information detection based on multi-modal context hierarchical step-wise alignment according to any one of claims 1-7.
CN202311569509.3A 2023-11-23 2023-11-23 False information detection method based on multi-mode context hierarchical step alignment Pending CN117521012A (en)

Priority Applications (1)

Application Number: CN202311569509.3A
Priority Date: 2023-11-23
Filing Date: 2023-11-23
Title: False information detection method based on multi-mode context hierarchical step alignment


Publications (1)

Publication Number: CN117521012A
Publication Date: 2024-02-06

Family ID: 89752805


Country Status (1)

CN: CN117521012A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952563A (en) * 2024-03-21 2024-04-30 武汉市特种设备监督检验所 Quick registration and examination method and system in elevator information system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN114662586A (en) * 2022-03-18 2022-06-24 南京邮电大学 Method for detecting false information based on common attention multi-mode fusion mechanism
CN116452939A (en) * 2023-05-11 2023-07-18 河海大学 Social media false information detection method based on multi-modal entity fusion and alignment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI CUI: "Multi-view mutual learning network for multimodal fake news detection", SSRN, 6 April 2023 (2023-04-06), pages 1 - 31 *


Similar Documents

Publication Publication Date Title
Li et al. Zero-shot event detection via event-adaptive concept relevance mining
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
CN112734881B (en) Text synthesized image method and system based on saliency scene graph analysis
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN113934882A (en) Fine-grained multi-mode false news detection method
CN113094566A (en) Deep confrontation multi-mode data clustering method
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
CN114662497A (en) False news detection method based on cooperative neural network
CN117521012A (en) False information detection method based on multi-mode context hierarchical step alignment
CN112347223A (en) Document retrieval method, document retrieval equipment and computer-readable storage medium
CN110659392B (en) Retrieval method and device, and storage medium
Xing et al. Ventral & dorsal stream theory based zero-shot action recognition
CN113656660A (en) Cross-modal data matching method, device, equipment and medium
Narayana et al. Huse: Hierarchical universal semantic embeddings
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116955730A (en) Training method of feature extraction model, content recommendation method and device
CN116452939A (en) Social media false information detection method based on multi-modal entity fusion and alignment
CN110867225A (en) Character-level clinical concept extraction named entity recognition method and system
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN117390299A (en) Interpretable false news detection method based on graph evidence
CN117094291A (en) Automatic news generation system based on intelligent writing
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination