CN112861945A - Multi-modal fusion lie detection method
- Publication number
- CN112861945A CN112861945A CN202110118420.XA CN202110118420A CN112861945A CN 112861945 A CN112861945 A CN 112861945A CN 202110118420 A CN202110118420 A CN 202110118420A CN 112861945 A CN112861945 A CN 112861945A
- Authority
- CN
- China
- Prior art keywords
- matrix
- feature
- neural network
- lie
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/251—Fusion techniques of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a multi-modal fusion lie detection method. Text, video and audio feature matrices are obtained based on a long short-term memory recurrent neural network, a feedforward fully-connected neural network and a convolutional neural network; an inter-question association graph neural network is then built from the fused text-video-audio feature matrix of each question and the associations among the questions answered by each individual; based on this graph neural network, an association representation matrix is obtained for each question node; finally, based on the representation matrices of the question nodes, a classification vector is obtained that judges whether the individual lies on each question. The invention is suited to question-level lie detection in real court trial or interrogation situations, at a finer granularity than current video-level lie detection.
Description
Technical Field
The invention relates to the field of computer technology, and in particular to a multi-modal fusion lie detection method.
Background
Lying is common in everyday life. Some lies are harmless and carry no malice. Others, however, such as lies told in court or during criminal proceedings, are destructive and harm justice and public security, since whether testimony is truthful bears directly on the verdict. Accurate and effective deception detection can therefore provide crucial support for evaluating the authenticity of testimony and reaching a judgment.
Existing lie detection for real trial or interrogation situations addresses only the video-level task: a single detection result is produced for a given video segment, regardless of whether the segment contains one or several interrogation questions. In a real trial, however, the truthfulness of the answer to each individual question must be assessed, and the same individual's responses to earlier and later questions are correlated; this correlation should be exploited to improve question-level lie detection.
As the above shows, no existing lie detection method or apparatus operates at the question level while considering the correlations among an individual's responses across questions.
Object of the Invention
The invention aims to address the above problems in the prior art. Specifically, for the question-level lie detection task, a multi-modal fusion lie detection method is provided that fuses information from the text, visual and audio modalities, learns the associations among the different questions answered by the same individual through a graph neural network, and discovers the correlations in individual behavior across questions, thereby improving lie detection performance. The method comprises two main parts: 1) text, visual and audio feature representation extraction and multi-modal fusion; 2) graph-neural-network-based associated feature learning and lie detection classification.
Disclosure of Invention
The invention discloses a multi-modal fusion lie detection method, which comprises the following steps:
step 1, extracting and fusing feature representations of lie-related behavior from the text, visual and audio modalities, specifically comprising:
obtaining the lie-related feature representation matrix of the text based on a long short-term memory recurrent neural network and a feedforward fully-connected neural network;
obtaining the lie-related feature representation matrix of the visual modality based on a convolutional neural network and a long short-term memory recurrent neural network;
obtaining the lie-related feature representation matrix of the audio based on an audio-processing deep neural network;
obtaining a multi-modal fusion feature matrix based on the obtained text, visual and audio feature representation matrices;
step 2, performing graph-neural-network-based associated feature learning and lie detection classification, specifically comprising:
building a fully-connected representation graph based on the multi-modal fusion feature matrices obtained in step 1, taking the multi-modal fusion feature matrix corresponding to each user's single-question behavior as a node;
obtaining the representation matrix corresponding to each user's single-question behavior based on the fully-connected representation graph and an attention mechanism;
and obtaining the lie/truth classification result based on the obtained per-question representation matrices and a feedforward fully-connected neural network.
Preferably, obtaining the lie-related feature representation matrix of the text in step 1 specifically comprises:
abstracting the text as a sentence of n words, denoted w = {w_1, w_2, …, w_n}, and encoding each word in the text w as a 768-dimensional word vector through the pre-trained model BERT, giving the word vector feature matrix X = [x_1, x_2, …, x_n];
obtaining the sequential text information with a long short-term memory recurrent neural network (LSTM); specifically, the word vector feature matrix X is fed into an LSTM layer, the LSTM layer being a bidirectional LSTM whose forward and backward LSTMs generate hidden states h_i^fwd and h_i^bwd for each word vector x_i; the two hidden state vectors are added, and the n resulting hidden states are concatenated into the text representation matrix H, as shown in equation (1):
H = [h_1, h_2, …, h_n], h_i = h_i^fwd + h_i^bwd (1),
obtaining, based on the attention mechanism, the importance weight of the word at each position of H, namely the importance weight distribution matrix A_st, and applying it to the text representation matrix by element-wise multiplication, as shown in equations (2) and (3):
A_st = softmax(W_1·H + b_1) (2),
H′ = H × A_st + H (3),
where A_st denotes the importance weight distribution matrix, W_1 and b_1 are trainable parameters, and softmax is the activation function;
obtaining the text feature representation matrix F_T through a feedforward fully-connected neural network, as shown in equation (4):
F_T = ReLU(W_2·H′ + b_2) (4).
preferably, n is 20.
Preferably, obtaining the lie-related feature representation matrix of the visual modality in step 1 specifically comprises:
abstracting the video as a sequence of T consecutive frames, denoted V = {frame_1, frame_2, …, frame_T};
obtaining, for each frame_i in V, a feature map M_i through a weight-shared pre-trained ResNet, followed by a (2 × 2) average pooling operation that shrinks the feature map;
flattening the pooled feature map of each frame from 3 dimensions to 1 dimension and concatenating the T feature maps into a matrix, as shown in equations (5) and (6):
M′_i = flatten(avgpool_{2×2}(M_i)) (5),
M′ = concat([M′_1, M′_2, …, M′_T]) (6),
applying an attention mechanism to focus on the images in the sequence that contribute more to judging whether the subject lies, and learning the temporal variation of the images in the video through a long short-term memory recurrent neural network (LSTM);
applying the attention mechanism to obtain the importance distribution of each feature map in the feature map matrix, and obtaining by element-wise multiplication the reweighted video feature map matrix M″, as shown in equations (7) and (8):
A_sv = softmax(W_3·M′ + b_3) (7),
M″ = M′ × A_sv + M′ (8),
where A_sv is the importance weight distribution matrix, softmax is the activation function, and W_3 and b_3 are trainable parameters;
feeding M″ into the LSTM, as shown in equation (9):
h_i = LSTM(M″_i, h_{i−1}) (9),
where h_i and h_{i−1} are the hidden vectors of the i-th and (i−1)-th feature maps, the last hidden vector h_T of the sequence being taken as the temporal feature representation of the video;
mapping h_T through a feedforward fully-connected network into the same dimensional space as F_T, as shown in equation (10):
F_V = ReLU(W_4·h_T + b_4) (10),
where F_V is the video feature representation matrix, W_4 and b_4 are trainable parameters, and ReLU is the activation function.
Preferably, T = 20.
Preferably, obtaining the lie-related feature representation matrix of the audio in step 1 specifically comprises:
abstracting the audio as a continuous sound signal sequence, denoted A = {a_1, a_2, …, a_L};
extracting the feature representation D_A from the raw sound signal with a pre-trained SincNet network, as shown in equation (11):
D_A = SincNet(A) (11),
mapping D_A through a feedforward fully-connected neural network into the same dimensional space as F_T, as shown in equation (12):
F_A = ReLU(W_5·D_A + b_5) (12),
where F_A is the audio feature representation matrix, W_5 and b_5 are trainable parameters, and ReLU is the activation function.
Preferably, in the sound signal sequence A, L = 200.
Preferably, obtaining the multi-modal fusion feature matrix in step 1 specifically comprises:
concatenating the obtained text, video and audio feature representation matrices into the multi-modal feature matrix F_M′, as shown in equation (13):
F_M′ = concat([F_T, F_V, F_A]) (13),
obtaining the importance weights of the different features in the multi-modal feature matrix through an attention mechanism, and applying the weights by element-wise multiplication to obtain the multi-modal fusion feature matrix F_M, as shown in equations (14) and (15):
A_mm = softmax(W_6·F_M′ + b_6) (14),
F_M = A_mm × F_M′ + F_M′ (15),
where A_mm is the importance weight distribution matrix, softmax is the activation function, W_6 and b_6 are trainable parameters, and F_M is the multi-modal fusion feature matrix.
Preferably, the graph-neural-network-based associated feature learning in step 2 specifically comprises:
representing the set of user individuals as O = {o_1, o_2, …, o_m}, where each o_i is an individual containing the responses to the several questions asked of that individual;
defining the graph neural network as G(U, E), where U represents the set of question nodes, each question node belonging to exactly one individual in O;
representing each question node by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question, E representing the set of edges: if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1;
applying neighbor influence to the vector representation of each question u_i, i.e. an individual's performance on one question is influenced by that individual's responses to the other questions, and the same individual's responses to different questions are similar;
obtaining, through a fully-connected layer, the initial hidden state vector h_i^0 of each question u_i from its fusion feature matrix F_M^i, as shown in equation (16):
h_i^0 = ReLU(W_7·F_M^i + b_7) (16),
concatenating, for each edge-connected question pair (u_i, u_j), the representation vectors h_i^0 and h_j^0 and obtaining the attention coefficient c_ij with a fully-connected layer, as shown in equation (17):
c_ij = W_8·concat([h_i^0, h_j^0]) + b_8 (17),
collecting the neighbor attention coefficients of u_i into a vector, as shown in equation (18), where |o_g| is the number of neighbors of u_i:
c_i = [c_i1, c_i2, …, c_i|o_g|] (18),
obtaining, through the softmax function, the influence exerted on question u_i by each neighbor node, as shown in equation (19):
α_i = softmax(c_i) (19),
updating the hidden state of u_i with the influence-weighted sum of its neighbors' hidden states, as shown in equation (20):
h_i^1 = Σ_j α_ij·h_j^0 (20),
obtaining the final representation matrix of the question node with one layer of fully-connected network, as shown in equation (21):
ĥ_i = ReLU(W_9·h_i^1 + b_9) (21).
Preferably, the graph-neural-network-based lie detection classification in step 2 specifically comprises:
obtaining the classification result with a feedforward fully-connected neural network, as shown in equation (22):
[y_0, y_1] = softmax(W_10·ĥ_i + b_10) (22),
where y_0 and y_1 indicate, for question u_i, whether deceptive behavior occurs, and W_10 and b_10 are trainable parameters.
Preferably, the lie detection method is used for lie detection in real court trial or interrogation situations.
Drawings
FIG. 1 illustrates the extraction of lie-related feature representations from text, vision and audio and their multi-modal fusion according to the present invention.
FIG. 2 is a flow chart of associative feature learning and lie detection classification based on graph neural networks according to the present invention.
Detailed Description
The question-level multi-modal fusion lie detection method of the present invention is described in detail below with reference to the accompanying drawings.
For the question-level lie detection task, the invention provides a multi-modal fusion method that fuses information from the text, visual and audio modalities, learns the associations among the different questions answered by the same individual through a graph neural network, and discovers the correlations in individual behavior across questions, thereby improving lie detection performance. The method comprises two main steps: 1) text, visual and audio feature representation extraction and multi-modal fusion; 2) graph-neural-network-based associated feature learning and lie detection classification.
FIG. 1 illustrates the extraction of lie-related feature representations from text, vision and audio and their multi-modal fusion according to the present invention. As shown in the figure, lie-related features of the text, video and audio are extracted into the corresponding representation matrices, which are then fused into the multi-modal fusion feature matrix.
The specific process is as follows:
Feature representation extraction and multi-modal fusion
The goal of this part is to obtain the feature representation matrices of the three modalities and to fuse them.
1. Text feature representation extraction
The text is abstracted as a sentence of n words, denoted w = {w_1, w_2, …, w_n}. Each word in w is encoded as a 768-dimensional word vector by the pre-trained model BERT, giving the word vector feature matrix X = [x_1, x_2, …, x_n]. In the present invention, n = 20.
To learn lie-related text features from semantic relations, a long short-term memory recurrent neural network (LSTM) captures the sequential information of the text. In addition, since different words in a sentence contribute differently to the detection result, an attention mechanism focuses on the more important words.
First, the word vector feature matrix X is fed into a bidirectional LSTM layer, whose forward and backward LSTMs generate hidden states h_i^fwd and h_i^bwd for each word vector x_i. The two hidden state vectors are added, and the n resulting hidden states are concatenated into the text representation matrix H = [h_1, h_2, …, h_n], where h_i = h_i^fwd + h_i^bwd.
Based on the attention mechanism, the importance weight of the word at each position of H is obtained and applied to the text representation matrix by element-wise multiplication:
A_st = softmax(W_1·H + b_1)
H′ = H × A_st + H
where A_st denotes the importance weight distribution matrix, W_1 and b_1 are trainable parameters, and softmax is the activation function.
H′ is then flattened from a matrix into a vector and passed through a feedforward fully-connected neural network to obtain the text feature representation matrix F_T:
F_T = ReLU(W_2·H′ + b_2)
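As an illustration, a minimal PyTorch sketch of this text branch follows. It is an interpretive reading of the steps above, not the patented implementation: the class name and the hidden and output sizes (256 and 128) are assumptions, while the 768-dimensional input matches the BERT word vectors and n = 20 matches the preferred sentence length.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Bi-LSTM + per-position attention + feed-forward projection (cf. Eqs. (1)-(4))."""
    def __init__(self, n_words=20, in_dim=768, hidden=256, out_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                  # logits for A_st
        self.proj = nn.Linear(n_words * hidden, out_dim)  # F_T = ReLU(W_2·H' + b_2)

    def forward(self, x):                # x: (batch, n, 768) BERT word vectors
        h, _ = self.lstm(x)              # (batch, n, 2*hidden)
        fwd, bwd = h.chunk(2, dim=-1)    # split forward / backward hidden states
        h = fwd + bwd                    # add them, forming H (Eq. (1))
        a = torch.softmax(self.attn(h), dim=1)  # importance weights A_st over positions
        h = h * a + h                    # element-wise reweighting with residual: H'
        return torch.relu(self.proj(h.flatten(1)))  # flatten H' to a vector, then F_T
```

For example, `TextBranch()(torch.randn(4, 20, 768))` returns a `(4, 128)` tensor, one F_T row per question.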
2. Visual feature representation extraction
The video is abstracted as a sequence of T consecutive frames, denoted V = {frame_1, frame_2, …, frame_T}. In the present invention, T = 20. First, for each frame_i in V, a feature map M_i is obtained through a weight-shared pre-trained ResNet and then shrunk by a (2 × 2) average pooling operation. The pooled feature map of each frame is flattened from 3 dimensions to 1 dimension as M′_i, and the T feature maps are concatenated into a matrix:
M′ = concat([M′_1, M′_2, …, M′_T])
To model the temporal information in the video, an attention mechanism focuses on the images in the sequence that contribute more to judging whether the subject lies, and the temporal variation of the images is learned through the LSTM recurrent neural network.
Since the video feature map matrix M′ contains the feature map of every image, the attention mechanism is first applied to obtain the importance distribution of the feature maps, and element-wise multiplication yields the reweighted video feature map matrix M″:
A_sv = softmax(W_3·M′ + b_3)
M″ = M′ × A_sv + M′
where A_sv is the importance weight distribution matrix, softmax is the activation function, and W_3 and b_3 are trainable parameters.
M″ is then fed into the LSTM, h_i = LSTM(M″_i, h_{i−1}), where h_i and h_{i−1} are the hidden vectors of the i-th and (i−1)-th feature maps; the last hidden vector h_T of the sequence is taken as the temporal feature representation of the video.
Finally, h_T is mapped through a feedforward fully-connected network into the same dimensional space as F_T:
F_V = ReLU(W_4·h_T + b_4)
where F_V is the video feature representation matrix, W_4 and b_4 are trainable parameters, and ReLU is the activation function.
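A corresponding sketch of the visual branch, under two stated assumptions: a torchvision ResNet-18 stands in for the unspecified weight-shared ResNet, and its built-in global average pooling stands in for the (2 × 2) average pooling described above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualBranch(nn.Module):
    """Per-frame ResNet features + frame attention + LSTM (cf. Eqs. (5)-(10))."""
    def __init__(self, hidden=256, out_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)               # one CNN shared by all frames
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.attn = nn.Linear(512, 1)                   # logits for A_sv
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)          # F_V = ReLU(W_4·h_T + b_4)

    def forward(self, v):                # v: (batch, T, 3, H, W) frame sequence
        b, t = v.shape[:2]
        m = self.cnn(v.flatten(0, 1)).flatten(1)        # per-frame feature vectors
        m = m.view(b, t, -1)                            # M': (batch, T, 512)
        a = torch.softmax(self.attn(m), dim=1)          # frame importance A_sv
        m = m * a + m                                   # reweighted M'' with residual
        _, (h_t, _) = self.lstm(m)                      # temporal modelling over frames
        return torch.relu(self.proj(h_t[-1]))          # F_V from the last hidden state
```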
3. Audio feature representation extraction
The audio is abstracted as a continuous sound signal sequence, denoted A = {a_1, a_2, …, a_L}. In the present invention, L = 200. A pre-trained SincNet network extracts the feature representation D_A from the raw sound signal:
D_A = SincNet(A)
D_A is then mapped through a feedforward fully-connected neural network into the same dimensional space as F_T:
F_A = ReLU(W_5·D_A + b_5)
where F_A is the audio feature representation matrix, W_5 and b_5 are trainable parameters, and ReLU is the activation function.
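A sketch of the audio branch. The patent relies on a pre-trained SincNet; since no checkpoint is reproducible here, a plain one-dimensional convolutional encoder stands in for it, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Raw-waveform encoder + feed-forward projection (cf. Eqs. (11)-(12))."""
    def __init__(self, feat_dim=64, out_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                   # stand-in for pre-trained SincNet
            nn.Conv1d(1, feat_dim, kernel_size=25, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(feat_dim, out_dim)        # F_A = ReLU(W_5·D_A + b_5)

    def forward(self, a):                # a: (batch, L) raw sound signal
        d = self.encoder(a.unsqueeze(1)).squeeze(-1)    # D_A: (batch, feat_dim)
        return torch.relu(self.proj(d))                 # F_A in F_T's dimensional space
```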
4. Feature representation matrix fusion
The text, video and audio feature representation matrices obtained above are concatenated into the multi-modal feature matrix F_M′:
F_M′ = concat([F_T, F_V, F_A])
The importance weights of the different features in the multi-modal feature matrix are obtained through an attention mechanism and applied by element-wise multiplication, yielding the multi-modal fusion feature matrix F_M:
A_mm = softmax(W_6·F_M′ + b_6)
F_M = A_mm × F_M′ + F_M′
where A_mm is the importance weight distribution matrix, softmax is the activation function, and W_6 and b_6 are trainable parameters.
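The fusion step then reduces to concatenation followed by a learned attention reweighting; a sketch, with dimensions matching the branch sketches above:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Concatenate F_T, F_V, F_A and reweight by attention (cf. Eqs. (13)-(15))."""
    def __init__(self, dim=128):
        super().__init__()
        self.attn = nn.Linear(3 * dim, 3 * dim)         # logits for A_mm

    def forward(self, f_t, f_v, f_a):                   # each: (batch, dim)
        f = torch.cat([f_t, f_v, f_a], dim=-1)          # F_M'
        a = torch.softmax(self.attn(f), dim=-1)         # A_mm
        return f * a + f                                # F_M
```

The residual term `f * a + f` mirrors F_M = A_mm × F_M′ + F_M′, so uninformative attention weights degrade gracefully to the plain concatenation.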
FIG. 2 is a flow chart of the graph-neural-network-based associated feature learning and lie detection classification according to the present invention. As shown in the figure, associated feature learning and lie detection classification are performed on the obtained multi-modal fusion feature matrices to judge whether the individual lies. The process is as follows:
Associative feature learning and lie detection classification based on graph neural networks
The goal of this part is to learn, by constructing a graph neural network, the associations among the behaviors exhibited on different questions, and to classify whether each question is answered deceptively.
1. Associative feature learning based on graph neural networks
First, the construction of the question graph is described. The set of user individuals is denoted O = {o_1, o_2, …, o_m}, where each o_i is an individual that may contain responses to several asked questions. The graph is defined as G(U, E), where U represents the set of question nodes, each question node belonging to exactly one individual in O and being represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question. E represents the set of edges: if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1.
Neighbor influence is applied to the vector representation of each question u_i; that is, an individual's performance on one question is influenced by that individual's responses to the other questions, and the same individual's responses to different questions are similar. Since different questions influence the response to a given question to different degrees, an attention mechanism is employed at the neighboring-question level.
First, a fully-connected layer maps the fusion feature matrix F_M^i of each question u_i to its initial hidden state vector h_i^0:
h_i^0 = ReLU(W_7·F_M^i + b_7)
For each edge-connected question pair (u_i, u_j), the representation vectors h_i^0 and h_j^0 are concatenated, and a fully-connected layer yields the attention coefficient c_ij:
c_ij = W_8·concat([h_i^0, h_j^0]) + b_8
For each u_i ∈ o_g, the neighbor attention coefficients of u_i are collected into the vector c_i = [c_i1, c_i2, …, c_i|o_g|], where |o_g| is the number of neighbors of u_i, and the softmax function gives the influence α_ij exerted on question u_i by each neighbor node: α_i = softmax(c_i).
The hidden state of u_i is then updated with the influence-weighted sum of its neighbors' hidden states: h_i^1 = Σ_j α_ij·h_j^0.
Finally, one fully-connected layer produces the final representation matrix of the question node: ĥ_i = ReLU(W_9·h_i^1 + b_9).
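A dense-matrix sketch of this neighbor attention. It materializes the full N × N adjacency, which is workable for the small per-individual question graphs described here; the hidden size and the inclusion of self-edges are assumptions.

```python
import torch
import torch.nn as nn

class QuestionGraphAttention(nn.Module):
    """One round of neighbor attention over question nodes (cf. Eqs. (16)-(21)).
    Two nodes are adjacent iff they belong to the same individual."""
    def __init__(self, in_dim=384, hidden=128):
        super().__init__()
        self.init_fc = nn.Linear(in_dim, hidden)        # initial state h_i^0
        self.attn_fc = nn.Linear(2 * hidden, 1)         # edge coefficients c_ij
        self.out_fc = nn.Linear(hidden, hidden)         # final node representation

    def forward(self, f, owner):   # f: (N, in_dim) fused features; owner: (N,) individual ids
        n = f.size(0)
        h = torch.relu(self.init_fc(f))                 # h_i^0 for every node
        adj = owner.unsqueeze(0) == owner.unsqueeze(1)  # same-individual edges (incl. self)
        pair = torch.cat([h.unsqueeze(1).expand(-1, n, -1),
                          h.unsqueeze(0).expand(n, -1, -1)], dim=-1)
        c = self.attn_fc(pair).squeeze(-1)              # (N, N) raw attention coefficients
        c = c.masked_fill(~adj, float('-inf'))          # restrict to neighbors
        alpha = torch.softmax(c, dim=-1)                # neighbor influence alpha_ij
        h = alpha @ h                                   # h_i^1: weighted neighbor sum
        return torch.relu(self.out_fc(h))               # final representation per node
```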
2. Lie detection classification
A feedforward fully-connected neural network produces the classification result: [y_0, y_1] = softmax(W_10·ĥ_i + b_10), where y_0 and y_1 indicate, for question u_i, whether deceptive behavior occurs, and W_10 and b_10 are trainable parameters.
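Finally, the classification head and a hypothetical end-to-end pass over the sketches above, for five questions from a single individual (tensor shapes follow n = 20, T = 20, L = 200; the 64 × 64 frame size is an assumption):

```python
import torch
import torch.nn as nn

# Assumes TextBranch, VisualBranch, AudioBranch, MultimodalFusion and
# QuestionGraphAttention from the sketches above are in scope.

# Classification head: two-way softmax over {truthful, deceptive}, cf. Eq. (22).
classifier = nn.Sequential(nn.Linear(128, 2), nn.Softmax(dim=-1))

text, video, audio = TextBranch(), VisualBranch(), AudioBranch()
fuse, gnn = MultimodalFusion(), QuestionGraphAttention()

f_m = fuse(text(torch.randn(5, 20, 768)),          # 5 questions x 20 BERT word vectors
           video(torch.randn(5, 20, 3, 64, 64)),   # 5 questions x 20 video frames
           audio(torch.randn(5, 200)))             # 5 questions x 200 audio samples
owner = torch.zeros(5, dtype=torch.long)           # all five questions, one individual
y = classifier(gnn(f_m, owner))                    # (5, 2): per-question [y_0, y_1]
```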
Advantages of the Invention
By constructing a question-association graph neural network, the invention reframes the existing lie detection task at the question level, takes into account the correlations in the same individual's performance across answered questions, and compensates for the weakness of single-source lie detection through multi-modal fusion.
Claims (11)
1. A multi-modal fusion lie detection method, characterized by comprising the following steps:
step 1, extracting and fusing feature representations of lie-related behavior from the text, visual and audio modalities, specifically comprising:
obtaining the lie-related feature representation matrix of the text based on a long short-term memory recurrent neural network and a feedforward fully-connected neural network;
obtaining the lie-related feature representation matrix of the visual modality based on a convolutional neural network and a long short-term memory recurrent neural network;
obtaining the lie-related feature representation matrix of the audio based on an audio-processing deep neural network;
obtaining a multi-modal fusion feature matrix based on the obtained text, visual and audio feature representation matrices;
step 2, performing graph-neural-network-based associated feature learning and lie detection classification, specifically comprising:
building a fully-connected representation graph based on the multi-modal fusion feature matrices obtained in step 1, taking the multi-modal fusion feature matrix corresponding to each user's single-question behavior as a node;
obtaining the representation matrix corresponding to each user's single-question behavior based on the fully-connected representation graph and an attention mechanism;
and obtaining the lie/truth classification result based on the obtained per-question representation matrices and a feedforward fully-connected neural network.
2. The multi-modal fusion lie detection method according to claim 1, wherein obtaining the lie-related feature representation matrix of the text in step 1 specifically comprises:
abstracting the text as a sentence of n words, denoted w = {w_1, w_2, …, w_n}, and encoding each word in the text w as a 768-dimensional word vector through the pre-trained model BERT, giving the word vector feature matrix X = [x_1, x_2, …, x_n];
obtaining the sequential text information with a long short-term memory recurrent neural network (LSTM); specifically, the word vector feature matrix X is fed into an LSTM layer, the LSTM layer being a bidirectional LSTM whose forward and backward LSTMs generate hidden states h_i^fwd and h_i^bwd for each word vector x_i; the two hidden state vectors are added, and the n resulting hidden states are concatenated into the text representation matrix H, as shown in equation (1):
H = [h_1, h_2, …, h_n], h_i = h_i^fwd + h_i^bwd (1),
obtaining, based on the attention mechanism, the importance weight of the word at each position of H, namely the importance weight distribution matrix A_st, and applying it to the text representation matrix by element-wise multiplication, as shown in equations (2) and (3):
A_st = softmax(W_1·H + b_1) (2),
H′ = H × A_st + H (3),
where A_st denotes the importance weight distribution matrix, W_1 and b_1 are trainable parameters, and softmax is the activation function;
obtaining the text feature representation matrix F_T through a feedforward fully-connected neural network, as shown in equation (4):
F_T = ReLU(W_2·H′ + b_2) (4).
3. The multi-modal fusion lie detection method according to claim 2, wherein n = 20.
4. The multi-modal fusion lie detection method according to claim 3, wherein obtaining the lie-related feature representation matrix of the visual modality in step 1 specifically comprises:
abstracting the video as a sequence of T consecutive frames, denoted V = {frame_1, frame_2, …, frame_T};
obtaining, for each frame_i in V, a feature map M_i through a weight-shared pre-trained ResNet, followed by a (2 × 2) average pooling operation that shrinks the feature map;
flattening the pooled feature map of each frame from 3 dimensions to 1 dimension and concatenating the T feature maps into a matrix, as shown in equations (5) and (6):
M′_i = flatten(avgpool_{2×2}(M_i)) (5),
M′ = concat([M′_1, M′_2, …, M′_T]) (6),
applying an attention mechanism to focus on the images in the sequence that contribute more to judging whether the subject lies, and learning the temporal variation of the images in the video through a long short-term memory recurrent neural network (LSTM);
applying the attention mechanism to obtain the importance distribution of each feature map in the feature map matrix, and obtaining by element-wise multiplication the reweighted video feature map matrix M″, as shown in equations (7) and (8):
A_sv = softmax(W_3·M′ + b_3) (7),
M″ = M′ × A_sv + M′ (8),
where A_sv is the importance weight distribution matrix, softmax is the activation function, and W_3 and b_3 are trainable parameters;
feeding M″ into the LSTM, as shown in equation (9):
h_i = LSTM(M″_i, h_{i−1}) (9),
where h_i and h_{i−1} are the hidden vectors of the i-th and (i−1)-th feature maps, the last hidden vector h_T of the sequence being taken as the temporal feature representation of the video;
mapping h_T through a feedforward fully-connected network into the same dimensional space as F_T, as shown in equation (10):
F_V = ReLU(W_4·h_T + b_4) (10),
where F_V is the video feature representation matrix, W_4 and b_4 are trainable parameters, and ReLU is the activation function.
5. The multi-modal fusion lie detection method according to claim 4, wherein T = 20.
6. The multi-modal fusion lie detection method according to claim 5, wherein obtaining the lie-related feature representation matrix of the audio in step 1 specifically comprises:
abstracting the audio as a continuous sound signal sequence, denoted A = {a_1, a_2, …, a_L};
extracting the feature representation D_A from the raw sound signal with a pre-trained SincNet network, as shown in equation (11):
D_A = SincNet(A) (11),
mapping D_A through a feedforward fully-connected neural network into the same dimensional space as F_T, as shown in equation (12):
F_A = ReLU(W_5·D_A + b_5) (12),
where F_A is the audio feature representation matrix, W_5 and b_5 are trainable parameters, and ReLU is the activation function.
7. The multi-modal fusion lie detection method according to claim 6, wherein, in the sound signal sequence A, L = 200.
8. The multi-modal fusion lie detection method according to claim 7, wherein obtaining the multi-modal fusion feature matrix in step 1 specifically comprises:
concatenating the obtained text, video and audio feature representation matrices into the multi-modal feature matrix F_M′, as shown in equation (13):
F_M′ = concat([F_T, F_V, F_A]) (13),
obtaining the importance weights of the different features in the multi-modal feature matrix through an attention mechanism, and applying the weights by element-wise multiplication to obtain the multi-modal fusion feature matrix F_M, as shown in equations (14) and (15):
A_mm = softmax(W_6·F_M′ + b_6) (14),
F_M = A_mm × F_M′ + F_M′ (15).
9. The multi-modal fusion lie detection method according to claim 8, wherein the graph-neural-network-based associated feature learning in step 2 specifically comprises:
representing the set of user individuals as O = {o_1, o_2, …, o_m}, where each o_i is an individual containing the responses to the several questions asked of that individual;
defining the graph neural network as G(U, E), where U represents the set of question nodes, each question node belonging to exactly one individual in O;
representing each question node by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question, E representing the set of edges: if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1;
applying neighbor influence to the vector representation of each question u_i, i.e. an individual's performance on one question is influenced by that individual's responses to the other questions, and the same individual's responses to different questions are similar;
obtaining, through a fully-connected layer, the initial hidden state vector h_i^0 of each question u_i from its fusion feature matrix F_M^i, as shown in equation (16):
h_i^0 = ReLU(W_7·F_M^i + b_7) (16),
concatenating, for each edge-connected question pair (u_i, u_j), the representation vectors h_i^0 and h_j^0 and obtaining the attention coefficient c_ij with a fully-connected layer, as shown in equation (17):
c_ij = W_8·concat([h_i^0, h_j^0]) + b_8 (17),
collecting the neighbor attention coefficients of u_i into a vector, as shown in equation (18), where |o_g| is the number of neighbors of u_i:
c_i = [c_i1, c_i2, …, c_i|o_g|] (18),
obtaining, through the softmax function, the influence exerted on question u_i by each neighbor node, as shown in equation (19):
α_i = softmax(c_i) (19),
updating the hidden state of u_i with the influence-weighted sum of its neighbors' hidden states, as shown in equation (20):
h_i^1 = Σ_j α_ij·h_j^0 (20),
obtaining the final representation matrix of the question node with one layer of fully-connected network, as shown in equation (21):
ĥ_i = ReLU(W_9·h_i^1 + b_9) (21).
10. The multi-modal fusion lie detection method according to claim 9, wherein the graph-neural-network-based lie detection classification in step 2 specifically comprises:
obtaining the classification result with a feedforward fully-connected neural network, as shown in equation (22):
[y_0, y_1] = softmax(W_10·ĥ_i + b_10) (22),
where y_0 and y_1 indicate, for question u_i, whether deceptive behavior occurs, and W_10 and b_10 are trainable parameters.
11. The multi-modal fusion lie detection method according to any one of claims 1 to 10, used for lie detection in real court trial or interrogation situations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110118420.XA CN112861945B (en) | 2021-01-28 | 2021-01-28 | Multi-mode fusion lie detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112861945A (en) | 2021-05-28
CN112861945B CN112861945B (en) | 2022-05-13 |
Family
ID=75987679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110118420.XA Active CN112861945B (en) | 2021-01-28 | 2021-01-28 | Multi-mode fusion lie detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861945B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180189572A1 (en) * | 2016-12-30 | 2018-07-05 | Mitsubishi Electric Research Laboratories, Inc. | Method and System for Multi-Modal Fusion Model |
US20200335092A1 (en) * | 2019-04-20 | 2020-10-22 | Behavioral Signal Technologies, Inc. | Deep hierarchical fusion for machine intelligence applications |
CN110301920A (en) * | 2019-06-27 | 2019-10-08 | 清华大学 | Multi-modal fusion method and device for psychological pressure detection |
CN110609891A (en) * | 2019-09-18 | 2019-12-24 | 合肥工业大学 | Visual dialog generation method based on context awareness graph neural network |
Non-Patent Citations (2)
Title |
---|
DONGLI WANG, et al.: "Human action recognition based on multi-mode spatial-temporal feature fusion", 22nd International Conference on Information Fusion |
HU JUN: "Research on Representation Learning of Social Multimedia Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240056A (en) * | 2021-07-12 | 2021-08-10 | 北京百度网讯科技有限公司 | Multi-mode data joint learning model training method and device |
WO2023036159A1 (en) * | 2021-09-07 | 2023-03-16 | Huawei Technologies Co., Ltd. | Methods and devices for audio visual event localization based on dual perspective networks |
CN113869276A (en) * | 2021-10-15 | 2021-12-31 | 山东大学 | Lie recognition method and system based on micro-expression |
CN114881668A (en) * | 2022-05-05 | 2022-08-09 | 灵图数据(杭州)有限公司 | Multi-mode-based deception detection method |
CN116522212A (en) * | 2023-07-05 | 2023-08-01 | 清华大学 | Lie detection method, device, equipment and medium based on image text fusion |
CN116522212B (en) * | 2023-07-05 | 2023-09-26 | 清华大学 | Lie detection method, device, equipment and medium based on image text fusion |
Also Published As
Publication number | Publication date |
---|---|
CN112861945B (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112861945B (en) | Multi-mode fusion lie detection method | |
KR102071582B1 (en) | Method and apparatus for classifying a class to which a sentence belongs by using deep neural network | |
CN109583501B (en) | Method, device, equipment and medium for generating image classification and classification recognition model | |
Alkhulaifi et al. | Knowledge distillation in deep learning and its applications | |
CN111597830A (en) | Multi-modal machine learning-based translation method, device, equipment and storage medium | |
CN111414461B (en) | Intelligent question-answering method and system fusing knowledge base and user modeling | |
CN111368074A (en) | Link prediction method based on network structure and text information | |
US20200134455A1 (en) | Apparatus and method for training deep learning model | |
CN112818861A (en) | Emotion classification method and system based on multi-mode context semantic features | |
CN111582397A (en) | CNN-RNN image emotion analysis method based on attention mechanism | |
CN111178319A (en) | Video behavior identification method based on compression reward and punishment mechanism | |
CN112651940B (en) | Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network | |
CN111598118A (en) | Visual question-answering task implementation method and system | |
CN113628059A (en) | Associated user identification method and device based on multilayer graph attention network | |
CN113988079A (en) | Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method | |
CN112131345A (en) | Text quality identification method, device, equipment and storage medium | |
CN113849599B (en) | Joint false news detection method based on mode information and fact information | |
CN113627550A (en) | Image-text emotion analysis method based on multi-mode fusion | |
CN116206227B (en) | Picture examination system and method for 5G rich media information, electronic equipment and medium | |
CN117349402A (en) | Emotion cause pair identification method and system based on machine reading understanding | |
CN116975776A (en) | Multi-mode data fusion method and device based on tensor and mutual information | |
CN113159071B (en) | Cross-modal image-text association anomaly detection method | |
CN115620342A (en) | Cross-modal pedestrian re-identification method, system and computer | |
CN114844682A (en) | DGA domain name detection method and system | |
Rungta et al. | A deep learning based approach to measure confidence for virtual interviews |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |