CN112861945A - Multi-mode fusion lie detection method - Google Patents

Multi-mode fusion lie detection method

Info

Publication number
CN112861945A
Authority
CN
China
Prior art keywords
matrix
feature
neural network
lie
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110118420.XA
Other languages
Chinese (zh)
Other versions
CN112861945B (en)
Inventor
冯铃
张慧君
曹檑
丁扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110118420.XA priority Critical patent/CN112861945B/en
Publication of CN112861945A publication Critical patent/CN112861945A/en
Application granted granted Critical
Publication of CN112861945B publication Critical patent/CN112861945B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a multi-modal fusion lie detection method. Text, video and audio feature matrices are first obtained with a long short-term memory recurrent neural network, a feedforward fully-connected neural network and a convolutional neural network. An inter-question association graph neural network is then built from the fused text, video and audio feature matrix of each question and the associations between the questions of each individual. Based on this graph neural network, an associated expression matrix is obtained for each question node, and finally a classification vector is obtained from each node's associated expression matrix to judge whether the individual lies on that question. The invention is suitable for question-level lie detection in real court trial or interrogation situations, at a finer granularity than current video-level lie detection.

Description

Multi-mode fusion lie detection method
Technical Field
The invention relates to the technical field of computers, in particular to a multi-mode fusion lie detection method.
Background
Lying is common in everyday life. Some lies are harmless and not malicious. However, other lies, such as those told in court or during criminal proceedings, are destructive and have adverse effects on justice and public security, and whether such statements are truthful bears directly on the fairness of the verdict. Therefore, accurate and effective deception detection can provide crucial support for evaluating the authenticity of testimony and reaching a judgment.
Existing lie detection for real trial or interrogation situations focuses only on the video level of the task: a single detection result is given for a video segment, regardless of whether the segment contains one or several interrogation questions. In a real trial, however, the authenticity of the answer to each individual question is needed, and the behavior of the same individual across successive answers is correlated; this correlation should be exploited to improve question-level lie detection.
As can be seen from the above, there is currently no question-level lie detection method or apparatus that considers the relevance between an individual's responses across different questions.
Object of the Invention
The invention aims to provide a multi-modal fusion lie detection method that addresses the above problems in the prior art. Specifically, for the question-level lie detection task, a multi-modal fusion lie detection method is provided that fuses information from the text, visual and audio modalities, learns the associations between different questions of the same individual through a graph neural network, and discovers the correlations in an individual's behavior across questions, thereby improving lie detection performance. The method mainly comprises two aspects: 1) text, visual and audio feature expression extraction and multi-modal fusion; 2) associative feature learning and lie detection classification based on a graph neural network.
Disclosure of Invention
The invention discloses a multi-mode fusion lie detection method, which comprises the following steps:
step 1, extracting lie-behavior-related feature expressions from text, vision and audio and performing multi-modal fusion, specifically comprising the following steps:
acquiring the lie-behavior-related feature expression matrix of the text based on a long short-term memory recurrent neural network and a feedforward fully-connected neural network;
acquiring the lie-behavior-related feature expression matrix of the vision based on a convolutional neural network and a long short-term memory recurrent neural network;
acquiring the lie-behavior-related feature expression matrix of the audio based on an audio-processing deep neural network;
acquiring a multi-modal fusion feature matrix based on the acquired text, visual and audio lie-behavior-related feature expression matrices;
step 2, performing associated feature learning and lie detection classification based on a graph neural network, specifically comprising the following steps:
acquiring a fully-connected representation graph based on the multi-modal fusion feature matrices acquired in step 1, with the multi-modal fusion feature matrix corresponding to each single-question behavior of each user taken as a node;
acquiring the expression matrix corresponding to each single-question behavior of each user based on the fully-connected representation graph and an attention mechanism;
and acquiring the classification result of whether the user lies based on the acquired expression matrix corresponding to each single-question behavior of each user and a feedforward fully-connected neural network.
Preferably, the obtaining of the lie-behavior related feature expression matrix of the text in step 1 specifically includes:
abstracting the text into a sentence containing n words, denoted w = {w_1, w_2, …, w_n}; each word in the text w is encoded into a 768-dimensional word vector by the pre-trained model BERT, giving the word-vector feature matrix X ∈ R^(n×768);
acquiring the sequential text information with a long short-term memory recurrent neural network (LSTM); specifically, the word-vector feature matrix X is input into an LSTM layer, the LSTM layer being bidirectional: the forward and backward LSTMs respectively generate a forward hidden state and a backward hidden state for each word vector x_i; the two hidden-state vectors are added to give h_i, and the n hidden states are concatenated to form the text representation matrix H, as shown in formula (1):
H = concat([h_1, h_2, …, h_n])   (1);
based on the attention mechanism, the importance weight of the word at each position in the text representation matrix H is obtained and assigned to the text representation matrix by element-wise multiplication with the importance-weight distribution matrix A_st, as shown in formulas (2) and (3):
A_st = softmax(W_1·H + b_1)   (2),
H′ = H × A_st + H   (3),
where A_st denotes the importance-weight distribution matrix, W_1 and b_1 are trainable parameter vectors, softmax is an activation function, and × denotes element-wise multiplication;
converting H′ from a two-dimensional matrix into a one-dimensional vector;
obtaining the text feature expression matrix F_T through a feedforward fully-connected neural network, as shown in formula (4):
F_T = ReLU(W_2·H′ + b_2)   (4),
where W_2 and b_2 are trainable parameter vectors and ReLU is an activation function.
preferably, n is 20.
Preferably, the acquiring a visual lie behavior related feature expression matrix in step 1 specifically includes:
abstracting the video into a sequence of T consecutive image frames, denoted V = {frame_1, frame_2, …, frame_T};
for each frame_i in V, acquiring a feature map M_i through a weight-shared pre-trained ResNet model, followed by a (2 × 2) average pooling operation to reduce the feature map;
flattening the feature map of each frame from 3 dimensions to 1 dimension and concatenating the T flattened feature maps into a matrix, as shown in formulas (5) and (6):
M′_i = flatten(M_i)   (5),
M′ = concat([M′_1, M′_2, …, M′_T])   (6),
where M′ denotes the feature map matrix of the video;
applying an attention mechanism to focus on the images in the sequence that contribute more to judging whether a lie occurs, and learning the temporal variation of the images in the video through a long short-term memory recurrent neural network (LSTM);
applying the attention mechanism to obtain the importance distribution of each feature map in the feature map matrix, and obtaining, by element-wise multiplication, the video feature map matrix M̂ with weights redistributed according to the importance distribution, as shown in formulas (7) and (8):
A_sv = softmax(W_3·M′ + b_3)   (7),
M̂ = M′ × A_sv + M′   (8),
where A_sv is the importance-weight distribution matrix, softmax is an activation function, and W_3 and b_3 are trainable parameter vectors;
inputting the weighted video feature map matrix M̂ into the LSTM, as shown in formula (9):
h_i = LSTM(M̂_i, h_{i-1})   (9),
where h_i and h_{i-1} respectively denote the hidden vectors of the i-th and (i-1)-th feature maps, and the last hidden vector h_T of the sequence is taken as the temporal feature expression of the video;
mapping h_T, through a feedforward fully-connected network, into the same dimensional space as F_T, as shown in formula (10):
F_V = ReLU(W_4·h_T + b_4)   (10),
where F_V is the video feature expression matrix, W_4 and b_4 are trainable parameter vectors, and ReLU is an activation function.
Preferably, T is 20.
Preferably, the obtaining of the lie-behavior related feature expression matrix of the audio in step 1 specifically includes:
abstracting the audio into a continuous sound-signal sequence, denoted A = {a_1, a_2, …, a_L};
extracting a feature expression from the raw sound signal with a pre-trained SincNet network to obtain D_A, as shown in formula (11):
D_A = SincNet(A)   (11);
mapping D_A, through a feedforward fully-connected neural network, into the same dimensional space as F_T, as shown in formula (12):
F_A = ReLU(W_5·D_A + b_5)   (12),
where F_A is the audio feature expression matrix, W_5 and b_5 are trainable parameter vectors, and ReLU is an activation function.
Preferably, in the sound-signal sequence A, L is 200.
Preferably, the obtaining the multi-modal fusion feature matrix in step 1 specifically includes:
concatenating the obtained text, video and audio feature expression matrices to obtain the multi-modal feature matrix F_M′, as shown in formula (13):
F_M′ = concat([F_T, F_V, F_A])   (13);
obtaining the importance contribution weights of the different features in the multi-modal feature matrix through an attention mechanism, and assigning the weights to the multi-modal feature matrix by element-wise multiplication to obtain the multi-modal fusion feature matrix F_M, as shown in formulas (14) and (15):
A_mm = softmax(W_6·F_M′ + b_6)   (14),
F_M = A_mm × F_M′ + F_M′   (15),
where A_mm is the importance-weight distribution matrix, softmax is an activation function, W_6 and b_6 are trainable parameter vectors, and F_M is the multi-modal fusion feature matrix.
Preferably, the learning of the associated features based on the graph neural network in step 2 specifically includes:
representing the set of user individuals as O = {o_1, o_2, …, o_m}, where each o_i is an individual and each individual contains the responses to the several questions it was asked;
defining the graph neural network as G(U, E), where U denotes the set of question nodes, each question node belonging to exactly one individual in O;
each question node is represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question; E denotes the set of edges, and if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1;
applying neighbour influence to the representation vector of each question u_i, i.e. the performance of the same individual on one question is influenced by that individual's responses to other questions, and the responses of the same individual to different questions are similar;
obtaining, through a fully-connected layer, the initial hidden-state vector x_i^(0) of any question u_i from its multi-modal fusion feature matrix F_M^i, as shown in equation (16):
x_i^(0) = W_7·F_M^i + b_7   (16),
where W_7 and b_7 are trainable parameters;
for each question pair (u_i, u_j) connected by an edge, concatenating the representation vectors x_i^(0) and x_j^(0) and obtaining the attention factor c_ij with a fully-connected layer, as shown in equation (17):
c_ij = W_8·[x_i^(0); x_j^(0)] + b_8   (17),
where W_8 and b_8 are trainable parameters;
for u_i ∈ o_g, obtaining the vector of neighbor attention coefficients of u_i, as shown in equation (18):
c_i = [c_i1, c_i2, …, c_i|o_g|]   (18),
where |o_g| is the number of neighbors adjacent to u_i;
obtaining, through the softmax function, the influence α_ij exerted by each neighbor node on question u_i, as shown in equation (19):
α_ij = exp(c_ij) / Σ_k exp(c_ik)   (19);
denoting the set of neighbors of u_i as N(u_i) and updating the hidden state of u_i through the attention coefficients α_ij, as shown in equation (20):
x_i^(1) = σ( Σ_{j ∈ N(u_i)} α_ij · x_j^(0) )   (20),
where σ is the sigmoid activation function;
obtaining the final expression matrix r_i of each question node using one layer of fully-connected network, as shown in equation (21):
r_i = W_9·x_i^(1) + b_9   (21),
where W_9 and b_9 are trainable parameters.
Preferably, the lie detection classification based on the graph neural network in step 2 specifically includes:
obtaining the classification result with a feedforward fully-connected neural network, as shown in formula (22):
[y_0, y_1] = softmax(W_10·r_i + b_10)   (22),
where y_0 and y_1 indicate whether or not deceptive behavior occurs for question u_i, and W_10 and b_10 are trainable parameters.
Preferably, the lie detection method is used for lie detection in a real court trial or interrogation situation.
Drawings
FIG. 1 is a diagram illustrating the extraction of lie-behavior-related feature expressions from text, vision and audio and their multi-modal fusion according to the present invention.
FIG. 2 is a flow chart of associative feature learning and lie detection classification based on graph neural networks according to the present invention.
Detailed Description
The question-level multi-modal fusion lie detection method according to the present invention is described in detail below with reference to the accompanying drawings.
For the question-level lie detection task, the invention provides a multi-modal fusion method that fuses information from the text, visual and audio modalities, performs association learning among different questions of the same individual through a graph neural network, and discovers the correlations in individual behavior across questions, thereby improving lie detection performance. The method mainly comprises the following two steps: 1) text, visual and audio feature expression extraction and multi-modal fusion; 2) associative feature learning and lie detection classification based on a graph neural network.
FIG. 1 illustrates the extraction of lie-behavior-related feature expressions from text, video and audio and their multi-modal fusion according to the present invention. As shown in the figure, the lie-behavior-related features of the text, video and audio are extracted into corresponding feature expression matrices, and the multi-modal fusion feature matrix is obtained after multi-modal fusion.
The related specific process is as follows:
feature expression extraction and multimodal fusion
The goal of this part is to obtain the feature expression matrices of the three modalities and their fusion.
1. Text feature expression extraction
The text is abstracted into a sentence containing n words, denoted w = {w_1, w_2, …, w_n}. Each word in the text w is encoded into a 768-dimensional word vector by the pre-trained model BERT, giving the word-vector feature matrix X ∈ R^(n×768). In the present invention, n is 20.
To learn lie-related text feature expressions from the semantic relations, a long short-term memory recurrent neural network (LSTM) is used to obtain the sequential text information. In addition, since different words in the sentence contribute differently to the detection result, an attention mechanism is used to focus on the more important words.
First, the word-vector feature matrix X is input into the LSTM layer. Using a bidirectional LSTM, the forward and backward LSTMs generate a forward hidden state and a backward hidden state for each word vector x_i; the two hidden-state vectors are added to give h_i, and the n hidden states are concatenated to form the text representation matrix H:
H = concat([h_1, h_2, …, h_n])
Based on the attention mechanism, the importance weight of the word at each position in the text representation matrix H is obtained and assigned to the text representation matrix by element-wise multiplication:
A_st = softmax(W_1·H + b_1)
H′ = H × A_st + H
where A_st is the importance-weight distribution matrix, W_1 and b_1 are trainable parameter vectors, softmax is an activation function, and × denotes element-wise multiplication. H′ is then converted from a two-dimensional matrix into a one-dimensional vector.
The text feature expression matrix F_T is obtained through a feedforward fully-connected neural network:
F_T = ReLU(W_2·H′ + b_2)
where W_2 and b_2 are trainable parameter vectors and ReLU is an activation function.
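As a concrete illustration of the text branch described above (a BiLSTM over BERT word vectors, attention reweighting of H, and a fully-connected projection to F_T), a minimal PyTorch-style sketch is given below. The module name, the hidden and output sizes, and the use of a single linear layer for the attention scores are illustrative assumptions; only n = 20 and the 768-dimensional BERT vectors come from the description.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text branch: BiLSTM + attention + fully-connected layer."""
    def __init__(self, bert_dim=768, hidden_dim=128, out_dim=64, n_words=20):
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.w1 = nn.Linear(hidden_dim, hidden_dim)          # attention scores A_st
        self.w2 = nn.Linear(n_words * hidden_dim, out_dim)   # maps flattened H' to F_T

    def forward(self, x):                     # x: (batch, n_words, 768) BERT word vectors
        h, _ = self.lstm(x)                   # (batch, n_words, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)
        h = fwd + bwd                         # sum of forward/backward states -> H
        a_st = torch.softmax(self.w1(h), dim=-1)
        h_prime = h * a_st + h                # element-wise reweighting with residual
        h_prime = h_prime.flatten(1)          # 2-D representation -> 1-D vector per sample
        return torch.relu(self.w2(h_prime))   # F_T
```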
2. Visual feature expression extraction
The video is abstracted into a sequence of T consecutive image frames, denoted V = {frame_1, frame_2, …, frame_T}. In the present invention, T is 20. First, for each frame_i in V, a feature map M_i is acquired through a weight-shared pre-trained ResNet model and then reduced by a (2 × 2) average pooling operation.
The feature map of each frame is flattened from 3 dimensions to 1 dimension, and the T flattened feature maps are concatenated into a matrix:
M′_i = flatten(M_i)
M′ = concat([M′_1, M′_2, …, M′_T])
where M′ denotes the feature map matrix of the video.
To model the temporal information in the video, an attention mechanism is applied to focus on the images in the sequence that contribute more to judging whether a lie occurs, and the temporal variation of the images is learned through the long short-term memory recurrent neural network (LSTM).
Because the feature map matrix M′ of the video contains the feature map corresponding to each image, the attention mechanism is first applied to obtain the importance distribution of each feature map in the matrix, and the video feature map matrix M̂ with weights redistributed according to the importance distribution is obtained by element-wise multiplication:
A_sv = softmax(W_3·M′ + b_3)
M̂ = M′ × A_sv + M′
where A_sv is the importance-weight distribution matrix, softmax is an activation function, and W_3 and b_3 are trainable parameter vectors.
The weighted video feature map matrix M̂ is then input into the LSTM:
h_i = LSTM(M̂_i, h_{i-1})
where h_i and h_{i-1} respectively denote the hidden vectors of the i-th and (i-1)-th feature maps; the last hidden vector h_T of the sequence is taken as the temporal feature expression of the video.
Finally, h_T is mapped through a feedforward fully-connected network into the same dimensional space as F_T:
F_V = ReLU(W_4·h_T + b_4)
where F_V is the video feature expression matrix, W_4 and b_4 are trainable parameter vectors, and ReLU is an activation function.
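The visual branch can be sketched in the same spirit: a weight-shared ResNet produces per-frame features, which are flattened, reweighted by attention, passed through an LSTM over time, and projected by a fully-connected layer. Using torchvision's resnet18 and relying on ResNet's own global average pooling in place of the (2 × 2) average pooling are assumptions made for brevity, as are all dimensions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VideoEncoder(nn.Module):
    """Sketch of the visual branch: ResNet features + attention + LSTM + fully-connected layer."""
    def __init__(self, feat_dim=512, hidden_dim=128, out_dim=64):
        super().__init__()
        backbone = resnet18(weights="DEFAULT")
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # weight-shared feature extractor
        self.w3 = nn.Linear(feat_dim, feat_dim)     # attention over the frame feature maps
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.w4 = nn.Linear(hidden_dim, out_dim)    # projection into the same space as F_T

    def forward(self, frames):                # frames: (batch, T, 3, H, W)
        b, t = frames.shape[:2]
        m = self.cnn(frames.flatten(0, 1))    # per-frame feature maps, pooled by the backbone
        m = m.flatten(1).view(b, t, -1)       # flatten to 1-D and stack into M'
        a_sv = torch.softmax(self.w3(m), dim=-1)
        m_hat = m * a_sv + m                  # reweighted feature map matrix
        h, _ = self.lstm(m_hat)               # temporal modelling of the frame sequence
        return torch.relu(self.w4(h[:, -1]))  # F_V from the last hidden state h_T
```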
3. Audio feature expression extraction
The audio is abstracted into a continuous sound-signal sequence, denoted A = {a_1, a_2, …, a_L}. In the present invention, L is 200. A feature expression D_A is extracted from the raw sound signal with a pre-trained SincNet network.
D_A is then mapped through a feedforward fully-connected neural network into the same dimensional space as F_T:
F_A = ReLU(W_5·D_A + b_5)
where F_A is the audio feature expression matrix, W_5 and b_5 are trainable parameter vectors, and ReLU is an activation function.
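Since SincNet is used here only as a fixed, pre-trained feature extractor, a sketch of this branch reduces to the fully-connected mapping of formula (12). The feature size sincnet_dim and the output size are assumptions; the SincNet module itself is treated as a given pre-trained network and is not reproduced.

```python
import torch
import torch.nn as nn

class AudioHead(nn.Module):
    """Sketch of the audio branch: map the pre-trained SincNet feature D_A into the F_T space."""
    def __init__(self, sincnet_dim=256, out_dim=64):
        super().__init__()
        self.w5 = nn.Linear(sincnet_dim, out_dim)

    def forward(self, d_a):                  # d_a: (batch, sincnet_dim), output of SincNet
        return torch.relu(self.w5(d_a))      # F_A = ReLU(W_5 D_A + b_5)
```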
4. Feature expression matrix fusion
The text, video and audio feature expression matrices obtained above are concatenated to obtain the multi-modal feature matrix F_M′:
F_M′ = concat([F_T, F_V, F_A])
The importance contribution weights of the different features in the multi-modal feature matrix are obtained through an attention mechanism and assigned to the multi-modal feature matrix by element-wise multiplication, giving the multi-modal fusion feature matrix F_M:
A_mm = softmax(W_6·F_M′ + b_6)
F_M = A_mm × F_M′ + F_M′
where A_mm is the importance-weight distribution matrix, softmax is an activation function, W_6 and b_6 are trainable parameter vectors, and F_M is the multi-modal fusion feature matrix.
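A sketch of this fusion step, assuming the three branch outputs have already been produced and concatenation happens along the feature dimension (the 64-dimensional branch outputs are an assumption carried over from the sketches above):

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of the fusion step: concatenate F_T, F_V, F_A, reweight with attention, add residual."""
    def __init__(self, dim=3 * 64):          # assumes each branch outputs a 64-dimensional vector
        super().__init__()
        self.w6 = nn.Linear(dim, dim)

    def forward(self, f_t, f_v, f_a):
        f_m_prime = torch.cat([f_t, f_v, f_a], dim=-1)    # multi-modal feature matrix F_M'
        a_mm = torch.softmax(self.w6(f_m_prime), dim=-1)  # importance weights A_mm
        return a_mm * f_m_prime + f_m_prime               # multi-modal fusion feature matrix F_M
```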
FIG. 2 is a flowchart of the associated feature learning and lie detection classification based on the graph neural network according to the present invention. As shown in the figure, associated feature learning and lie detection classification are performed on the obtained multi-modal fusion feature matrices to determine whether an individual lies. The process is described in detail below.
associative feature learning and lie detection classification based on graph neural networks
The aim of this part is to learn the relevance of the behaviors corresponding to different questions by constructing a graph neural network and to classify whether the individual lies when answering each question.
1. Associative feature learning based on graph neural networks
First, the construction of the question graph is described. The set of user individuals is denoted O = {o_1, o_2, …, o_m}, where each o_i is an individual and each individual may contain responses to several asked questions. The graph is defined as G(U, E), where U denotes the set of question nodes; each question node belongs to one individual in O and is represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question. E denotes the set of edges: if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1.
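As a small illustration of this graph construction, the helper below builds the binary edge matrix e_ij from a list assigning each question node to its individual; the function name, the tensor representation of the graph, and the exclusion of self-loops are assumptions made for the sketch.

```python
import torch

def build_adjacency(owner_ids):
    """Build e_ij = 1 when questions i and j belong to the same individual."""
    owners = torch.tensor(owner_ids)                      # e.g. [0, 0, 0, 1, 1] for two individuals
    adj = (owners.unsqueeze(0) == owners.unsqueeze(1)).float()
    adj.fill_diagonal_(0)                                 # neighbours of u_i are the *other* questions
    return adj
```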
Neighbour influence is applied to the representation vector of each question u_i: the performance of an individual on a given question is influenced by that individual's responses to the other questions, and the responses of the same individual to different questions are similar. Since different neighbouring questions affect the response to u_i to different degrees, an attention mechanism is employed at the level of adjacent questions.
First, through a fully-connected layer, the initial hidden-state vector x_i^(0) of any question u_i is obtained from its multi-modal fusion feature matrix F_M^i:
x_i^(0) = W_7·F_M^i + b_7
where W_7 and b_7 are trainable parameters.
For each question pair (u_i, u_j) connected by an edge, the representation vectors x_i^(0) and x_j^(0) are concatenated, and an attention factor c_ij is obtained with a fully-connected layer:
c_ij = W_8·[x_i^(0); x_j^(0)] + b_8
where W_8 and b_8 are trainable parameters.
For u_i ∈ o_g, the vector of neighbor attention coefficients of u_i is obtained:
c_i = [c_i1, c_i2, …, c_i|o_g|]
where |o_g| is the number of neighbors adjacent to u_i. The influence α_ij exerted by each neighbor node on question u_i is then obtained through the softmax function:
α_ij = exp(c_ij) / Σ_k exp(c_ik)
Denoting the set of neighbors of u_i as N(u_i), the hidden state of u_i is updated through the attention coefficients α_ij:
x_i^(1) = σ( Σ_{j ∈ N(u_i)} α_ij · x_j^(0) )
where σ is the sigmoid activation function.
Finally, one layer of fully-connected network is used to obtain the final expression matrix r_i of each question node:
r_i = W_9·x_i^(1) + b_9
where W_9 and b_9 are trainable parameters.
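Assuming the reconstructed forms of the update above (an initial fully-connected projection, pairwise attention factors, a softmax over neighbours, a sigmoid-gated weighted sum and a final projection), the whole question-association layer can be sketched as follows; all module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class QuestionGraphLayer(nn.Module):
    """Sketch of the question-association layer operating on the fused node features."""
    def __init__(self, in_dim=3 * 64, hid_dim=64, out_dim=64):
        super().__init__()
        self.init_fc = nn.Linear(in_dim, hid_dim)    # initial hidden state x_i^(0)
        self.attn_fc = nn.Linear(2 * hid_dim, 1)     # attention factor c_ij from concatenated pairs
        self.out_fc = nn.Linear(hid_dim, out_dim)    # final node expression r_i

    def forward(self, f_m, adj):
        # f_m: (N, in_dim) fused features of the N question nodes
        # adj: (N, N) binary matrix, adj[i, j] = 1 iff u_i and u_j share an individual
        x0 = self.init_fc(f_m)
        n = x0.size(0)
        pairs = torch.cat([x0.unsqueeze(1).expand(n, n, -1),
                           x0.unsqueeze(0).expand(n, n, -1)], dim=-1)
        c = self.attn_fc(pairs).squeeze(-1)                 # c[i, j] for every node pair
        c = c.masked_fill(adj == 0, float("-inf"))          # keep only edges of the same individual
        alpha = torch.softmax(c, dim=-1)                    # neighbour influences alpha_ij
        alpha = torch.nan_to_num(alpha)                     # nodes without neighbours get zero weight
        x1 = torch.sigmoid(alpha @ x0)                      # aggregated, sigmoid-gated hidden state
        return self.out_fc(x1)                              # final expression r_i per question node
```

In use, this layer would receive the outputs of the fusion step for all questions of the individuals in a batch, together with the adjacency matrix from build_adjacency above.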
2. Lie detection classification
A feedforward fully-connected neural network is used to obtain the classification result:
[y_0, y_1] = softmax(W_10·r_i + b_10)
where y_0 and y_1 indicate whether or not deceptive behavior occurs for question u_i, and W_10 and b_10 are trainable parameters.
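The classification head is a single fully-connected layer with a softmax over the two classes; in training it would typically be paired with a cross-entropy loss over the per-question lie labels, which is an assumption since the patent does not specify the loss.

```python
import torch
import torch.nn as nn

class LieClassifier(nn.Module):
    """Sketch of the classification head: one fully-connected layer with softmax over two classes."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2)

    def forward(self, r):                        # r: (N, in_dim) node expressions from the graph layer
        return torch.softmax(self.fc(r), dim=-1) # [y_0, y_1] per question (truthful / deceptive)
```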
Advantages of the Invention
The invention redefines the existing lie detection task at the question level by constructing a question-association graph neural network, takes into account the relevance between the performances of the same individual when answering different questions, and compensates for the limitation of single-data-source lie detection through multi-modal fusion.

Claims (11)

1. A multi-modal fusion lie detection method is characterized by comprising the following steps:
step 1, extracting lie-behavior-related feature expressions from text, vision and audio and performing multi-modal fusion, specifically comprising the following steps:
acquiring the lie-behavior-related feature expression matrix of the text based on a long short-term memory recurrent neural network and a feedforward fully-connected neural network;
acquiring the lie-behavior-related feature expression matrix of the vision based on a convolutional neural network and a long short-term memory recurrent neural network;
acquiring the lie-behavior-related feature expression matrix of the audio based on an audio-processing deep neural network;
acquiring a multi-modal fusion feature matrix based on the acquired text, visual and audio lie-behavior-related feature expression matrices;
step 2, performing associated feature learning and lie detection classification based on a graph neural network, specifically comprising the following steps:
acquiring a fully-connected representation graph based on the multi-modal fusion feature matrices acquired in step 1, with the multi-modal fusion feature matrix corresponding to each single-question behavior of each user taken as a node;
acquiring the expression matrix corresponding to each single-question behavior of each user based on the fully-connected representation graph and an attention mechanism;
and acquiring the classification result of whether the user lies based on the acquired expression matrix corresponding to each single-question behavior of each user and a feedforward fully-connected neural network.
2. The multi-modal fusion lie detection method according to claim 1, wherein the obtaining of the lie behavior related feature expression matrix of the text in step 1 specifically comprises:
abstracting the text into a sentence containing n words, denoted w = {w_1, w_2, …, w_n}; each word in the text w is encoded into a 768-dimensional word vector by the pre-trained model BERT, giving the word-vector feature matrix X ∈ R^(n×768);
acquiring the sequential text information with a long short-term memory recurrent neural network (LSTM); specifically, the word-vector feature matrix X is input into an LSTM layer, the LSTM layer being bidirectional: the forward and backward LSTMs respectively generate a forward hidden state and a backward hidden state for each word vector x_i; the two hidden-state vectors are added to give h_i, and the n hidden states are concatenated to form the text representation matrix H, as shown in formula (1):
H = concat([h_1, h_2, …, h_n])   (1);
based on the attention mechanism, the importance weight of the word at each position in the text representation matrix H is obtained and assigned to the text representation matrix by element-wise multiplication with the importance-weight distribution matrix A_st, as shown in formulas (2) and (3):
A_st = softmax(W_1·H + b_1)   (2),
H′ = H × A_st + H   (3),
where A_st denotes the importance-weight distribution matrix, W_1 and b_1 are trainable parameter vectors, softmax is an activation function, and × denotes element-wise multiplication;
converting H′ from a two-dimensional matrix into a one-dimensional vector;
obtaining the text feature expression matrix F_T through a feedforward fully-connected neural network, as shown in formula (4):
F_T = ReLU(W_2·H′ + b_2)   (4),
where W_2 and b_2 are trainable parameter vectors and ReLU is an activation function.
3. A multi-modal fusion lie detection method according to claim 2, wherein n is 20.
4. The multi-modal fusion lie detection method according to claim 3, wherein the obtaining of the visual lie behavior related feature expression matrix in step 1 specifically comprises:
abstracting the video into a sequence of T consecutive image frames, denoted V = {frame_1, frame_2, …, frame_T};
for each frame_i in V, acquiring a feature map M_i through a weight-shared pre-trained ResNet model, followed by a (2 × 2) average pooling operation to reduce the feature map;
flattening the feature map of each frame from 3 dimensions to 1 dimension and concatenating the T flattened feature maps into a matrix, as shown in formulas (5) and (6):
M′_i = flatten(M_i)   (5),
M′ = concat([M′_1, M′_2, …, M′_T])   (6),
where M′ denotes the feature map matrix of the video;
applying an attention mechanism to focus on the images in the sequence that contribute more to judging whether a lie occurs, and learning the temporal variation of the images in the video through a long short-term memory recurrent neural network (LSTM);
applying the attention mechanism to obtain the importance distribution of each feature map in the feature map matrix, and obtaining, by element-wise multiplication, the video feature map matrix M̂ with weights redistributed according to the importance distribution, as shown in formulas (7) and (8):
A_sv = softmax(W_3·M′ + b_3)   (7),
M̂ = M′ × A_sv + M′   (8),
where A_sv is the importance-weight distribution matrix, softmax is an activation function, and W_3 and b_3 are trainable parameter vectors;
inputting the weighted video feature map matrix M̂ into the LSTM, as shown in formula (9):
h_i = LSTM(M̂_i, h_{i-1})   (9),
where h_i and h_{i-1} respectively denote the hidden vectors of the i-th and (i-1)-th feature maps, and the last hidden vector h_T of the sequence is taken as the temporal feature expression of the video;
mapping h_T, through a feedforward fully-connected network, into the same dimensional space as F_T, as shown in formula (10):
F_V = ReLU(W_4·h_T + b_4)   (10),
where F_V is the video feature expression matrix, W_4 and b_4 are trainable parameter vectors, and ReLU is an activation function.
5. The method of claim 4, wherein T is 20.
6. The multi-modal fusion lie detection method according to claim 5, wherein the obtaining of the lie behavior related feature expression matrix of the audio in step 1 specifically comprises:
abstracting the audio into a continuous sound-signal sequence, denoted A = {a_1, a_2, …, a_L};
extracting a feature expression from the raw sound signal with a pre-trained SincNet network to obtain D_A, as shown in formula (11):
D_A = SincNet(A)   (11);
mapping D_A, through a feedforward fully-connected neural network, into the same dimensional space as F_T, as shown in formula (12):
F_A = ReLU(W_5·D_A + b_5)   (12),
where F_A is the audio feature expression matrix, W_5 and b_5 are trainable parameter vectors, and ReLU is an activation function.
7. The method of claim 6, wherein L is 200 in the sound signal sequence A.
8. The method for detecting the multi-modal fusion lie according to claim 7, wherein the obtaining the multi-modal fusion feature matrix in step 1 specifically includes:
concatenating the obtained text, video and audio feature expression matrices to obtain the multi-modal feature matrix F_M′, as shown in formula (13):
F_M′ = concat([F_T, F_V, F_A])   (13);
obtaining the importance contribution weights of the different features in the multi-modal feature matrix through an attention mechanism, and assigning the weights to the multi-modal feature matrix by element-wise multiplication to obtain the multi-modal fusion feature matrix F_M, as shown in formulas (14) and (15):
A_mm = softmax(W_6·F_M′ + b_6)   (14),
F_M = A_mm × F_M′ + F_M′   (15),
where A_mm is the importance-weight distribution matrix, softmax is an activation function, W_6 and b_6 are trainable parameter vectors, and F_M is the multi-modal fusion feature matrix.
9. The multi-modal fusion lie detection method according to claim 8, wherein the performing associated feature learning based on the graph neural network in step 2 specifically comprises:
representing the set of user individuals as O = {o_1, o_2, …, o_m}, where each o_i is an individual and each individual contains the responses to the several questions it was asked;
defining the graph neural network as G(U, E), where U denotes the set of question nodes, each question node belonging to exactly one individual in O;
each question node is represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question; E denotes the set of edges, and if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1;
applying neighbour influence to the representation vector of each question u_i, i.e. the performance of the same individual on one question is influenced by that individual's responses to other questions, and the responses of the same individual to different questions are similar;
obtaining, through a fully-connected layer, the initial hidden-state vector x_i^(0) of any question u_i from its multi-modal fusion feature matrix F_M^i, as shown in equation (16):
x_i^(0) = W_7·F_M^i + b_7   (16),
where W_7 and b_7 are trainable parameters;
for each question pair (u_i, u_j) connected by an edge, concatenating the representation vectors x_i^(0) and x_j^(0) and obtaining the attention factor c_ij with a fully-connected layer, as shown in equation (17):
c_ij = W_8·[x_i^(0); x_j^(0)] + b_8   (17),
where W_8 and b_8 are trainable parameters;
for u_i ∈ o_g, obtaining the vector of neighbor attention coefficients of u_i, as shown in equation (18):
c_i = [c_i1, c_i2, …, c_i|o_g|]   (18),
where |o_g| is the number of neighbors adjacent to u_i;
obtaining, through the softmax function, the influence α_ij exerted by each neighbor node on question u_i, as shown in equation (19):
α_ij = exp(c_ij) / Σ_k exp(c_ik)   (19);
denoting the set of neighbors of u_i as N(u_i) and updating the hidden state of u_i through the attention coefficients α_ij, as shown in equation (20):
x_i^(1) = σ( Σ_{j ∈ N(u_i)} α_ij · x_j^(0) )   (20),
where σ is the sigmoid activation function;
obtaining the final expression matrix r_i of each question node using one layer of fully-connected network, as shown in equation (21):
r_i = W_9·x_i^(1) + b_9   (21),
where W_9 and b_9 are trainable parameters.
10. The multi-modal fusion lie detection method according to claim 9, wherein the lie detection classification based on the graph neural network in step 2 specifically includes:
obtaining the classification result with a feedforward fully-connected neural network, as shown in formula (22):
[y_0, y_1] = softmax(W_10·r_i + b_10)   (22),
where y_0 and y_1 indicate whether or not deceptive behavior occurs for question u_i, and W_10 and b_10 are trainable parameters.
11. A multi-modal fusion lie detection method according to any of claims 1-10, used for lie detection in real court trial or interrogation situations.
CN202110118420.XA 2021-01-28 2021-01-28 Multi-mode fusion lie detection method Active CN112861945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118420.XA CN112861945B (en) 2021-01-28 2021-01-28 Multi-mode fusion lie detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110118420.XA CN112861945B (en) 2021-01-28 2021-01-28 Multi-mode fusion lie detection method

Publications (2)

Publication Number Publication Date
CN112861945A true CN112861945A (en) 2021-05-28
CN112861945B CN112861945B (en) 2022-05-13

Family

ID=75987679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118420.XA Active CN112861945B (en) 2021-01-28 2021-01-28 Multi-mode fusion lie detection method

Country Status (1)

Country Link
CN (1) CN112861945B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
US20200335092A1 (en) * 2019-04-20 2020-10-22 Behavioral Signal Technologies, Inc. Deep hierarchical fusion for machine intelligence applications
CN110301920A (en) * 2019-06-27 2019-10-08 清华大学 Multi-modal fusion method and device for psychological pressure detection
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DONGLI WANG ET AL.: "Human action recognition based on multi-mode spatial-temporal feature fusion", 22ND INTERNATIONAL CONFERENCE ON INFORMATION FUSION *
HU JUN: "Research on Representation Learning for Social Multimedia Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240056A (en) * 2021-07-12 2021-08-10 北京百度网讯科技有限公司 Multi-mode data joint learning model training method and device
WO2023036159A1 (en) * 2021-09-07 2023-03-16 Huawei Technologies Co., Ltd. Methods and devices for audio visual event localization based on dual perspective networks
CN113869276A (en) * 2021-10-15 2021-12-31 山东大学 Lie recognition method and system based on micro-expression
CN114881668A (en) * 2022-05-05 2022-08-09 灵图数据(杭州)有限公司 Multi-mode-based deception detection method
CN116522212A (en) * 2023-07-05 2023-08-01 清华大学 Lie detection method, device, equipment and medium based on image text fusion
CN116522212B (en) * 2023-07-05 2023-09-26 清华大学 Lie detection method, device, equipment and medium based on image text fusion

Also Published As

Publication number Publication date
CN112861945B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN112861945B (en) Multi-mode fusion lie detection method
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN109583501B (en) Method, device, equipment and medium for generating image classification and classification recognition model
Alkhulaifi et al. Knowledge distillation in deep learning and its applications
CN111597830A (en) Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN111368074A (en) Link prediction method based on network structure and text information
US20200134455A1 (en) Apparatus and method for training deep learning model
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN111582397A (en) CNN-RNN image emotion analysis method based on attention mechanism
CN111178319A (en) Video behavior identification method based on compression reward and punishment mechanism
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN111598118A (en) Visual question-answering task implementation method and system
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN113849599B (en) Joint false news detection method based on mode information and fact information
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN116206227B (en) Picture examination system and method for 5G rich media information, electronic equipment and medium
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN115620342A (en) Cross-modal pedestrian re-identification method, system and computer
CN114844682A (en) DGA domain name detection method and system
Rungta et al. A deep learning based approach to measure confidence for virtual interviews

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant