CN112861945A - Multi-mode fusion lie detection method - Google Patents

Multi-mode fusion lie detection method

Info

Publication number
CN112861945A
Authority
CN
China
Prior art keywords
matrix
feature
neural network
lie
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110118420.XA
Other languages
Chinese (zh)
Other versions
CN112861945B (en)
Inventor
冯铃
张慧君
曹檑
丁扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110118420.XA priority Critical patent/CN112861945B/en
Publication of CN112861945A publication Critical patent/CN112861945A/en
Application granted granted Critical
Publication of CN112861945B publication Critical patent/CN112861945B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a multi-modal fusion lie detection method. Text, video and audio feature matrices are first obtained with a long short-term memory recurrent neural network, a feedforward fully-connected neural network and a convolutional neural network. An inter-question association graph neural network is then built from the fused text, video and audio feature matrix of each question and the associations between the questions of each individual. Based on this graph neural network, an associated expression matrix is obtained for each question node, and finally a classification vector is obtained from each node's associated expression matrix to judge whether the individual lies on that question. The invention is suitable for question-level lie detection in real court trial or interrogation situations, at a finer granularity than current video-level lie detection.

Description

Multi-mode fusion lie detection method
Technical Field
The invention relates to the technical field of computers, in particular to a multi-mode fusion lie detection method.
Background
Lying is common in everyday life. Some lies are harmless and not malicious. However, other lies, such as those told in court or during criminal proceedings, are destructive and have adverse effects on justice and public security, and whether such statements are truthful bears directly on the fairness of the verdict. Therefore, accurate and effective deception detection can provide crucial support for evaluating the authenticity of testimony and reaching a judgment.
Existing lie detection for real trial or interrogation situations focuses only on the video level of the task: a single detection result is given for a video segment, regardless of whether the segment contains one or several interrogation questions. In a real trial, however, the authenticity of the answer to each individual question is needed, and the behavior of the same individual across successive answers is correlated; this correlation should be exploited to improve question-level lie detection.
As can be seen from the above, there is currently no question-level lie detection method or apparatus that considers the relevance between an individual's responses across different questions.
Object of the Invention
The invention aims to provide a multi-modal fusion lie detection method that addresses the above problems in the prior art. Specifically, for the question-level lie detection task, a multi-modal fusion lie detection method is provided that fuses information from the text, visual and audio modalities, learns the associations between different questions of the same individual through a graph neural network, and discovers the correlations in an individual's behavior across questions, thereby improving lie detection performance. The method mainly comprises two aspects: 1) text, visual and audio feature expression extraction and multi-modal fusion; 2) associative feature learning and lie detection classification based on a graph neural network.
Disclosure of Invention
The invention discloses a multi-mode fusion lie detection method, which comprises the following steps:
step 1, extracting lie-behavior-related feature expressions from text, vision and audio and performing multi-modal fusion, specifically comprising the following steps:
acquiring the lie-behavior-related feature expression matrix of the text based on a long short-term memory recurrent neural network and a feedforward fully-connected neural network;
acquiring the lie-behavior-related feature expression matrix of the vision based on a convolutional neural network and a long short-term memory recurrent neural network;
acquiring the lie-behavior-related feature expression matrix of the audio based on an audio-processing deep neural network;
acquiring a multi-modal fusion feature matrix based on the acquired text, visual and audio lie-behavior-related feature expression matrices;
step 2, performing associated feature learning and lie detection classification based on a graph neural network, specifically comprising the following steps:
acquiring a fully-connected representation graph based on the multi-modal fusion feature matrices acquired in step 1, with the multi-modal fusion feature matrix corresponding to each single-question behavior of each user taken as a node;
acquiring the expression matrix corresponding to each single-question behavior of each user based on the fully-connected representation graph and an attention mechanism;
and acquiring the classification result of whether the user lies based on the acquired expression matrix corresponding to each single-question behavior of each user and a feedforward fully-connected neural network.
Preferably, the obtaining of the lie-behavior related feature expression matrix of the text in step 1 specifically includes:
abstracting the text into a sentence containing n words, denoted w = {w_1, w_2, …, w_n}; each word in the text w is encoded into a 768-dimensional word vector by the pre-trained model BERT, giving the word-vector feature matrix X ∈ R^(n×768);
acquiring the sequential text information with a long short-term memory recurrent neural network (LSTM); specifically, the word-vector feature matrix X is input into an LSTM layer, the LSTM layer being bidirectional: the forward and backward LSTMs respectively generate a forward hidden state and a backward hidden state for each word vector x_i; the two hidden-state vectors are added to give h_i, and the n hidden states are concatenated to form the text representation matrix H, as shown in formula (1):
H = concat([h_1, h_2, …, h_n])   (1);
based on the attention mechanism, the importance weight of the word at each position in the text representation matrix H is obtained and assigned to the text representation matrix by element-wise multiplication with the importance-weight distribution matrix A_st, as shown in formulas (2) and (3):
A_st = softmax(W_1·H + b_1)   (2),
H′ = H × A_st + H   (3),
where A_st denotes the importance-weight distribution matrix, W_1 and b_1 are trainable parameter vectors, softmax is an activation function, and × denotes element-wise multiplication;
converting H′ from a two-dimensional matrix into a one-dimensional vector;
obtaining the text feature expression matrix F_T through a feedforward fully-connected neural network, as shown in formula (4):
F_T = ReLU(W_2·H′ + b_2)   (4),
where W_2 and b_2 are trainable parameter vectors and ReLU is an activation function.
preferably, n is 20.
Preferably, the acquiring a visual lie behavior related feature expression matrix in step 1 specifically includes:
abstracting the video into a sequence of T consecutive image frames, denoted V = {frame_1, frame_2, …, frame_T};
for each frame_i in V, acquiring a feature map M_i through a weight-shared pre-trained ResNet model, followed by a (2 × 2) average pooling operation to reduce the feature map;
flattening the feature map of each frame from 3 dimensions to 1 dimension and concatenating the T flattened feature maps into a matrix, as shown in formulas (5) and (6):
M′_i = flatten(M_i)   (5),
M′ = concat([M′_1, M′_2, …, M′_T])   (6),
where M′ denotes the feature map matrix of the video;
applying an attention mechanism to focus on the images in the sequence that contribute more to judging whether a lie occurs, and learning the temporal variation of the images in the video through a long short-term memory recurrent neural network (LSTM);
applying the attention mechanism to obtain the importance distribution of each feature map in the feature map matrix, and obtaining, by element-wise multiplication, the video feature map matrix M̂ with weights redistributed according to the importance distribution, as shown in formulas (7) and (8):
A_sv = softmax(W_3·M′ + b_3)   (7),
M̂ = M′ × A_sv + M′   (8),
where A_sv is the importance-weight distribution matrix, softmax is an activation function, and W_3 and b_3 are trainable parameter vectors;
inputting the weighted video feature map matrix M̂ into the LSTM, as shown in formula (9):
h_i = LSTM(M̂_i, h_{i-1})   (9),
where h_i and h_{i-1} respectively denote the hidden vectors of the i-th and (i-1)-th feature maps, and the last hidden vector h_T of the sequence is taken as the temporal feature expression of the video;
mapping h_T, through a feedforward fully-connected network, into the same dimensional space as F_T, as shown in formula (10):
F_V = ReLU(W_4·h_T + b_4)   (10),
where F_V is the video feature expression matrix, W_4 and b_4 are trainable parameter vectors, and ReLU is an activation function.
Preferably, T is 20.
Preferably, the obtaining of the lie-behavior related feature expression matrix of the audio in step 1 specifically includes:
abstracting the audio into a continuous sound-signal sequence, denoted A = {a_1, a_2, …, a_L};
extracting a feature expression from the raw sound signal with a pre-trained SincNet network to obtain D_A, as shown in formula (11):
D_A = SincNet(A)   (11);
mapping D_A, through a feedforward fully-connected neural network, into the same dimensional space as F_T, as shown in formula (12):
F_A = ReLU(W_5·D_A + b_5)   (12),
where F_A is the audio feature expression matrix, W_5 and b_5 are trainable parameter vectors, and ReLU is an activation function.
Preferably, in the sound-signal sequence A, L is 200.
Preferably, the obtaining the multi-modal fusion feature matrix in step 1 specifically includes:
concatenating the obtained text, video and audio feature expression matrices to obtain the multi-modal feature matrix F_M′, as shown in formula (13):
F_M′ = concat([F_T, F_V, F_A])   (13);
obtaining the importance contribution weights of the different features in the multi-modal feature matrix through an attention mechanism, and assigning the weights to the multi-modal feature matrix by element-wise multiplication to obtain the multi-modal fusion feature matrix F_M, as shown in formulas (14) and (15):
A_mm = softmax(W_6·F_M′ + b_6)   (14),
F_M = A_mm × F_M′ + F_M′   (15),
where A_mm is the importance-weight distribution matrix, softmax is an activation function, W_6 and b_6 are trainable parameter vectors, and F_M is the multi-modal fusion feature matrix.
Preferably, the learning of the associated features based on the graph neural network in step 2 specifically includes:
representing the set of user individuals as O = {o_1, o_2, …, o_m}, where each o_i is an individual and each individual contains the responses to the several questions it was asked;
defining the graph neural network as G(U, E), where U denotes the set of question nodes, each question node belonging to exactly one individual in O;
each question node is represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question; E denotes the set of edges, and if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1;
applying neighbour influence to the representation vector of each question u_i, i.e. the performance of the same individual on one question is influenced by that individual's responses to other questions, and the responses of the same individual to different questions are similar;
obtaining, through a fully-connected layer, the initial hidden-state vector x_i^(0) of any question u_i from its multi-modal fusion feature matrix F_M^i, as shown in equation (16):
x_i^(0) = W_7·F_M^i + b_7   (16),
where W_7 and b_7 are trainable parameters;
for each question pair (u_i, u_j) connected by an edge, concatenating the representation vectors x_i^(0) and x_j^(0) and obtaining the attention factor c_ij with a fully-connected layer, as shown in equation (17):
c_ij = W_8·[x_i^(0); x_j^(0)] + b_8   (17),
where W_8 and b_8 are trainable parameters;
for u_i ∈ o_g, obtaining the vector of neighbor attention coefficients of u_i, as shown in equation (18):
c_i = [c_i1, c_i2, …, c_i|o_g|]   (18),
where |o_g| is the number of neighbors adjacent to u_i;
obtaining, through the softmax function, the influence α_ij exerted by each neighbor node on question u_i, as shown in equation (19):
α_ij = exp(c_ij) / Σ_k exp(c_ik)   (19);
denoting the set of neighbors of u_i as N(u_i) and updating the hidden state of u_i through the attention coefficients α_ij, as shown in equation (20):
x_i^(1) = σ( Σ_{j ∈ N(u_i)} α_ij · x_j^(0) )   (20),
where σ is the sigmoid activation function;
obtaining the final expression matrix r_i of each question node using one layer of fully-connected network, as shown in equation (21):
r_i = W_9·x_i^(1) + b_9   (21),
where W_9 and b_9 are trainable parameters.
Preferably, the lie detection classification based on the graph neural network in step 2 specifically includes:
obtaining the classification result with a feedforward fully-connected neural network, as shown in formula (22):
[y_0, y_1] = softmax(W_10·r_i + b_10)   (22),
where y_0 and y_1 indicate whether or not deceptive behavior occurs for question u_i, and W_10 and b_10 are trainable parameters.
Preferably, the lie detection method is used for lie detection in a real court trial or interrogation situation.
Drawings
FIG. 1 is a diagram illustrating the extraction of lie-behavior-related feature expressions from text, vision and audio and their multi-modal fusion according to the present invention.
FIG. 2 is a flow chart of associative feature learning and lie detection classification based on graph neural networks according to the present invention.
Detailed Description
The question-level multi-modal fusion lie detection method according to the present invention is described in detail below with reference to the accompanying drawings.
For the question-level lie detection task, the invention provides a multi-modal fusion method that fuses information from the text, visual and audio modalities, performs association learning among different questions of the same individual through a graph neural network, and discovers the correlations in individual behavior across questions, thereby improving lie detection performance. The method mainly comprises the following two steps: 1) text, visual and audio feature expression extraction and multi-modal fusion; 2) associative feature learning and lie detection classification based on a graph neural network.
FIG. 1 illustrates the extraction of lie-behavior-related feature expressions from text, video and audio and their multi-modal fusion according to the present invention. As shown in the figure, the lie-behavior-related features of the text, video and audio are extracted into corresponding feature expression matrices, and the multi-modal fusion feature matrix is obtained after multi-modal fusion.
The related specific process is as follows:
feature expression extraction and multimodal fusion
The goal of this part is to obtain the feature expression matrices of the three modalities and their fusion.
1. Text feature expression extraction
The text is abstracted into a sentence containing n words, denoted w = {w_1, w_2, …, w_n}. Each word in the text w is encoded into a 768-dimensional word vector by the pre-trained model BERT, giving the word-vector feature matrix X ∈ R^(n×768). In the present invention, n is 20.
To learn lie-related text feature expressions from the semantic relations, a long short-term memory recurrent neural network (LSTM) is used to obtain the sequential text information. In addition, since different words in the sentence contribute differently to the detection result, an attention mechanism is used to focus on the more important words.
First, the word-vector feature matrix X is input into the LSTM layer. Using a bidirectional LSTM, the forward and backward LSTMs generate a forward hidden state and a backward hidden state for each word vector x_i; the two hidden-state vectors are added to give h_i, and the n hidden states are concatenated to form the text representation matrix H:
H = concat([h_1, h_2, …, h_n])
Based on the attention mechanism, the importance weight of the word at each position in the text representation matrix H is obtained and assigned to the text representation matrix by element-wise multiplication:
A_st = softmax(W_1·H + b_1)
H′ = H × A_st + H
where A_st is the importance-weight distribution matrix, W_1 and b_1 are trainable parameter vectors, softmax is an activation function, and × denotes element-wise multiplication. H′ is then converted from a two-dimensional matrix into a one-dimensional vector.
The text feature expression matrix F_T is obtained through a feedforward fully-connected neural network:
F_T = ReLU(W_2·H′ + b_2)
where W_2 and b_2 are trainable parameter vectors and ReLU is an activation function.
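As a concrete illustration of the text branch described above (a BiLSTM over BERT word vectors, attention reweighting of H, and a fully-connected projection to F_T), a minimal PyTorch-style sketch is given below. The module name, the hidden and output sizes, and the use of a single linear layer for the attention scores are illustrative assumptions; only n = 20 and the 768-dimensional BERT vectors come from the description.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text branch: BiLSTM + attention + fully-connected layer."""
    def __init__(self, bert_dim=768, hidden_dim=128, out_dim=64, n_words=20):
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.w1 = nn.Linear(hidden_dim, hidden_dim)          # attention scores A_st
        self.w2 = nn.Linear(n_words * hidden_dim, out_dim)   # maps flattened H' to F_T

    def forward(self, x):                     # x: (batch, n_words, 768) BERT word vectors
        h, _ = self.lstm(x)                   # (batch, n_words, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)
        h = fwd + bwd                         # sum of forward/backward states -> H
        a_st = torch.softmax(self.w1(h), dim=-1)
        h_prime = h * a_st + h                # element-wise reweighting with residual
        h_prime = h_prime.flatten(1)          # 2-D representation -> 1-D vector per sample
        return torch.relu(self.w2(h_prime))   # F_T
```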
2. Visual feature expression extraction
The video is abstracted into a sequence of T consecutive image frames, denoted V = {frame_1, frame_2, …, frame_T}. In the present invention, T is 20. First, for each frame_i in V, a feature map M_i is acquired through a weight-shared pre-trained ResNet model and then reduced by a (2 × 2) average pooling operation.
The feature map of each frame is flattened from 3 dimensions to 1 dimension, and the T flattened feature maps are concatenated into a matrix:
M′_i = flatten(M_i)
M′ = concat([M′_1, M′_2, …, M′_T])
where M′ denotes the feature map matrix of the video.
To model the temporal information in the video, an attention mechanism is applied to focus on the images in the sequence that contribute more to judging whether a lie occurs, and the temporal variation of the images is learned through the long short-term memory recurrent neural network (LSTM).
Because the feature map matrix M′ of the video contains the feature map corresponding to each image, the attention mechanism is first applied to obtain the importance distribution of each feature map in the matrix, and the video feature map matrix M̂ with weights redistributed according to the importance distribution is obtained by element-wise multiplication:
A_sv = softmax(W_3·M′ + b_3)
M̂ = M′ × A_sv + M′
where A_sv is the importance-weight distribution matrix, softmax is an activation function, and W_3 and b_3 are trainable parameter vectors.
The weighted video feature map matrix M̂ is then input into the LSTM:
h_i = LSTM(M̂_i, h_{i-1})
where h_i and h_{i-1} respectively denote the hidden vectors of the i-th and (i-1)-th feature maps; the last hidden vector h_T of the sequence is taken as the temporal feature expression of the video.
Finally, h_T is mapped through a feedforward fully-connected network into the same dimensional space as F_T:
F_V = ReLU(W_4·h_T + b_4)
where F_V is the video feature expression matrix, W_4 and b_4 are trainable parameter vectors, and ReLU is an activation function.
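The visual branch can be sketched in the same spirit: a weight-shared ResNet produces per-frame features, which are flattened, reweighted by attention, passed through an LSTM over time, and projected by a fully-connected layer. Using torchvision's resnet18 and relying on ResNet's own global average pooling in place of the (2 × 2) average pooling are assumptions made for brevity, as are all dimensions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VideoEncoder(nn.Module):
    """Sketch of the visual branch: ResNet features + attention + LSTM + fully-connected layer."""
    def __init__(self, feat_dim=512, hidden_dim=128, out_dim=64):
        super().__init__()
        backbone = resnet18(weights="DEFAULT")
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # weight-shared feature extractor
        self.w3 = nn.Linear(feat_dim, feat_dim)     # attention over the frame feature maps
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.w4 = nn.Linear(hidden_dim, out_dim)    # projection into the same space as F_T

    def forward(self, frames):                # frames: (batch, T, 3, H, W)
        b, t = frames.shape[:2]
        m = self.cnn(frames.flatten(0, 1))    # per-frame feature maps, pooled by the backbone
        m = m.flatten(1).view(b, t, -1)       # flatten to 1-D and stack into M'
        a_sv = torch.softmax(self.w3(m), dim=-1)
        m_hat = m * a_sv + m                  # reweighted feature map matrix
        h, _ = self.lstm(m_hat)               # temporal modelling of the frame sequence
        return torch.relu(self.w4(h[:, -1]))  # F_V from the last hidden state h_T
```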
3. Audio feature expression extraction
The audio is abstracted into a continuous sound-signal sequence, denoted A = {a_1, a_2, …, a_L}. In the present invention, L is 200. A feature expression D_A is extracted from the raw sound signal with a pre-trained SincNet network.
D_A is then mapped through a feedforward fully-connected neural network into the same dimensional space as F_T:
F_A = ReLU(W_5·D_A + b_5)
where F_A is the audio feature expression matrix, W_5 and b_5 are trainable parameter vectors, and ReLU is an activation function.
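Since SincNet is used here only as a fixed, pre-trained feature extractor, a sketch of this branch reduces to the fully-connected mapping of formula (12). The feature size sincnet_dim and the output size are assumptions; the SincNet module itself is treated as a given pre-trained network and is not reproduced.

```python
import torch
import torch.nn as nn

class AudioHead(nn.Module):
    """Sketch of the audio branch: map the pre-trained SincNet feature D_A into the F_T space."""
    def __init__(self, sincnet_dim=256, out_dim=64):
        super().__init__()
        self.w5 = nn.Linear(sincnet_dim, out_dim)

    def forward(self, d_a):                  # d_a: (batch, sincnet_dim), output of SincNet
        return torch.relu(self.w5(d_a))      # F_A = ReLU(W_5 D_A + b_5)
```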
4. Feature expression matrix fusion
The text, video and audio feature expression matrices obtained above are concatenated to obtain the multi-modal feature matrix F_M′:
F_M′ = concat([F_T, F_V, F_A])
The importance contribution weights of the different features in the multi-modal feature matrix are obtained through an attention mechanism and assigned to the multi-modal feature matrix by element-wise multiplication, giving the multi-modal fusion feature matrix F_M:
A_mm = softmax(W_6·F_M′ + b_6)
F_M = A_mm × F_M′ + F_M′
where A_mm is the importance-weight distribution matrix, softmax is an activation function, W_6 and b_6 are trainable parameter vectors, and F_M is the multi-modal fusion feature matrix.
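A sketch of this fusion step, assuming the three branch outputs have already been produced and concatenation happens along the feature dimension (the 64-dimensional branch outputs are an assumption carried over from the sketches above):

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of the fusion step: concatenate F_T, F_V, F_A, reweight with attention, add residual."""
    def __init__(self, dim=3 * 64):          # assumes each branch outputs a 64-dimensional vector
        super().__init__()
        self.w6 = nn.Linear(dim, dim)

    def forward(self, f_t, f_v, f_a):
        f_m_prime = torch.cat([f_t, f_v, f_a], dim=-1)    # multi-modal feature matrix F_M'
        a_mm = torch.softmax(self.w6(f_m_prime), dim=-1)  # importance weights A_mm
        return a_mm * f_m_prime + f_m_prime               # multi-modal fusion feature matrix F_M
```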
FIG. 2 is a flowchart of the associated feature learning and lie detection classification based on the graph neural network according to the present invention. As shown in the figure, associated feature learning and lie detection classification are performed on the obtained multi-modal fusion feature matrices to determine whether an individual lies. The process is described in detail below.
associative feature learning and lie detection classification based on graph neural networks
The aim of this part is to learn the relevance of the behaviors corresponding to different questions by constructing a graph neural network and to classify whether the individual lies when answering each question.
1. Associative feature learning based on graph neural networks
First, the construction of the question graph is described. The set of user individuals is denoted O = {o_1, o_2, …, o_m}, where each o_i is an individual and each individual may contain responses to several asked questions. The graph is defined as G(U, E), where U denotes the set of question nodes; each question node belongs to one individual in O and is represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question. E denotes the set of edges: if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1.
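As a small illustration of this graph construction, the helper below builds the binary edge matrix e_ij from a list assigning each question node to its individual; the function name, the tensor representation of the graph, and the exclusion of self-loops are assumptions made for the sketch.

```python
import torch

def build_adjacency(owner_ids):
    """Build e_ij = 1 when questions i and j belong to the same individual."""
    owners = torch.tensor(owner_ids)                      # e.g. [0, 0, 0, 1, 1] for two individuals
    adj = (owners.unsqueeze(0) == owners.unsqueeze(1)).float()
    adj.fill_diagonal_(0)                                 # neighbours of u_i are the *other* questions
    return adj
```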
Neighbour influence is applied to the representation vector of each question u_i: the performance of an individual on a given question is influenced by that individual's responses to the other questions, and the responses of the same individual to different questions are similar. Since different neighbouring questions affect the response to u_i to different degrees, an attention mechanism is employed at the level of adjacent questions.
First, through a fully-connected layer, the initial hidden-state vector x_i^(0) of any question u_i is obtained from its multi-modal fusion feature matrix F_M^i:
x_i^(0) = W_7·F_M^i + b_7
where W_7 and b_7 are trainable parameters.
For each question pair (u_i, u_j) connected by an edge, the representation vectors x_i^(0) and x_j^(0) are concatenated, and an attention factor c_ij is obtained with a fully-connected layer:
c_ij = W_8·[x_i^(0); x_j^(0)] + b_8
where W_8 and b_8 are trainable parameters.
For u_i ∈ o_g, the vector of neighbor attention coefficients of u_i is obtained:
c_i = [c_i1, c_i2, …, c_i|o_g|]
where |o_g| is the number of neighbors adjacent to u_i. The influence α_ij exerted by each neighbor node on question u_i is then obtained through the softmax function:
α_ij = exp(c_ij) / Σ_k exp(c_ik)
Denoting the set of neighbors of u_i as N(u_i), the hidden state of u_i is updated through the attention coefficients α_ij:
x_i^(1) = σ( Σ_{j ∈ N(u_i)} α_ij · x_j^(0) )
where σ is the sigmoid activation function.
Finally, one layer of fully-connected network is used to obtain the final expression matrix r_i of each question node:
r_i = W_9·x_i^(1) + b_9
where W_9 and b_9 are trainable parameters.
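Assuming the reconstructed forms of the update above (an initial fully-connected projection, pairwise attention factors, a softmax over neighbours, a sigmoid-gated weighted sum and a final projection), the whole question-association layer can be sketched as follows; all module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class QuestionGraphLayer(nn.Module):
    """Sketch of the question-association layer operating on the fused node features."""
    def __init__(self, in_dim=3 * 64, hid_dim=64, out_dim=64):
        super().__init__()
        self.init_fc = nn.Linear(in_dim, hid_dim)    # initial hidden state x_i^(0)
        self.attn_fc = nn.Linear(2 * hid_dim, 1)     # attention factor c_ij from concatenated pairs
        self.out_fc = nn.Linear(hid_dim, out_dim)    # final node expression r_i

    def forward(self, f_m, adj):
        # f_m: (N, in_dim) fused features of the N question nodes
        # adj: (N, N) binary matrix, adj[i, j] = 1 iff u_i and u_j share an individual
        x0 = self.init_fc(f_m)
        n = x0.size(0)
        pairs = torch.cat([x0.unsqueeze(1).expand(n, n, -1),
                           x0.unsqueeze(0).expand(n, n, -1)], dim=-1)
        c = self.attn_fc(pairs).squeeze(-1)                 # c[i, j] for every node pair
        c = c.masked_fill(adj == 0, float("-inf"))          # keep only edges of the same individual
        alpha = torch.softmax(c, dim=-1)                    # neighbour influences alpha_ij
        alpha = torch.nan_to_num(alpha)                     # nodes without neighbours get zero weight
        x1 = torch.sigmoid(alpha @ x0)                      # aggregated, sigmoid-gated hidden state
        return self.out_fc(x1)                              # final expression r_i per question node
```

In use, this layer would receive the outputs of the fusion step for all questions of the individuals in a batch, together with the adjacency matrix from build_adjacency above.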
2. Lie detection classification
A feedforward fully-connected neural network is used to obtain the classification result:
[y_0, y_1] = softmax(W_10·r_i + b_10)
where y_0 and y_1 indicate whether or not deceptive behavior occurs for question u_i, and W_10 and b_10 are trainable parameters.
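The classification head is a single fully-connected layer with a softmax over the two classes; in training it would typically be paired with a cross-entropy loss over the per-question lie labels, which is an assumption since the patent does not specify the loss.

```python
import torch
import torch.nn as nn

class LieClassifier(nn.Module):
    """Sketch of the classification head: one fully-connected layer with softmax over two classes."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2)

    def forward(self, r):                        # r: (N, in_dim) node expressions from the graph layer
        return torch.softmax(self.fc(r), dim=-1) # [y_0, y_1] per question (truthful / deceptive)
```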
Advantages of the Invention
The invention redefines the existing lie detection task at the question level by constructing a question-association graph neural network, takes into account the relevance between the performances of the same individual when answering different questions, and compensates for the limitation of single-data-source lie detection through multi-modal fusion.

Claims (11)

1. A multi-modal fusion lie detection method is characterized by comprising the following steps:
step 1, extracting lie-behavior-related feature expressions from text, vision and audio and performing multi-modal fusion, specifically comprising the following steps:
acquiring the lie-behavior-related feature expression matrix of the text based on a long short-term memory recurrent neural network and a feedforward fully-connected neural network;
acquiring the lie-behavior-related feature expression matrix of the vision based on a convolutional neural network and a long short-term memory recurrent neural network;
acquiring the lie-behavior-related feature expression matrix of the audio based on an audio-processing deep neural network;
acquiring a multi-modal fusion feature matrix based on the acquired text, visual and audio lie-behavior-related feature expression matrices;
step 2, performing associated feature learning and lie detection classification based on a graph neural network, specifically comprising the following steps:
acquiring a fully-connected representation graph based on the multi-modal fusion feature matrices acquired in step 1, with the multi-modal fusion feature matrix corresponding to each single-question behavior of each user taken as a node;
acquiring the expression matrix corresponding to each single-question behavior of each user based on the fully-connected representation graph and an attention mechanism;
and acquiring the classification result of whether the user lies based on the acquired expression matrix corresponding to each single-question behavior of each user and a feedforward fully-connected neural network.
2. The multi-modal fusion lie detection method according to claim 1, wherein the obtaining of the lie behavior related feature expression matrix of the text in step 1 specifically comprises:
abstracting the text into a sentence containing n words, denoted w = {w_1, w_2, …, w_n}; each word in the text w is encoded into a 768-dimensional word vector by the pre-trained model BERT, giving the word-vector feature matrix X ∈ R^(n×768);
acquiring the sequential text information with a long short-term memory recurrent neural network (LSTM); specifically, the word-vector feature matrix X is input into an LSTM layer, the LSTM layer being bidirectional: the forward and backward LSTMs respectively generate a forward hidden state and a backward hidden state for each word vector x_i; the two hidden-state vectors are added to give h_i, and the n hidden states are concatenated to form the text representation matrix H, as shown in formula (1):
H = concat([h_1, h_2, …, h_n])   (1);
based on the attention mechanism, the importance weight of the word at each position in the text representation matrix H is obtained and assigned to the text representation matrix by element-wise multiplication with the importance-weight distribution matrix A_st, as shown in formulas (2) and (3):
A_st = softmax(W_1·H + b_1)   (2),
H′ = H × A_st + H   (3),
where A_st denotes the importance-weight distribution matrix, W_1 and b_1 are trainable parameter vectors, softmax is an activation function, and × denotes element-wise multiplication;
converting H′ from a two-dimensional matrix into a one-dimensional vector;
obtaining the text feature expression matrix F_T through a feedforward fully-connected neural network, as shown in formula (4):
F_T = ReLU(W_2·H′ + b_2)   (4),
where W_2 and b_2 are trainable parameter vectors and ReLU is an activation function.
3. A multi-modal fusion lie detection method according to claim 2, wherein n is 20.
4. The multi-modal fusion lie detection method according to claim 3, wherein the obtaining of the visual lie behavior related feature expression matrix in step 1 specifically comprises:
abstracting the video into a sequence of T consecutive image frames, denoted V = {frame_1, frame_2, …, frame_T};
for each frame_i in V, acquiring a feature map M_i through a weight-shared pre-trained ResNet model, followed by a (2 × 2) average pooling operation to reduce the feature map;
flattening the feature map of each frame from 3 dimensions to 1 dimension and concatenating the T flattened feature maps into a matrix, as shown in formulas (5) and (6):
M′_i = flatten(M_i)   (5),
M′ = concat([M′_1, M′_2, …, M′_T])   (6),
where M′ denotes the feature map matrix of the video;
applying an attention mechanism to focus on the images in the sequence that contribute more to judging whether a lie occurs, and learning the temporal variation of the images in the video through a long short-term memory recurrent neural network (LSTM);
applying the attention mechanism to obtain the importance distribution of each feature map in the feature map matrix, and obtaining, by element-wise multiplication, the video feature map matrix M̂ with weights redistributed according to the importance distribution, as shown in formulas (7) and (8):
A_sv = softmax(W_3·M′ + b_3)   (7),
M̂ = M′ × A_sv + M′   (8),
where A_sv is the importance-weight distribution matrix, softmax is an activation function, and W_3 and b_3 are trainable parameter vectors;
inputting the weighted video feature map matrix M̂ into the LSTM, as shown in formula (9):
h_i = LSTM(M̂_i, h_{i-1})   (9),
where h_i and h_{i-1} respectively denote the hidden vectors of the i-th and (i-1)-th feature maps, and the last hidden vector h_T of the sequence is taken as the temporal feature expression of the video;
mapping h_T, through a feedforward fully-connected network, into the same dimensional space as F_T, as shown in formula (10):
F_V = ReLU(W_4·h_T + b_4)   (10),
where F_V is the video feature expression matrix, W_4 and b_4 are trainable parameter vectors, and ReLU is an activation function.
5. The method of claim 4, wherein T is 20.
6. The multi-modal fusion lie detection method according to claim 5, wherein the obtaining of the lie behavior related feature expression matrix of the audio in step 1 specifically comprises:
abstracting the audio into a continuous sound-signal sequence, denoted A = {a_1, a_2, …, a_L};
extracting a feature expression from the raw sound signal with a pre-trained SincNet network to obtain D_A, as shown in formula (11):
D_A = SincNet(A)   (11);
mapping D_A, through a feedforward fully-connected neural network, into the same dimensional space as F_T, as shown in formula (12):
F_A = ReLU(W_5·D_A + b_5)   (12),
where F_A is the audio feature expression matrix, W_5 and b_5 are trainable parameter vectors, and ReLU is an activation function.
7. The method of claim 6, wherein L is 200 in the sound signal sequence A.
8. The method for detecting the multi-modal fusion lie according to claim 7, wherein the obtaining the multi-modal fusion feature matrix in step 1 specifically includes:
concatenating the obtained text, video and audio feature expression matrices to obtain the multi-modal feature matrix F_M′, as shown in formula (13):
F_M′ = concat([F_T, F_V, F_A])   (13);
obtaining the importance contribution weights of the different features in the multi-modal feature matrix through an attention mechanism, and assigning the weights to the multi-modal feature matrix by element-wise multiplication to obtain the multi-modal fusion feature matrix F_M, as shown in formulas (14) and (15):
A_mm = softmax(W_6·F_M′ + b_6)   (14),
F_M = A_mm × F_M′ + F_M′   (15),
where A_mm is the importance-weight distribution matrix, softmax is an activation function, W_6 and b_6 are trainable parameter vectors, and F_M is the multi-modal fusion feature matrix.
9. The multi-modal fusion lie detection method according to claim 8, wherein the performing associated feature learning based on the graph neural network in step 2 specifically comprises:
representing the set of user individuals as O = {o_1, o_2, …, o_m}, where each o_i is an individual and each individual contains the responses to the several questions it was asked;
defining the graph neural network as G(U, E), where U denotes the set of question nodes, each question node belonging to exactly one individual in O;
each question node is represented by the multi-modal fusion feature matrix obtained from the behavior corresponding to that question; E denotes the set of edges, and if any two nodes in U satisfy u_i ∈ o_g and u_j ∈ o_g (o_g ∈ O), then e_ij = 1 and e_ji = 1;
applying neighbour influence to the representation vector of each question u_i, i.e. the performance of the same individual on one question is influenced by that individual's responses to other questions, and the responses of the same individual to different questions are similar;
obtaining, through a fully-connected layer, the initial hidden-state vector x_i^(0) of any question u_i from its multi-modal fusion feature matrix F_M^i, as shown in equation (16):
x_i^(0) = W_7·F_M^i + b_7   (16),
where W_7 and b_7 are trainable parameters;
for each question pair (u_i, u_j) connected by an edge, concatenating the representation vectors x_i^(0) and x_j^(0) and obtaining the attention factor c_ij with a fully-connected layer, as shown in equation (17):
c_ij = W_8·[x_i^(0); x_j^(0)] + b_8   (17),
where W_8 and b_8 are trainable parameters;
for u_i ∈ o_g, obtaining the vector of neighbor attention coefficients of u_i, as shown in equation (18):
c_i = [c_i1, c_i2, …, c_i|o_g|]   (18),
where |o_g| is the number of neighbors adjacent to u_i;
obtaining, through the softmax function, the influence α_ij exerted by each neighbor node on question u_i, as shown in equation (19):
α_ij = exp(c_ij) / Σ_k exp(c_ik)   (19);
denoting the set of neighbors of u_i as N(u_i) and updating the hidden state of u_i through the attention coefficients α_ij, as shown in equation (20):
x_i^(1) = σ( Σ_{j ∈ N(u_i)} α_ij · x_j^(0) )   (20),
where σ is the sigmoid activation function;
obtaining the final expression matrix r_i of each question node using one layer of fully-connected network, as shown in equation (21):
r_i = W_9·x_i^(1) + b_9   (21),
where W_9 and b_9 are trainable parameters.
10. The multi-modal fusion lie detection method according to claim 9, wherein the lie detection classification based on the graph neural network in step 2 specifically includes:
obtaining the classification result with a feedforward fully-connected neural network, as shown in formula (22):
[y_0, y_1] = softmax(W_10·r_i + b_10)   (22),
where y_0 and y_1 indicate whether or not deceptive behavior occurs for question u_i, and W_10 and b_10 are trainable parameters.
11. A multi-modal fusion lie detection method according to any of claims 1-10, used for lie detection in real court trial or interrogation situations.
CN202110118420.XA 2021-01-28 2021-01-28 Multi-mode fusion lie detection method Active CN112861945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118420.XA CN112861945B (en) 2021-01-28 2021-01-28 Multi-mode fusion lie detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110118420.XA CN112861945B (en) 2021-01-28 2021-01-28 Multi-mode fusion lie detection method

Publications (2)

Publication Number Publication Date
CN112861945A true CN112861945A (en) 2021-05-28
CN112861945B CN112861945B (en) 2022-05-13

Family

ID=75987679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118420.XA Active CN112861945B (en) 2021-01-28 2021-01-28 Multi-mode fusion lie detection method

Country Status (1)

Country Link
CN (1) CN112861945B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
US20200335092A1 (en) * 2019-04-20 2020-10-22 Behavioral Signal Technologies, Inc. Deep hierarchical fusion for machine intelligence applications
CN110301920A (en) * 2019-06-27 2019-10-08 清华大学 Multi-modal fusion method and device for psychological pressure detection
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DONGLI WANG ET AL.: "Human action recognition based on multi-mode spatial-temporal feature fusion", 22ND INTERNATIONAL CONFERENCE ON INFORMATION FUSION *
HU JUN: "Research on Representation Learning for Social Multimedia Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240056A (en) * 2021-07-12 2021-08-10 北京百度网讯科技有限公司 Multi-mode data joint learning model training method and device
WO2023036159A1 (en) * 2021-09-07 2023-03-16 Huawei Technologies Co., Ltd. Methods and devices for audio visual event localization based on dual perspective networks
CN113869276A (en) * 2021-10-15 2021-12-31 山东大学 Lie recognition method and system based on micro-expression
CN114881668A (en) * 2022-05-05 2022-08-09 灵图数据(杭州)有限公司 Multi-mode-based deception detection method
CN116522212A (en) * 2023-07-05 2023-08-01 清华大学 Lie detection method, device, equipment and medium based on image text fusion
CN116522212B (en) * 2023-07-05 2023-09-26 清华大学 Lie detection method, device, equipment and medium based on image text fusion

Also Published As

Publication number Publication date
CN112861945B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN112861945B (en) Multi-mode fusion lie detection method
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN109583501B (en) Method, device, equipment and medium for generating image classification and classification recognition model
Alkhulaifi et al. Knowledge distillation in deep learning and its applications
CN111597830A (en) Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN111368074A (en) Link prediction method based on network structure and text information
US20200134455A1 (en) Apparatus and method for training deep learning model
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN111582397A (en) CNN-RNN image emotion analysis method based on attention mechanism
CN111178319A (en) Video behavior identification method based on compression reward and punishment mechanism
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN111598118A (en) Visual question-answering task implementation method and system
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN113849599B (en) Joint false news detection method based on mode information and fact information
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN116206227B (en) Picture examination system and method for 5G rich media information, electronic equipment and medium
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN115620342A (en) Cross-modal pedestrian re-identification method, system and computer
CN114844682A (en) DGA domain name detection method and system
Rungta et al. A deep learning based approach to measure confidence for virtual interviews

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant