CN116522212A - Lie detection method, device, equipment and medium based on image text fusion - Google Patents
- Publication number
- CN116522212A (application number CN202310815657.2A)
- Authority
- CN
- China
- Prior art keywords: model, text, trained, features, fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/24—Pattern recognition: classification techniques
- G06F18/25—Pattern recognition: fusion techniques
- G06F16/3344—Information retrieval: query execution using natural language analysis
- G06F16/35—Information retrieval of unstructured textual data: clustering; classification
- G06F16/75—Information retrieval of video data: clustering; classification
- G06F16/783—Video retrieval using metadata automatically derived from the content
- G06F16/7834—Video retrieval using metadata automatically derived from the content, using audio features
- G06N3/045—Neural networks: combinations of networks
- G06N3/0455—Auto-encoder networks; encoder-decoder networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/44—Local feature extraction by analysis of parts of the pattern
- G06V10/764—Image or video recognition using classification
- G06V10/82—Image or video recognition using neural networks
Abstract
The disclosure relates to a lie detection method, device, equipment and medium based on image-text fusion. The lie detection method comprises the following steps: performing frame sampling and speech-to-text conversion on video data of a user to be tested to obtain a plurality of video frame images and a text; performing feature extraction on the plurality of video frame images based on a pre-trained spatio-temporal Transformer model to obtain visual image features containing a spatio-temporal fusion dimension; performing feature extraction on the text based on a pre-trained text feature extraction model to obtain text features; fusing the visual image features and the text features based on a pre-trained feature fusion model to obtain fusion features; and inputting the fusion features into a pre-trained classification model and outputting a lie detection result for the user to be tested. The method helps to improve the accuracy of lie detection and, compared with schemes using three or more modalities, effectively reduces the complexity of the lie detection process.
Description
Technical Field
The disclosure relates to the technical fields of artificial intelligence and data processing, and in particular to a lie detection method, device, equipment and medium based on image-text fusion.
Background
In forensic interrogation, psychological research, criminal investigation and other fields, lie detection and analysis techniques are often required for support. Among lie detection technologies in the related art, some perform detection based on physiological signals, for example, determining whether a speaker is lying based on heart rate, brain waves, breathing state, electrodermal activity and the like. Some solutions perform lie detection by recognizing lie-related features in the speaker's voice. Still other solutions are multi-modal, combining three or more types of modal data for lie detection.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following technical problems. Since a person's physiological responses are affected by many different factors that are not necessarily related to deception, the accuracy of lie detection based on physiological signals is relatively low; such techniques may also raise ethical and legal concerns regarding personal privacy. Lie detection techniques based on voice signals likewise have limitations and uncertainties, and their accuracy still needs improvement: a speaker may be affected by many factors such as emotion and environmental noise, which can interfere with voice signal characteristics; in addition, some subjects may take deceptive measures, such as intentionally changing their speaking style or using a voice changer, leading to inaccurate detection results. Multi-modal detection techniques can achieve relatively higher accuracy; however, they require processing three or more modalities and are comparatively complex. Therefore, how to reduce detection complexity while ensuring the required detection accuracy is a technical problem to be solved.
Disclosure of Invention
To solve the above technical problems, or at least partially solve them, embodiments of the present disclosure provide a lie detection method, apparatus, device and medium based on image-text fusion.
In a first aspect, embodiments of the present disclosure provide a lie detection method based on image-text fusion. The lie detection method includes: performing frame sampling and speech-to-text conversion on video data of a user to be tested to obtain a plurality of video frame images and a text; performing feature extraction on the plurality of video frame images based on a pre-trained spatio-temporal Transformer model to obtain visual image features containing a spatio-temporal fusion dimension; performing feature extraction on the text based on a pre-trained text feature extraction model to obtain text features; fusing the visual image features and the text features based on a pre-trained feature fusion model to obtain fusion features; and inputting the fusion features into a pre-trained classification model and outputting a lie detection result for the user to be tested.
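The five steps of the first aspect can be sketched as a simple pipeline. All callables below (`sample_frames`, `speech_to_text`, and the four models) are hypothetical placeholders for the pre-trained components, not the patent's actual implementations:

```python
def detect_lie(video, *, sample_frames, speech_to_text,
               visual_model, text_model, fusion_model, classifier):
    """Minimal sketch of the five-step method; every callable is a
    hypothetical stand-in for the corresponding pre-trained model."""
    frames = sample_frames(video)         # step 1a: frame sampling
    text = speech_to_text(video)          # step 1b: speech-to-text conversion
    v_feat = visual_model(frames)         # step 2: spatio-temporal visual features
    t_feat = text_model(text)             # step 3: text features
    fused = fusion_model(v_feat, t_feat)  # step 4: cross-modal feature fusion
    return classifier(fused)              # step 5: truth/lie classification
```

The two feature-extraction branches are independent, so steps 2 and 3 could also run in parallel before fusion.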
In some embodiments, the above spatio-temporal Transformer model comprises at least one layer of spatio-temporal association learning network, each layer comprising a temporal attention network and a spatial attention network connected to each other. In each layer of the spatio-temporal association learning network, the output of the temporal attention network serves as the input of the spatial attention network, or the output of the spatial attention network serves as the input of the temporal attention network.
In some embodiments, the above spatio-temporal Transformer model further comprises a feed-forward neural network, wherein the output of the temporal attention network and the output of the spatial attention network together serve as the input of the feed-forward neural network, and the output of the feed-forward neural network serves as the output of the spatio-temporal Transformer model.
In some embodiments, performing feature extraction on the plurality of video frame images based on the pre-trained spatio-temporal Transformer model to obtain visual image features containing a spatio-temporal fusion dimension includes: dividing each video frame image into a plurality of non-overlapping image blocks; inputting the image blocks of the plurality of video frame images into the spatio-temporal Transformer model; identifying, by the pre-trained temporal attention network, the association between image blocks of different video frame images; and identifying, by the pre-trained spatial attention network, the association between the plurality of image blocks within the same video frame image.
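The non-overlapping block division can be illustrated with a simple array reshape. The patch size and the assumption that the frame dimensions divide evenly are illustrative choices of this sketch, not requirements stated in the patent:

```python
import numpy as np

def split_into_patches(frame, patch):
    """Split an (H, W, C) frame into non-overlapping patch x patch blocks.
    Assumes H and W are divisible by the patch size, for brevity."""
    H, W, C = frame.shape
    blocks = frame.reshape(H // patch, patch, W // patch, patch, C)
    # reorder so each block is contiguous: (num_blocks, patch, patch, C)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, C)
```

The resulting block sequence (one token per block, per frame) is what the temporal and spatial attention networks operate over.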
In some embodiments, the feature fusion model includes a cross-modal interaction learning Transformer model and a self-attention model. Fusing the visual image features and the text features based on the pre-trained feature fusion model to obtain fusion features includes: performing cross-modal interaction learning between the visual image features and the text features by means of a multi-head self-attention mechanism based on the cross-modal interaction learning Transformer model, to obtain enhanced visual image features and enhanced text features; performing cross-modal feature concatenation on the enhanced visual image features and the enhanced text features to obtain concatenated image-text features; and performing deep fusion learning on the concatenated image-text features based on the self-attention model to obtain the fusion features.
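The fusion stage can be sketched with plain scaled dot-product attention. This sketch uses a single head with no learned projections (the patent specifies multi-head attention), and the residual connections are an assumption, not something the text states:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # scaled dot-product attention: queries from one modality attend to the
    # other (single head, no learned projections, for brevity)
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def fuse(visual, text):
    # cross-modal interaction learning: each modality is enhanced by
    # attending to the other (residuals are an assumption of this sketch)
    enh_visual = visual + cross_attention(visual, text)
    enh_text = text + cross_attention(text, visual)
    # cross-modal concatenation, then deep fusion via self-attention
    joint = np.concatenate([enh_text, enh_visual], axis=0)
    return cross_attention(joint, joint)  # self-attention: query == context
```

With 5 visual tokens and 3 text tokens of dimension 8, the fused output is an (8, 8) array of jointly attended features.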
In some embodiments, performing cross-modal feature concatenation on the enhanced visual image features and the enhanced text features to obtain concatenated image-text features includes: concatenating the first enhanced text feature, corresponding to the first morpheme in the enhanced text features, with the enhanced visual image features to obtain the concatenated image-text features. Alternatively, in other embodiments, it includes: sequentially inputting the enhanced text features into a normalization layer and a feed-forward neural network to obtain target enhanced text features, and sequentially inputting the enhanced visual image features into the normalization layer and the feed-forward neural network to obtain target enhanced visual image features; and concatenating the first target enhanced text feature, corresponding to the first morpheme in the target enhanced text features, with the target enhanced visual image features to obtain the concatenated image-text features.
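The first-morpheme concatenation can be pictured as taking a CLS-like summary token from the text side. Treating the first token as a text summary and flattening the visual features are assumptions of this sketch:

```python
import numpy as np

def stitch_first_morpheme(enh_text, enh_visual):
    # the first morpheme's enhanced feature acts as a CLS-like text summary
    first_token = enh_text[0]
    # concatenate it with the (flattened) enhanced visual image features
    return np.concatenate([first_token, enh_visual.reshape(-1)])
```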
In some embodiments, the lie detection method further includes: performing frame sampling and speech-to-text conversion on video sample data to obtain a plurality of video frame image samples and a text sample; performing feature extraction on the plurality of video frame image samples based on a spatio-temporal Transformer model to be trained, to obtain visual image sample features containing a spatio-temporal fusion dimension; performing feature extraction on the text sample based on a text feature extraction model to be trained, to obtain text sample features; fusing the visual image sample features and the text sample features based on a feature fusion model to be trained, to obtain sample fusion features; inputting the sample fusion features into a classification model to be trained and outputting a sample lie detection result; and synchronously training the spatio-temporal Transformer model, the text feature extraction model, the feature fusion model and the classification model to be trained, using the real result of the video sample data as a training label and targeting the reduction of the difference between the real result and the output sample lie detection result, to obtain the pre-trained spatio-temporal Transformer model, text feature extraction model, feature fusion model and classification model.
In a second aspect, embodiments of the present disclosure provide a method of constructing a lie detection model. The method includes: performing frame sampling and speech-to-text conversion on video sample data to obtain a plurality of video frame image samples and a text sample; performing feature extraction on the plurality of video frame image samples based on a spatio-temporal Transformer model to be trained, to obtain visual image sample features containing a spatio-temporal fusion dimension; performing feature extraction on the text sample based on a text feature extraction model to be trained, to obtain text sample features; fusing the visual image sample features and the text sample features based on a feature fusion model to be trained, to obtain sample fusion features; inputting the sample fusion features into a classification model to be trained and outputting a sample lie detection result; and synchronously training the spatio-temporal Transformer model, text feature extraction model, feature fusion model and classification model to be trained, using the real result of the video sample data as a training label and targeting the reduction of the difference between the real result and the output sample lie detection result. The resulting pre-trained spatio-temporal Transformer model, text feature extraction model, feature fusion model and classification model form the lie detection model.
The input of the pre-trained spatio-temporal Transformer model and the input of the pre-trained text feature extraction model serve as the two inputs of the lie detection model; the output of the pre-trained spatio-temporal Transformer model and the output of the pre-trained text feature extraction model serve as the input of the pre-trained feature fusion model; the output of the pre-trained feature fusion model serves as the input of the pre-trained classification model; and the output of the pre-trained classification model serves as the output of the lie detection model.
In a third aspect, embodiments of the present disclosure provide a lie detection apparatus based on image-text fusion. The lie detection apparatus includes: a first video processing module, a first image feature extraction module, a first text feature extraction module, a first feature fusion module and a first lie detection module. The first video processing module is used for performing frame sampling and speech-to-text conversion on the video data of a user to be tested to obtain a plurality of video frame images and a text. The first image feature extraction module is configured to perform feature extraction on the plurality of video frame images based on a pre-trained spatio-temporal Transformer model, so as to obtain visual image features containing a spatio-temporal fusion dimension. The first text feature extraction module is used for performing feature extraction on the text based on a pre-trained text feature extraction model to obtain text features. The first feature fusion module is used for fusing the visual image features and the text features based on a pre-trained feature fusion model to obtain fusion features. The first lie detection module is used for inputting the fusion features into a pre-trained classification model and outputting a lie detection result for the user to be tested.
In a fourth aspect, embodiments of the present disclosure provide an apparatus for constructing a lie detection model. The apparatus includes: a second video processing module, a second image feature extraction module, a second text feature extraction module, a second feature fusion module, a second lie detection module and a training module. The second video processing module is used for performing frame sampling and speech-to-text conversion on video sample data to obtain a plurality of video frame image samples and a text sample. The second image feature extraction module is configured to perform feature extraction on the plurality of video frame image samples based on a spatio-temporal Transformer model to be trained, so as to obtain visual image sample features containing a spatio-temporal fusion dimension. The second text feature extraction module is used for performing feature extraction on the text sample based on a text feature extraction model to be trained, to obtain text sample features. The second feature fusion module is used for fusing the visual image sample features and the text sample features based on a feature fusion model to be trained, to obtain sample fusion features. The second lie detection module is used for inputting the sample fusion features into a classification model to be trained and outputting a sample lie detection result. The training module is configured to synchronously train the spatio-temporal Transformer model, text feature extraction model, feature fusion model and classification model to be trained, using the real result of the video sample data as a training label and aiming to reduce the difference between the real result and the output sample lie detection result; the resulting pre-trained spatio-temporal Transformer model, text feature extraction model, feature fusion model and classification model form the lie detection model.
The input of the pre-trained spatio-temporal Transformer model and the input of the pre-trained text feature extraction model serve as the two inputs of the lie detection model; the output of the pre-trained spatio-temporal Transformer model and the output of the pre-trained text feature extraction model serve as the input of the pre-trained feature fusion model; the output of the pre-trained feature fusion model serves as the input of the pre-trained classification model; and the output of the pre-trained classification model serves as the output of the lie detection model.
In a fifth aspect, embodiments of the present disclosure provide an electronic device. The electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing, when executing the program stored in the memory, the above lie detection method based on image-text fusion or the above method of constructing a lie detection model.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the above lie detection method based on image-text fusion or the above method of constructing a lie detection model.
The technical solution provided by the embodiments of the present disclosure has at least some or all of the following advantages:
Frame sampling and speech-to-text conversion are performed on the video data of the user to be tested to obtain a plurality of video frame images and a text of the user to be tested within a target period. Feature extraction is performed on the plurality of video frame images based on a spatio-temporal Transformer model to obtain visual image features containing a spatio-temporal fusion dimension; compared with a general image extraction model, richer and deeper image features can be extracted in both the temporal and spatial dimensions, and both the association between image blocks of different video frame images and the association between different image blocks within the same video frame image can be learned. This facilitates interactive learning and complementary fusion between the text features and the visual image features during feature fusion, yielding more accurate fusion features and improving the accuracy of lie detection. At the same time, lie detection is performed based on features of only two modalities, image and text, which effectively reduces the complexity of the lie detection process compared with schemes using three or more modalities.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings required in the description of the embodiments or the related art are briefly introduced below; it is apparent that those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 schematically illustrates a flow chart of a lie detection method based on image text fusion according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a model structure of a spatio-temporal Transformer model and a process of image feature extraction according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a detailed implementation flowchart of step S140, according to an embodiment of the present disclosure;
fig. 4 schematically illustrates a flow chart of a lie detection method based on image text fusion according to another embodiment of the present disclosure;
fig. 5 schematically illustrates a block diagram of a lie detection apparatus based on image text fusion according to an embodiment of the present disclosure;
Fig. 6 schematically shows a block diagram of a device for constructing a lie detection model according to an embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some, but not all, embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the disclosure, are within the scope of the disclosure.
A first exemplary embodiment of the present disclosure provides a lie detection method based on image text fusion. The lie detection method of the present embodiment may be performed by an electronic device having computing capabilities, for example, by a terminal device or by a server.
Fig. 1 schematically illustrates a flow chart of a lie detection method based on image text fusion according to an embodiment of the present disclosure.
Referring to fig. 1, a lie detection method based on image text fusion according to an embodiment of the present disclosure includes the following steps: s110, S120, S130, S140, and S150.
In step S110, frame sampling and speech-to-text conversion are performed on the video data of the user to be tested to obtain a plurality of video frame images and a text.
For example, the video data of the user to be tested is video data containing the voice and images of the user to be tested; the total duration and the selected period of the video data can be adjusted as required. The video data may be acquired in a scene where lie detection is required.
The video data of the user to be tested is sampled frame by frame, for example at equal intervals in time order, taking one video frame image every preset time length; alternatively, frames may be sampled at varying time intervals, or sampled randomly.
For example, frame sampling processing is performed on 2 minutes of video data of the user to be detected within a certain period of time to obtain T video frames, where T ≥ 2 and T is a positive integer; and speech-to-text processing is performed on the speech in the same 2 minutes of video data to obtain a corresponding text.
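As a minimal sketch of the sampling strategies just described (the function name and parameters are illustrative and not part of the patent):

```python
def sample_frame_indices(total_frames, fps, interval_sec=None, num_samples=None):
    """Return the indices of the frames to sample from a clip."""
    if interval_sec is not None:
        # fixed-interval sampling: one frame every `interval_sec` seconds
        step = int(fps * interval_sec)
        return list(range(0, total_frames, step))
    # uniform sampling of `num_samples` frames across the whole clip
    stride = total_frames / num_samples
    return [int(i * stride) for i in range(num_samples)]

# a 2-minute clip at 30 fps, one frame every 8 seconds -> T = 15 video frames
print(len(sample_frame_indices(total_frames=3600, fps=30, interval_sec=8)))  # 15
print(sample_frame_indices(total_frames=3600, fps=30, num_samples=4))        # [0, 900, 1800, 2700]
```

Random sampling would simply draw T distinct indices from `range(total_frames)` instead.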
In step S120, feature extraction is performed on the plurality of video frame images based on the pre-trained spatiotemporal Transformer model 101, so as to obtain visual image features containing a spatiotemporal fusion dimension.
Fig. 2 schematically illustrates a model structure diagram of a spatiotemporal Transformer model and a process diagram of image feature extraction according to an embodiment of the present disclosure.
Referring to the lower dashed box in FIG. 2, in some embodiments, the above-described spatiotemporal Transformer model 101 comprises at least one layer of spatiotemporal associative learning network, each layer comprising a temporal attention network 201 and a spatial attention network 202 connected to each other. In each layer of spatiotemporal associative learning network, either the output of the temporal attention network serves as the input of the spatial attention network, or the output of the spatial attention network serves as the input of the temporal attention network; FIG. 2 takes the case where the output of the temporal attention network 201 serves as the input of the spatial attention network 202 as an example. The number of layers of the spatiotemporal associative learning network (a hyperparameter) can be adjusted according to the training results of the actual model.
In some embodiments, feature extraction is performed on the plurality of video frame images based on the at least one layer of spatiotemporal associative learning network, and visual image features containing the spatiotemporal fusion dimension are output.
In other embodiments, referring to the overall dashed box in FIG. 2, the spatiotemporal Transformer model 101 further comprises a feed-forward neural network 203, wherein the output of the temporal attention network 201 and the output of the spatial attention network 202 together serve as the input of the feedforward neural network 203, and the output of the feedforward neural network 203 serves as the output of the spatiotemporal Transformer model 101. Providing the feedforward neural network helps reduce the dimension of the image features containing the spatiotemporal fusion dimension, obtained by the at least one layer of spatiotemporal associative learning network, down to a target dimension for output.
In some embodiments, referring to fig. 2, in step S120, performing feature extraction on the plurality of video frame images based on the pre-trained spatiotemporal Transformer model 101 to obtain visual image features containing the spatiotemporal fusion dimension includes: dividing each video frame image into a plurality of non-overlapping image blocks; inputting the image blocks of the plurality of video frame images into the above-described spatiotemporal Transformer model, identifying correlations between image blocks located in different video frame images through the pre-trained temporal attention network 201, and identifying correlations between image blocks located in the same video frame image through the pre-trained spatial attention network 202.
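The division of labor between the temporal attention network and the spatial attention network can be sketched as follows, with a toy single-head attention standing in for the trained networks (the function names and the unbatched shapes are illustrative assumptions):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over the rows of x: (L, D) -> (L, D)."""
    d = x.shape[1]
    score = x @ x.T / np.sqrt(d)
    e = np.exp(score - score.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

def divided_space_time_block(tokens):
    """tokens: (T, N, D) image-block features. Temporal attention first, then spatial."""
    T, N, D = tokens.shape
    # temporal attention: each of the N block positions attends across the T frames
    t_out = np.stack([self_attention(tokens[:, j]) for j in range(N)], axis=1)
    # spatial attention: the N blocks within each frame attend to one another
    return np.stack([self_attention(t_out[i]) for i in range(T)], axis=0)

x = np.random.default_rng(0).normal(size=(4, 6, 8))  # T=4 frames, N=6 blocks, D=8
print(divided_space_time_block(x).shape)             # (4, 6, 8)
```

This corresponds to the ordering shown in Fig. 2; the alternative ordering simply swaps the two loops.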
For example, for a given video clip X ∈ ℝ^(T×3×H×W), where T represents the total number of sampled video frames, H and W represent the height and width in the resolution of each video frame image, and 3 represents the 3 color channels, each video frame image is divided into N non-overlapping image blocks of size P×P, where N = H×W/P².
For example, the dimension of the input plurality of video frame images is T×N×(3×P²), and the dimension of the output visual image features is T×D, where T represents the total number of sampled video frames, N represents the total number of image blocks partitioned in each frame, 3×P² represents the feature dimension of each image block, and D represents the feature dimension of each video frame image.
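Under the dimension conventions above, the partition of each frame into N non-overlapping P×P blocks can be sketched as (function name illustrative):

```python
import numpy as np

def to_patches(clip, P):
    """Rearrange a clip of shape (T, 3, H, W) into (T, N, 3*P*P) non-overlapping blocks."""
    T, C, H, W = clip.shape
    assert H % P == 0 and W % P == 0
    N = (H // P) * (W // P)
    x = clip.reshape(T, C, H // P, P, W // P, P)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # (T, H/P, W/P, C, P, P)
    return x.reshape(T, N, C * P * P)

clip = np.zeros((8, 3, 224, 224), dtype=np.float32)  # T=8 sampled frames
patches = to_patches(clip, P=16)
print(patches.shape)  # (8, 196, 768): N = 224*224/16**2 = 196, 3*P*P = 768
```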
Image feature extraction based on the spatiotemporal Transformer model makes it possible to extract visual representations with rich spatiotemporal features from sparsely sampled video frames. The temporal attention network focuses on information interaction between different video frame images of the same video, while the spatial attention network focuses on information interaction between different image blocks within each frame image; in combination, a powerful visual feature representation can be captured. A traditional convolutional neural network considers only spatial information and does not exploit temporal information; by contrast, the spatiotemporal Transformer model of the embodiment of the present disclosure adopts a Transformer structure for modeling and can process spatial and temporal information simultaneously, so that dynamic information in the video is captured better. Meanwhile, because the Transformer structure has been successfully applied in the NLP (natural language processing) field, the model can easily be migrated to other video tasks, such as video classification, behavior recognition and the like.
In step S130, feature extraction is performed on the text based on the pre-trained text feature extraction model 102, so as to obtain text features.
In some embodiments, the text feature extraction model is a BERT model. For example, the input text is converted into the format required by the BERT model (e.g., tokenization, adding special tokens, adding segment information, etc.). The processed text is input into the BERT model to obtain a word vector representation of each morpheme (token). The resulting word vectors are pooled to obtain the text features of the entire text.
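Assuming the per-morpheme word vectors have already been produced by a BERT encoder, the final pooling step can be sketched as (the function and the choice of pooling mode are illustrative):

```python
import numpy as np

def pool_text_features(token_vectors, mode="cls"):
    """Pool per-token BERT vectors of shape (L, D) into one text feature of shape (D,)."""
    if mode == "cls":
        return token_vectors[0]        # the [CLS] token vector
    return token_vectors.mean(axis=0)  # mean pooling over all tokens

vecs = np.random.default_rng(0).normal(size=(12, 768))  # 12 tokens, hidden size 768 (BERT-base)
feat = pool_text_features(vecs, mode="mean")
print(feat.shape)  # (768,)
```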
In step S140, the visual image features and the text features are fused based on the pre-trained feature fusion model 103, so as to obtain fusion features.
Fig. 3 schematically shows a detailed implementation flowchart of step S140 according to an embodiment of the present disclosure.
In some embodiments, referring to the circled box in fig. 3, the feature fusion model 103 includes: a cross-modal interactive learning Transformer model 301 and a self-attention model 302.
Referring to fig. 3, in step S140, fusing the visual image features and the text features based on the pre-trained feature fusion model 103 to obtain the fusion features includes the following steps: S310, S320 and S330.
In step S310, based on the cross-modal interactive learning Transformer model 301, cross-modal interactive learning between the visual image features and the text features is performed by using a multi-head attention mechanism, so as to obtain enhanced visual image features and enhanced text features.
Based on the multi-head attention mechanism, the visual image feature corresponding to each video frame image performs cross-modal interactive learning from the text features (here, the text features corresponding to all morphemes in the text obtained by speech-to-text processing; this is a soft alignment, not a strict temporal alignment (i.e., hard alignment)), so as to obtain enhanced visual image features that have learned from the text features. Likewise, the text feature of each morpheme performs cross-modal interactive learning from the visual image features corresponding to the plurality of video frame images, so as to obtain enhanced text features that have learned from the visual image features.
Taking the text features as an example of performing cross-modal interactive learning from the visual image features: the text features provide the query vector Q, the visual image features provide the key vector K and the value vector V, and the two modalities interact using a Multi-Head Cross-Attention mechanism.
The query is defined as $Q_m = F_m W_Q$, the key is defined as $K_m = F_m W_K$, and the value is defined as $V_m = F_m W_V$, where $W$ is a learnable weight matrix, $F$ is an input feature, and the subscript $m \in \{t, a\}$, with $t$ denoting text and $a$ denoting image. The calculation formulas are as follows:

$$\mathrm{score}_i = \frac{Q_t K_a^{T}}{\sqrt{d}} \,, \qquad (1)$$

$$\mathrm{CA}_i(F_t, F_a) = \mathrm{softmax}\!\left(\frac{Q_t K_a^{T}}{\sqrt{d}}\right) V_a \,, \qquad (2)$$

where $\mathrm{score}_i$ is the scoring matrix of the cross-modal interactive learning of the text features from the visual image features, $\mathrm{CA}_i$ represents the $i$-th head of the cross-modal attention, $\sqrt{d}$ is a scaling factor that makes the score values smoother so that the gradients are more stable, and the superscript $T$ denotes the transpose.

After softmax normalization, the Multi-Head Cross-Attention (MH-CA for short) is computed as:

$$\mathrm{MH\text{-}CA}(F_t, F_a) = W_{mh}\,[\mathrm{CA}_1; \mathrm{CA}_2; \ldots; \mathrm{CA}_n] \,, \qquad (3)$$

where $W_{mh}$ is the weight matrix of the multi-head attention, and $n$ is the total number of heads of the multi-head attention.
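A numerical sketch of formulas (1)–(3), splitting the feature dimension D evenly across n heads (the weight shapes and the column-slicing layout of the heads are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(F_t, F_a, W_q, W_k, W_v, W_o, n_heads):
    """MH-CA: text features F_t (L_t, D) query visual features F_a (L_a, D)."""
    D = F_t.shape[1]
    d = D // n_heads                                 # per-head dimension
    Q, K, V = F_t @ W_q, F_a @ W_k, F_a @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d, (i + 1) * d)
        score = Q[:, s] @ K[:, s].T / np.sqrt(d)     # formula (1), scaled scores
        heads.append(softmax(score) @ V[:, s])       # formula (2), one attention head
    return np.concatenate(heads, axis=1) @ W_o       # formula (3), concat + projection

rng = np.random.default_rng(0)
D = 64
F_t, F_a = rng.normal(size=(10, D)), rng.normal(size=(16, D))
Ws = [rng.normal(size=(D, D)) * 0.1 for _ in range(4)]
out = multi_head_cross_attention(F_t, F_a, *Ws, n_heads=8)
print(out.shape)  # (10, 64): one enhanced feature per text morpheme
```

The symmetric direction (visual features querying text features) swaps the roles of F_t and F_a.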
In step S320, cross-modal feature stitching is performed according to the enhanced visual image features and the enhanced text features, so as to obtain stitched image text features.
In some embodiments, in step S320, cross-modal feature stitching is performed according to the enhanced visual image feature and the enhanced text feature to obtain a stitched image text feature, which includes: and splicing a first enhanced text feature corresponding to a first morpheme (CLS token) in the enhanced text features with the enhanced visual image feature to obtain spliced image text features.
Because the enhanced text features obtained through cross-modal feature interactive learning with the multi-head attention mechanism already incorporate the interactive learning over all the video frame images, using the first morpheme to represent the whole text for feature stitching does not omit overall information, and reduces the amount of computation in the fusion after stitching.
Alternatively, in other embodiments, in step S320, cross-modal feature stitching is performed according to the enhanced visual image feature and the enhanced text feature to obtain a stitched image text feature, which includes: sequentially inputting the enhanced text features into a normalization layer and another feedforward neural network for processing to obtain target enhanced text features, and sequentially inputting the enhanced visual image features into the normalization layer and the feedforward neural network for processing to obtain target enhanced visual image features; and splicing the first target enhanced text feature corresponding to the first morpheme in the target enhanced text feature with the target enhanced visual image feature to obtain a spliced image text feature.
In this embodiment, after the enhanced text features and the enhanced visual image features are obtained, both are sequentially input into a normalization layer (LayerNorm) and another feedforward neural network (distinct from the feedforward neural network 203 in the spatiotemporal Transformer model above) for processing. The main functions of the normalization layer and the feedforward neural network are to reduce the dimension of the enhanced text features and the enhanced visual image features, normalize them to the same dimension, prevent gradient vanishing or gradient explosion during model training, and improve the training convergence of the fusion model.
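A sketch of this stitching step, with a toy LayerNorm and the feedforward network omitted for brevity (shapes and names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector (last axis) to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def stitch(enh_text, enh_visual):
    """Prepend the [CLS] (first-morpheme) text feature to the T visual features."""
    t = layer_norm(enh_text)
    v = layer_norm(enh_visual)
    return np.concatenate([t[:1], v], axis=0)   # (1 + T, D)

text = np.random.default_rng(1).normal(size=(12, 64))    # 12 morphemes
visual = np.random.default_rng(2).normal(size=(8, 64))   # T = 8 frames
fused_in = stitch(text, visual)
print(fused_in.shape)  # (9, 64), the input to the self-attention fusion model
```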
In step S330, deep fusion learning is performed on the text features of the stitched image based on the self-attention model 302, so as to obtain fusion features.
By performing deep feature fusion learning on the stitched image text features based on the self-attention model, the accuracy of feature fusion learning can be further improved on the basis of the features learned by the cross-modal interactive learning Transformer model.
Fusing the visual image features containing the spatiotemporal fusion dimension with the extracted text features allows the image features (such as the facial expressions and body movement features of the speaker) and the text features to learn from and complement each other, so that a more accurate detection result is obtained; compared with single-modality detection (such as lie recognition based only on the speaker's voice), this yields better detection accuracy.
In step S150, the fusion feature is input into a pre-trained classification model, and a lie detection result of the user to be detected is output.
In some embodiments, the classification model is a multi-layer perceptron model. The multi-layer perceptron model comprises a full connection layer, a ReLU activation function and an output layer.
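A forward pass of such a multi-layer perceptron can be sketched as (the hidden width of 32 and the two-class output are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def mlp_classify(fused, W1, b1, W2, b2):
    """Fully-connected layer -> ReLU -> output layer, giving lie/truth logits."""
    return relu(fused @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
fused = rng.normal(size=(64,))                  # fusion feature from the fusion model
W1, b1 = rng.normal(size=(64, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 2)) * 0.1, np.zeros(2)
logits = mlp_classify(fused, W1, b1, W2, b2)
print(logits.shape)  # (2,): argmax gives the predicted class
```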
In the embodiment comprising steps S110 to S150, frame sampling and speech-to-text processing are performed on the video data of the user to be tested to obtain a plurality of video frame images and a text of the user to be tested in a target period, and feature extraction is performed on the plurality of video frame images based on the spatiotemporal Transformer model to obtain visual image features containing the spatiotemporal fusion dimension. Compared with a general image extraction model, richer and deeper image features can be extracted in the two dimensions of time and space, and both the correlations between image blocks of different video frame images and the correlations between different image blocks in the same video frame image are learned. This helps the text features interact with and complement the visual image features during feature fusion, yielding more accurate fusion features and improving the accuracy of lie detection. Meanwhile, lie detection is performed based on features of only two modalities, image and text; compared with three or more modalities, the complexity of the lie detection process is effectively reduced.
Fig. 4 schematically shows a flow chart of a lie detection method based on image text fusion according to another embodiment of the present disclosure.
In other embodiments, the lie detection method includes, in addition to steps S110 to S150, steps of pre-training the spatiotemporal Transformer model 401 to be trained, the text feature extraction model 402, the feature fusion model 403 and the classification model 404: S410, S420, S430, S440, S450 and S460; steps S410-S460 are illustrated in FIG. 4 in simplified form.
In step S410, frame sampling and speech-to-text processing are performed on the video sample data, so as to obtain a plurality of video frame image samples and text samples.
In step S420, feature extraction is performed on the plurality of video frame image samples based on the spatiotemporal Transformer model 401 to be trained, so as to obtain visual image sample features containing a spatiotemporal fusion dimension.
In step S430, feature extraction is performed on the text sample based on the text feature extraction model 402 to be trained, so as to obtain text sample features.
In step S440, the visual image sample feature and the text sample feature are fused based on the feature fusion model 403 to be trained, so as to obtain a sample fusion feature.
In step S450, the above-mentioned sample fusion features are input into the classification model 404 to be trained, and the sample lie detection result is output.
In step S460, using the real result of the video sample data as a training label and aiming to reduce the gap between the real result and the output sample lie detection result, the spatiotemporal Transformer model 401, the text feature extraction model 402, the feature fusion model 403 and the classification model 404 to be trained are trained synchronously, so as to obtain the pre-trained spatiotemporal Transformer model 101, text feature extraction model 102, feature fusion model 103 and classification model 104.
For example, supervised training is performed on a manually collected labeled dataset; the optimizer is AdamW and the loss function is the cross-entropy loss. By way of example, the hyperparameters of the model are set as follows: batch size 32, learning rate 5e-5, 8 attention heads, and 4 model layers.
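The cross-entropy loss and the quoted hyperparameters can be sketched as (the config dictionary only restates the values in the text):

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy loss for one sample: -log softmax(logits)[label]."""
    z = logits - logits.max()                    # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# hyperparameters quoted in the text
config = {"optimizer": "AdamW", "batch_size": 32, "lr": 5e-5,
          "n_heads": 8, "n_layers": 4}

loss = cross_entropy(np.array([2.0, 0.5]), label=0)
print(round(float(loss), 4))  # 0.2014
```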
In this embodiment, non-contact multi-modal lie detection is performed by combining images and text: the visual image features and text features are extracted by the spatiotemporal Transformer model and the BERT model respectively, the image and text features are then fused by the cross-modal Transformer model and the self-attention model, and lie detection is performed using the fused features. The model is simple and effective, greatly improves the accuracy of lie detection, and reduces the detection complexity.
A second exemplary embodiment of the present disclosure provides a method of constructing a lie detection model. Details of the specific steps in this embodiment may refer to details of steps S410 to S460 in the first embodiment.
The method for constructing the lie detection model specifically comprises the following steps:
performing frame sampling and speech-to-text processing on the video sample data to obtain a plurality of video frame image samples and text samples;
based on a spatiotemporal Transformer model to be trained, performing feature extraction on the plurality of video frame image samples to obtain visual image sample features containing a spatiotemporal fusion dimension;
based on a text feature extraction model to be trained, extracting features of the text sample to obtain text sample features;
based on a feature fusion model to be trained, fusing the visual image sample features and the text sample features to obtain sample fusion features;
inputting the sample fusion characteristics into a classification model to be trained, and outputting to obtain a sample lie detection result;
and using the real result of the video sample data as a training label and aiming to reduce the gap between the real result and the output sample lie detection result, synchronously training the spatiotemporal Transformer model to be trained, the text feature extraction model, the feature fusion model and the classification model, wherein the resulting pre-trained spatiotemporal Transformer model, text feature extraction model, feature fusion model and classification model constitute the lie detection model.
The resulting pre-trained spatiotemporal Transformer model 101, text feature extraction model 102, feature fusion model 103 and classification model 104 constitute the lie detection model. As shown in fig. 1 and 4, the input of the pre-trained spatiotemporal Transformer model 101 and the input of the pre-trained text feature extraction model 102 serve as the two inputs of the lie detection model; the output of the pre-trained spatiotemporal Transformer model 101 and the output of the pre-trained text feature extraction model 102 serve as the input of the pre-trained feature fusion model 103; the output of the pre-trained feature fusion model 103 serves as the input of the pre-trained classification model 104; and the output of the pre-trained classification model 104 serves as the output of the lie detection model.
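The wiring between the four pre-trained parts described above can be sketched with placeholder callables standing in for the actual models (everything in this sketch is illustrative):

```python
def build_lie_detector(st_model, text_model, fusion_model, classifier):
    """Compose the four pre-trained parts into one lie-detection pipeline."""
    def detect(frames, text):
        visual = st_model(frames)          # spatiotemporal Transformer model
        textual = text_model(text)         # text feature extraction model
        fused = fusion_model(visual, textual)  # feature fusion model
        return classifier(fused)           # classification model
    return detect

# toy stand-ins for the four pre-trained models
detector = build_lie_detector(
    st_model=lambda f: sum(f),
    text_model=lambda t: len(t),
    fusion_model=lambda v, x: v + x,
    classifier=lambda z: "lie" if z > 10 else "truth",
)
print(detector([1, 2, 3], "hello"))  # 6 + 5 = 11 > 10 -> "lie"
```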
A third exemplary embodiment of the present disclosure provides a lie detection apparatus based on image text fusion.
Fig. 5 schematically shows a block diagram of a lie detection apparatus based on image text fusion according to an embodiment of the present disclosure.
Referring to fig. 5, a lie detection apparatus 500 based on image text fusion of the present embodiment includes: a first video processing module 501, a first image feature extraction module 502, a first text feature extraction module 503, a first feature fusion module 504, and a first lie detection module 505.
The first video processing module 501 is configured to perform frame sampling and speech-to-text processing on the video data of the user to be tested, so as to obtain a plurality of video frame images and a text.
The first image feature extraction module 502 is configured to perform feature extraction on the plurality of video frame images based on a pre-trained spatiotemporal Transformer model, so as to obtain visual image features containing a spatiotemporal fusion dimension. In some embodiments, the first image feature extraction module 502 contains or is capable of invoking the pre-trained spatiotemporal Transformer model 101.
The first text feature extraction module 503 is configured to perform feature extraction on the text based on a pre-trained text feature extraction model, so as to obtain text features. In some embodiments, the first text feature extraction module 503 contains or is capable of invoking the pre-trained text feature extraction model 102.
The first feature fusion module 504 is configured to fuse the visual image feature and the text feature based on a pre-trained feature fusion model, so as to obtain a fused feature. In some embodiments, the first feature fusion module 504 contains or is capable of invoking the pre-trained feature fusion model 103.
The first lie detection module 505 is configured to input the fusion feature into a pre-trained classification model, and output a lie detection result of the user to be detected. In some embodiments, the first lie detection module 505 contains or is capable of invoking the pre-trained classification model 104.
In some embodiments, the apparatus 500 further comprises: a model building module, the model building module comprising: the system comprises a second video processing module, a second image feature extraction module, a second text feature extraction module, a second feature fusion module, a second lie detection module and a training module.
The second video processing module is configured to perform frame sampling and speech-to-text processing on the video sample data to obtain a plurality of video frame image samples and text samples.
The second image feature extraction module is configured to perform feature extraction on the plurality of video frame image samples based on the spatiotemporal Transformer model 401 to be trained, so as to obtain visual image sample features containing a spatiotemporal fusion dimension.
The second text feature extraction module is configured to perform feature extraction on the text sample based on the text feature extraction model 402 to be trained, so as to obtain text sample features.
The second feature fusion module is configured to fuse the visual image sample feature and the text sample feature based on a feature fusion model 403 to be trained, so as to obtain a sample fusion feature.
The second lie detection module is configured to input the sample fusion feature into the classification model 404 to be trained, and output a sample lie detection result.
The training module is configured to synchronously train the spatiotemporal Transformer model 401, the text feature extraction model 402, the feature fusion model 403 and the classification model 404 to be trained, using the real result of the video sample data as a training label and aiming to reduce the gap between the real result and the output sample lie detection result, so as to obtain the pre-trained spatiotemporal Transformer model 101, text feature extraction model 102, feature fusion model 103 and classification model 104.
The relevant content of the first embodiment may be incorporated into the understanding of the present embodiment as appropriate.
A fourth exemplary embodiment of the present disclosure provides an apparatus for constructing a lie detection model.
Fig. 6 schematically shows a block diagram of a device for constructing a lie detection model according to an embodiment of the present disclosure.
Referring to fig. 6, an apparatus 600 for constructing a lie detection model includes: a second video processing module 601, a second image feature extraction module 602, a second text feature extraction module 603, a second feature fusion module 604, a second lie detection module 605, and a training module 606.
The second video processing module 601 is configured to perform frame sampling and speech-to-text processing on the video sample data, so as to obtain a plurality of video frame image samples and text samples.
The second image feature extraction module 602 is configured to perform feature extraction on the plurality of video frame image samples based on a spatiotemporal Transformer model to be trained, so as to obtain visual image sample features containing a spatiotemporal fusion dimension.
The second text feature extraction module 603 is configured to perform feature extraction on the text sample based on a text feature extraction model to be trained, so as to obtain text sample features.
The second feature fusion module 604 is configured to fuse the visual image sample feature and the text sample feature based on a feature fusion model to be trained, so as to obtain a sample fusion feature.
The second lie detection module 605 is configured to input the sample fusion feature into a classification model to be trained, and output a sample lie detection result.
The training module 606 is configured to synchronously train the spatiotemporal Transformer model to be trained, the text feature extraction model, the feature fusion model and the classification model, using the real result of the video sample data as a training label and aiming to reduce the gap between the real result and the output sample lie detection result, so as to obtain a pre-trained spatiotemporal Transformer model, text feature extraction model, feature fusion model and classification model, which constitute the lie detection model.
The input of the pre-trained spatiotemporal Transformer model and the input of the pre-trained text feature extraction model serve as the two inputs of the lie detection model; the output of the pre-trained spatiotemporal Transformer model and the output of the pre-trained text feature extraction model serve as the input of the pre-trained feature fusion model; the output of the pre-trained feature fusion model serves as the input of the pre-trained classification model; and the output of the pre-trained classification model serves as the output of the lie detection model.
Any of the functional modules included in the apparatus 500 or the apparatus 600 may be combined and implemented in one module, or any of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. At least one of the functional modules included in apparatus 500 or apparatus 600 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-a-substrate, a system-on-a-package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuits, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the functional modules included in the apparatus 500 or the apparatus 600 may be implemented at least partly as a computer program module, which when executed may perform the corresponding functions.
A fifth exemplary embodiment of the present disclosure provides an electronic device.
Fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the disclosure.
Referring to fig. 7, an electronic device 700 provided by an embodiment of the present disclosure includes a processor 701, a communication interface 702, a memory 703, and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 complete communication with each other through the communication bus 704; a memory 703 for storing a computer program; the processor 701 is configured to implement a lie detection method based on image text fusion or a method of constructing a lie detection model as described above when executing a program stored in a memory.
The sixth exemplary embodiment of the present disclosure also provides a computer-readable storage medium. The computer-readable storage medium stores thereon a computer program which, when executed by a processor, implements the lie detection method or the method of constructing a lie detection model based on image text fusion as described above.
The computer-readable storage medium may be embodied in the apparatus/means described in the above embodiments; or may exist alone without being assembled into the apparatus/device. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It should be noted that, in the technical solution provided by the embodiment of the present disclosure, the related aspects of collecting, updating, analyzing, processing, using, transmitting, storing, etc. of the personal information of the user all conform to the rules of relevant laws and regulations, and are used for legal purposes without violating the public order colloquial. Necessary measures are taken for the personal information of the user, illegal access to the personal information data of the user is prevented, and the personal information security, network security and national security of the user are maintained.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A lie detection method based on image text fusion, comprising:
performing frame sampling and speech-to-text processing on video data of a user to be tested to obtain a plurality of video frame images and a text;
based on a pre-trained spatiotemporal Transformer model, extracting features of the plurality of video frame images to obtain visual image features containing spatiotemporal fusion dimensions;
based on a pre-trained text feature extraction model, extracting features of the text to obtain text features;
based on a pre-trained feature fusion model, fusing the visual image features and the text features to obtain fusion features;
and inputting the fusion features into a pre-trained classification model, and outputting a lie detection result for the user to be tested.
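Outside the claim language, the overall processing chain of claim 1 can be sketched as below. This is an illustrative sketch only: every function and model name here (`sample_frames`, `detect_lie`, the lambda stand-ins) is hypothetical and not the patent's actual implementation; the trained networks are replaced by trivial placeholders so the pipeline shape is visible.

```python
# Illustrative sketch of the claim-1 pipeline. All model callables are
# hypothetical stand-ins for the pre-trained networks named in the claim.

def sample_frames(num_total_frames, num_samples):
    """Uniformly sample frame indices from a video (the frame-sampling step)."""
    if num_samples >= num_total_frames:
        return list(range(num_total_frames))
    step = num_total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def detect_lie(video_frames, transcript,
               visual_model, text_model, fusion_model, classifier):
    """Claim-1 chain: extract visual and text features, fuse, classify."""
    visual_features = visual_model(video_frames)   # spatiotemporal Transformer
    text_features = text_model(transcript)         # text feature extractor
    fused = fusion_model(visual_features, text_features)
    return classifier(fused)                       # lie / truth decision

# Toy stand-ins so the sketch runs end to end.
frames = sample_frames(300, 8)
result = detect_lie(frames, "I was at home all night",
                    visual_model=lambda f: [len(f)],
                    text_model=lambda t: [len(t.split())],
                    fusion_model=lambda v, t: v + t,
                    classifier=lambda x: "truthful" if sum(x) % 2 == 0
                                         else "deceptive")
```

The stand-in classifier is arbitrary; in the claimed method it would be the pre-trained classification model operating on the fusion features.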
2. The lie detection method according to claim 1, characterized in that the spatiotemporal Transformer model comprises: at least one layer of spatiotemporal associative learning network, the spatiotemporal associative learning network comprising: a temporal attention network and a spatial attention network connected to each other; in each layer of spatiotemporal associative learning network, the output of the temporal attention network is used as the input of the spatial attention network, or the output of the spatial attention network is used as the input of the temporal attention network.
3. The lie detection method according to claim 2, characterized in that the spatiotemporal Transformer model further comprises: a feed-forward neural network;
wherein the output of the temporal attention network and the output of the spatial attention network together serve as the input of the feed-forward neural network, and the output of the feed-forward neural network serves as the output of the spatiotemporal Transformer model.
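A minimal sketch of one spatiotemporal associative learning layer from claims 2-3, in the ordering where temporal attention feeds spatial attention. This is a toy illustration under strong simplifying assumptions (scalar tokens, single head, Q = K = V = token, no learned projections, feed-forward network omitted); the patent does not disclose these details.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention_1d(tokens):
    """Toy single-head self-attention over scalar tokens (Q = K = V = token)."""
    out = []
    for q in tokens:
        weights = softmax([q * k for k in tokens])
        out.append(sum(w * v for w, v in zip(weights, tokens)))
    return out

def spatiotemporal_block(grid):
    """One claim-2 layer: temporal attention output feeds spatial attention.

    grid[t][s] is a scalar feature for spatial patch s of frame t."""
    T, S = len(grid), len(grid[0])
    # Temporal attention: attend across frames at each spatial position.
    after_time = [[0.0] * S for _ in range(T)]
    for s in range(S):
        column = self_attention_1d([grid[t][s] for t in range(T)])
        for t in range(T):
            after_time[t][s] = column[t]
    # Spatial attention: attend across patches within each frame.
    return [self_attention_1d(row) for row in after_time]
```

The claim's alternative ordering (spatial first, then temporal) would simply swap the two loops.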
4. The lie detection method according to claim 2 or 3, wherein the extracting features of the plurality of video frame images based on a pre-trained spatiotemporal Transformer model to obtain visual image features containing spatiotemporal fusion dimensions comprises:
dividing each video frame image into a plurality of image blocks, wherein the image blocks do not overlap;
inputting the image blocks of the plurality of video frame images into the spatiotemporal Transformer model, identifying associations between image blocks of different video frame images by a pre-trained temporal attention network, and identifying associations between the plurality of image blocks within the same video frame image by a pre-trained spatial attention network.
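The non-overlapping division of claim 4 can be sketched as follows. The function name and the assumption that the frame dimensions are exact multiples of the patch size are illustrative choices, not taken from the patent.

```python
def split_into_patches(frame, patch):
    """Divide an H x W frame (a list of pixel rows) into non-overlapping
    patch x patch image blocks, as in claim 4. Assumes H and W are
    multiples of `patch`."""
    h, w = len(frame), len(frame[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([row[left:left + patch]
                            for row in frame[top:top + patch]])
    return patches

# Toy 4x4 frame with pixel values 0..15; splits into four 2x2 blocks.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = split_into_patches(frame, 2)
```

Because the blocks are non-overlapping, every pixel appears in exactly one block, which the test below checks by reassembling the value set.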
5. The lie detection method according to claim 1, characterized in that the feature fusion model comprises: a cross-modal interactive learning Transformer model and a self-attention model;
wherein the fusing of the visual image features and the text features based on the pre-trained feature fusion model to obtain fusion features comprises:
based on the cross-modal interactive learning Transformer model, performing cross-modal interactive learning between the visual image features and the text features by adopting a multi-head self-attention mechanism to obtain enhanced visual image features and enhanced text features;
performing cross-modal feature stitching according to the enhanced visual image features and the enhanced text features to obtain stitched image text features;
and performing deep fusion learning on the spliced image text features based on the self-attention model to obtain the fusion features.
6. The lie detection method according to claim 5, characterized in that the performing of cross-modal feature stitching according to the enhanced visual image features and the enhanced text features to obtain spliced image text features comprises:
splicing a first enhanced text feature corresponding to a first morpheme in the enhanced text features with the enhanced visual image feature to obtain spliced image text features;
or sequentially inputting the enhanced text features into a normalization layer and a feedforward neural network for processing to obtain target enhanced text features, and sequentially inputting the enhanced visual image features into the normalization layer and the feedforward neural network for processing to obtain target enhanced visual image features; and splicing the first target enhanced text feature corresponding to the first morpheme in the target enhanced text feature with the target enhanced visual image feature to obtain a spliced image text feature.
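The first splicing branch of claim 6 concatenates the feature of the first morpheme with the visual feature. The sketch below assumes that this first morpheme plays the role of a [CLS]-style summary token (an interpretation, not stated in the patent) and that features are flat vectors represented as Python lists.

```python
def splice_cls_with_visual(enhanced_text_tokens, enhanced_visual_feature):
    """Claim 6, first branch: concatenate the first morpheme's enhanced
    text feature (assumed here to be a [CLS]-style summary token) with
    the enhanced visual image feature."""
    first_token = enhanced_text_tokens[0]
    return first_token + enhanced_visual_feature  # list concatenation

# Toy per-morpheme text features and a toy visual feature vector.
text_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
visual = [0.7, 0.8, 0.9]
spliced = splice_cls_with_visual(text_tokens, visual)
```

The second branch of the claim would apply a normalization layer and a feed-forward network to each modality before the same concatenation step.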
7. The lie detection method according to claim 1, further comprising:
performing frame sampling and speech-to-text processing on video sample data to obtain a plurality of video frame image samples and a text sample;
based on a spatiotemporal Transformer model to be trained, extracting features of the plurality of video frame image samples to obtain visual image sample features containing spatiotemporal fusion dimensions;
based on a text feature extraction model to be trained, extracting features of the text sample to obtain text sample features;
based on a feature fusion model to be trained, fusing the visual image sample features and the text sample features to obtain sample fusion features;
inputting the sample fusion features into a classification model to be trained, and outputting a sample lie detection result;
and taking a real result of the video sample data as a training label, and taking reduction of the difference between the real result and the output sample lie detection result as a training objective, synchronously training the spatiotemporal Transformer model to be trained, the text feature extraction model to be trained, the feature fusion model to be trained, and the classification model to be trained, to obtain the pre-trained spatiotemporal Transformer model, the pre-trained text feature extraction model, the pre-trained feature fusion model, and the pre-trained classification model.
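The training objective in claim 7 is the gap between the real result and the output sample lie detection result. One common way to express such a gap for a binary decision is binary cross-entropy, sketched below; the choice of this particular loss and the label encoding (1 = lie, 0 = truth) are assumptions, as the patent does not name a loss function. Minimizing it jointly would update all four submodels at once.

```python
import math

def binary_cross_entropy(label, predicted_prob, eps=1e-12):
    """Illustrative loss for claim 7's training target: the difference
    between the ground-truth label (1 = lie, 0 = truth; an assumed
    encoding) and the model's predicted lie probability. The probability
    is clipped to avoid log(0)."""
    p = min(max(predicted_prob, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))
```

A confident correct prediction gives a near-zero loss, while an uncertain one gives a larger loss, which is the gradient signal the synchronous training would use.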
8. A method of constructing a lie detection model, comprising:
performing frame sampling and speech-to-text processing on video sample data to obtain a plurality of video frame image samples and a text sample;
based on a spatiotemporal Transformer model to be trained, extracting features of the plurality of video frame image samples to obtain visual image sample features containing spatiotemporal fusion dimensions;
based on a text feature extraction model to be trained, extracting features of the text sample to obtain text sample features;
based on a feature fusion model to be trained, fusing the visual image sample features and the text sample features to obtain sample fusion features;
inputting the sample fusion features into a classification model to be trained, and outputting a sample lie detection result;
taking the real result of the video sample data as a training label and reduction of the difference between the real result and the output sample lie detection result as a training objective, synchronously training the spatiotemporal Transformer model to be trained, the text feature extraction model, the feature fusion model, and the classification model, and forming a lie detection model from the obtained pre-trained spatiotemporal Transformer model, text feature extraction model, feature fusion model, and classification model;
wherein the input of the pre-trained spatiotemporal Transformer model and the input of the pre-trained text feature extraction model serve as the two inputs of the lie detection model, the output of the pre-trained spatiotemporal Transformer model and the output of the pre-trained text feature extraction model serve as the input of the pre-trained feature fusion model, the output of the pre-trained feature fusion model serves as the input of the pre-trained classification model, and the output of the pre-trained classification model serves as the output of the lie detection model.
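The wiring described in claim 8 (two inputs, two extractors feeding a fusion model, fusion feeding a classifier) can be sketched as a simple composition of callables. The class name and the trivial stand-in submodels are hypothetical; any trained models with compatible inputs and outputs would slot in.

```python
class LieDetectionModel:
    """Composition from claim 8: frames and text are the two model inputs;
    the two extractor outputs feed the fusion model, whose output feeds
    the classifier, whose output is the model output."""

    def __init__(self, visual_model, text_model, fusion_model, classifier):
        self.visual_model = visual_model
        self.text_model = text_model
        self.fusion_model = fusion_model
        self.classifier = classifier

    def __call__(self, frames, text):
        v = self.visual_model(frames)          # spatiotemporal Transformer
        t = self.text_model(text)              # text feature extractor
        return self.classifier(self.fusion_model(v, t))

# Trivial stand-in submodels (hypothetical) so the wiring is executable.
model = LieDetectionModel(
    visual_model=lambda frames: [float(len(frames))],
    text_model=lambda text: [float(len(text.split()))],
    fusion_model=lambda v, t: v + t,
    classifier=lambda fused: int(sum(fused) > 10.0),
)
```

The same composition is what the training of claim 7/8 would optimize end to end, since gradients flow from the classifier output back through the fusion model into both extractors.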
9. A lie detection device based on image text fusion, comprising:
the first video processing module is used for performing frame sampling and speech-to-text processing on video data of a user to be tested to obtain a plurality of video frame images and a text;
the first image feature extraction module is used for extracting features of the plurality of video frame images based on a pre-trained spatiotemporal Transformer model to obtain visual image features containing spatiotemporal fusion dimensions;
the first text feature extraction module is used for extracting features of the text based on a pre-trained text feature extraction model to obtain text features;
The first feature fusion module is used for fusing the visual image features and the text features based on a pre-trained feature fusion model to obtain fusion features;
and the first lie detection module is used for inputting the fusion features into a pre-trained classification model and outputting a lie detection result for the user to be tested.
10. An apparatus for constructing a lie detection model, comprising:
the second video processing module is used for performing frame sampling and speech-to-text processing on video sample data to obtain a plurality of video frame image samples and a text sample;
the second image feature extraction module is used for extracting features of the plurality of video frame image samples based on a spatiotemporal Transformer model to be trained to obtain visual image sample features containing spatiotemporal fusion dimensions;
the second text feature extraction module is used for extracting features of the text sample based on a text feature extraction model to be trained to obtain text sample features;
the second feature fusion module is used for fusing the visual image sample features and the text sample features based on a feature fusion model to be trained to obtain sample fusion features;
the second lie detection module is used for inputting the sample fusion features into a classification model to be trained and outputting a sample lie detection result;
the training module is used for taking the real result of the video sample data as a training label and reduction of the difference between the real result and the output sample lie detection result as a training objective, and synchronously training the spatiotemporal Transformer model to be trained, the text feature extraction model, the feature fusion model, and the classification model, so that the obtained pre-trained spatiotemporal Transformer model, text feature extraction model, feature fusion model, and classification model form a lie detection model;
wherein the input of the pre-trained spatiotemporal Transformer model and the input of the pre-trained text feature extraction model serve as the two inputs of the lie detection model, the output of the pre-trained spatiotemporal Transformer model and the output of the pre-trained text feature extraction model serve as the input of the pre-trained feature fusion model, the output of the pre-trained feature fusion model serves as the input of the pre-trained classification model, and the output of the pre-trained classification model serves as the output of the lie detection model.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
and the processor is used for implementing the method of any one of claims 1-8 when executing the program stored on the memory.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310815657.2A CN116522212B (en) | 2023-07-05 | 2023-07-05 | Lie detection method, device, equipment and medium based on image text fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310815657.2A CN116522212B (en) | 2023-07-05 | 2023-07-05 | Lie detection method, device, equipment and medium based on image text fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116522212A true CN116522212A (en) | 2023-08-01 |
CN116522212B CN116522212B (en) | 2023-09-26 |
Family
ID=87403345
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310815657.2A Active CN116522212B (en) | 2023-07-05 | 2023-07-05 | Lie detection method, device, equipment and medium based on image text fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116522212B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117197737A (en) * | 2023-09-08 | 2023-12-08 | 数字广东网络建设有限公司 | Land use detection method, device, equipment and storage medium |
CN117197737B (en) * | 2023-09-08 | 2024-05-28 | 数字广东网络建设有限公司 | Land use detection method, device, equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111137292A (en) * | 2018-11-01 | 2020-05-12 | 通用汽车环球科技运作有限责任公司 | Spatial and temporal attention based deep reinforcement learning for hierarchical lane change strategies for controlling autonomous vehicles |
CN112329748A (en) * | 2021-01-04 | 2021-02-05 | 中国科学院自动化研究所 | Automatic lie detection method, device, equipment and medium for interactive scene |
CN112329438A (en) * | 2020-10-27 | 2021-02-05 | 中科极限元(杭州)智能科技股份有限公司 | Automatic lie detection method and system based on domain confrontation training |
CN112329746A (en) * | 2021-01-04 | 2021-02-05 | 中国科学院自动化研究所 | Multi-mode lie detection method, device and equipment |
CN112861945A (en) * | 2021-01-28 | 2021-05-28 | 清华大学 | Multi-mode fusion lie detection method |
CN113239916A (en) * | 2021-07-13 | 2021-08-10 | 北京邮电大学 | Expression recognition and classroom state evaluation method, device and medium |
US20210248511A1 (en) * | 2020-02-12 | 2021-08-12 | Wipro Limited | System and method for detecting instances of lie using machine learning model |
CN113673322A (en) * | 2021-07-12 | 2021-11-19 | 王健 | Character expression posture lie detection method and system based on deep learning |
CN113869276A (en) * | 2021-10-15 | 2021-12-31 | 山东大学 | Lie recognition method and system based on micro-expression |
CN115223082A (en) * | 2022-07-19 | 2022-10-21 | 重庆邮电大学 | Aerial video classification method based on space-time multi-scale transform |
Also Published As
Publication number | Publication date |
---|---|
CN116522212B (en) | 2023-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110751208B (en) | Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder | |
CN110728997B (en) | Multi-modal depression detection system based on context awareness | |
CN112329746B (en) | Multi-mode lie detection method, device and equipment | |
Wazalwar et al. | Interpretation of sign language into English using NLP techniques | |
CN111401268B (en) | Multi-mode emotion recognition method and device for open environment | |
CN113380271B (en) | Emotion recognition method, system, device and medium | |
CN113204675B (en) | Cross-modal video time retrieval method based on cross-modal object inference network | |
CN114339450A (en) | Video comment generation method, system, device and storage medium | |
CN112329438A (en) | Automatic lie detection method and system based on domain confrontation training | |
CN116563751B (en) | Multi-mode emotion analysis method and system based on attention mechanism | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
Zhang | Voice keyword retrieval method using attention mechanism and multimodal information fusion | |
CN114661951A (en) | Video processing method and device, computer equipment and storage medium | |
CN114398505A (en) | Target word determining method, model training method and device and electronic equipment | |
CN111653270A (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN116522212B (en) | Lie detection method, device, equipment and medium based on image text fusion | |
Kummar et al. | Edu-bot: an ai based smart chatbot for knowledge management system | |
CN115687910A (en) | Data processing method and device, computer equipment and readable storage medium | |
CN115223214A (en) | Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment | |
CN117235605B (en) | Sensitive information classification method and device based on multi-mode attention fusion | |
US20180160963A1 (en) | Sensors and Analytics for Reading Comprehension | |
CN114764899B (en) | Method for predicting next interaction object based on transformation first view angle | |
Chempavathy et al. | Deep learning implemented communication system for the auditory and verbally challenged | |
CN116030526B (en) | Emotion recognition method, system and storage medium based on multitask deep learning | |
CN117150320B (en) | Dialog digital human emotion style similarity evaluation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||