CN116091956A - Video-based micro-expression recognition method, device and storage medium - Google Patents

Video-based micro-expression recognition method, device and storage medium

Info

Publication number
CN116091956A
Authority
CN
China
Prior art keywords
expression
micro
feature vector
feature
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211097935.7A
Other languages
Chinese (zh)
Inventor
赵幸福
程安民
赵国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd filed Critical Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN202211097935.7A priority Critical patent/CN116091956A/en
Publication of CN116091956A publication Critical patent/CN116091956A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a video-based micro-expression recognition method, device and storage medium, which can acquire target video frame data containing micro-expressions; input the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector; input the target video frame data into a trained feature extraction network model to obtain a second feature vector; and input the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression identification result. The invention combines a CNN and a Transformer model and takes both spatial semantic and temporal semantic information into account: high-resolution, low-frame-rate data are input to the CNN model, which is better at extracting spatial semantic information, while low-resolution, high-frame-rate data are input to the Transformer model. On this basis, micro-expression recognition can be performed efficiently and accurately and applied to different application scenarios.

Description

Video-based micro-expression recognition method, device and storage medium
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method, an apparatus, and a storage medium for identifying a micro-expression based on video.
Background
With the continuous development of internet technology, financial products have also begun to be sold through online channels for customers' convenience. During the sales process, the video of the session can be recorded and micro-expression recognition technology can be used to analyze the genuine emotions that the customer attempts to suppress in the video, so as to recognize the customer's emotions and avoid risks in the sales process.
The existing micro-expression recognition algorithms need to complete two tasks: feature extraction and expression recognition. "Feature extraction" refers to detecting and extracting micro-expressions from a suitably preprocessed video image sequence through various feature extraction schemes, for example, optical-flow-based feature extraction or feature extraction based on the LBP-TOP operator (a spatio-temporal local texture operator). "Expression recognition" is actually a classification task: the extracted micro-expressions are classified into preset categories, such as happiness, sadness, surprise, anger, disgust and fear, so that the meaning of each micro-expression is finally determined.
Existing expression recognition methods are implemented with CNNs (convolutional neural networks): a constructed CNN model is first trained with a training data set, and classification and recognition are then performed with the trained model. However, when a CNN is used for recognition and classification, it cannot exploit the temporal information of the video image sequence (in the feature input layer of a CNN, the correlation between features is not represented, and the input-layer neurons are treated equivalently). That is, a CNN can only recognize a single image frame in the video and cannot learn the changes or associations between adjacent frames. Yet a micro-expression is a motion presented in a localized area of the face over a short period of time, and temporal information is a very important part of recognizing and distinguishing micro-expressions. Ignoring the temporal information therefore degrades the micro-expression recognition performance of a CNN.
The prior patent application with publication number CN111738160A uses a pre-trained weight layer to extract features from each frame independently, then separates the frames into sets according to similarity, normalizes the similarity within each set, and inputs the image feature vectors combined with the weight values into a convolutional neural network, or directly into softmax, for expression classification. This method considers temporal information, but in actual processing the features are extracted frame by frame, the influence of the time domain is not fully considered during feature extraction, and the accuracy is not high.
Thus, there is a need for a method of video-based micro-expression recognition.
Disclosure of Invention
The invention aims to solve the technical problem of how to efficiently and accurately recognize micro-expressions based on video frame data. To this end, the embodiments of the invention provide a video-based micro-expression recognition method, device and storage medium.
According to a first aspect of an embodiment of the present invention, there is provided a video-based micro-expression recognition method, including:
acquiring target video frame data containing micro expressions;
inputting the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector;
inputting the target video frame data into a trained feature extraction network model to obtain a second feature vector;
and inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression, and obtaining a micro-expression identification result.
Optionally, wherein the method further comprises:
sampling the target video frame data according to a preset step length, and inputting the sampled video frame data into a trained convolutional neural network classification model.
Optionally, the value range of the preset step length N is [1,8].
Optionally, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
sampling the micro-expression video frame sample data and the macro-expression video frame sample data according to a preset step length to obtain a first sample data set;
training an initial convolutional neural network classification model based on the first sample data set, using expression categories as labels, and using a cross entropy loss function optimization model to obtain a trained convolutional neural network classification model;
wherein, the convolutional neural network classification model comprises: an input layer, a convolutional neural network layer, a flattening layer, a full connection layer and a softmax classification layer which are connected in sequence; the output of the convolutional neural network layer is a feature vector; the convolutional neural network layer includes a plurality of 3D convolutional layers.
Optionally, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
performing Facial Action Coding System (FACS) labeling on the micro-expression video frame sample data and the macro-expression video frame sample data to obtain a second sample data set;
Training an initial feature extraction network model based on the second sample data set to obtain a trained feature extraction network model;
wherein, the feature extraction network model includes: an input layer, an encoding layer, and a Transformer decoding layer.
Optionally, inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression, and obtaining a micro-expression identification result includes:
flattening and fully connecting the second feature vector, and reducing the feature dimension of the second feature vector to be the same as the feature dimension of the first feature vector so as to obtain a reduced-dimension second feature vector;
performing feature stitching on the first feature vector and the second feature vector subjected to dimension reduction to obtain stitching features;
inputting the spliced features to a full-connection layer, and carrying out micro-expression classification and identification based on a softmax classification layer to obtain a micro-expression identification result.
Optionally, wherein the method further comprises:
determining weights occupied by a convolutional neural network classification model and a feature extraction network model;
inputting the micro-expression video frame sample data into a trained convolutional neural network classification model to obtain a third feature vector;
inputting the micro-expression video frame sample data into a feature extraction network model to obtain a fourth feature vector;
training the initial feature fusion classification network model based on the third feature vector and the fourth feature vector to obtain a trained feature fusion classification network model;
wherein, the feature fusion classification network model comprises: the device comprises a first input layer, a first characteristic output layer, a characteristic splicing layer, a first full-connection layer and a softmax classification layer which are sequentially connected, and a second input layer, a second characteristic output layer, a flattening layer and a second full-connection layer which are sequentially connected, wherein the second full-connection layer is connected with the characteristic splicing layer; the first feature output layer includes: convolutional neural network layers, flattening layers, and fully connected layers in the convolutional neural network classification model.
According to another aspect of an embodiment of the present invention, there is also provided a storage medium including a stored program, wherein the method described above is performed by a processor when the program is run.
According to still another aspect of an embodiment of the present invention, there is provided a video-based micro-expression recognition apparatus, the apparatus including:
the data acquisition module is used for acquiring target video frame data containing micro expressions;
The first feature vector acquisition module is used for inputting the target video frame data into a trained convolutional neural network classification model so as to acquire a first feature vector;
the second feature vector acquisition module is used for inputting the target video frame data into the trained feature extraction network model so as to acquire a second feature vector;
and the identification module is used for inputting the first feature vector and the second feature vector into the trained feature fusion classification network model to identify the micro-expression, and acquiring a micro-expression identification result.
According to still another aspect of an embodiment of the present invention, there is provided a video-based micro-expression recognition apparatus, the apparatus including:
a processor; and
a memory, coupled to the processor, for providing instructions to the processor to process the following processing steps:
acquiring target video frame data containing micro expressions;
inputting the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector;
inputting the target video frame data into a trained feature extraction network model to obtain a second feature vector;
and inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression identification result. The embodiment of the invention provides a video-based micro-expression recognition method, device and storage medium, which can acquire target video frame data containing micro-expressions; input the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector; input the target video frame data into a trained feature extraction network model to obtain a second feature vector; and input the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression identification result. The invention combines a CNN and a Transformer model and takes both spatial semantic and temporal semantic information into account: high-resolution, low-frame-rate data are input to the CNN model, which is better at extracting spatial semantic information, while low-resolution, high-frame-rate data are input to the Transformer model. On this basis, micro-expression recognition can be performed efficiently and accurately and applied to different application scenarios.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
Exemplary embodiments of the present invention may be more completely understood in consideration of the following drawings:
fig. 1 is a block diagram of a hardware configuration of a computer terminal (or mobile device) for implementing a method according to embodiment 1 of the present invention;
fig. 2 is a flowchart of a video-based micro-expression recognition method according to the first aspect of embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a convolutional neural network classification model according to embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of a feature extraction network model according to embodiment 1 of the invention;
FIG. 5 is a schematic diagram of a feature fusion classification model according to embodiment 1 of the invention;
fig. 6 is a schematic structural diagram of a video-based micro-expression recognition apparatus according to embodiment 2 of the present invention;
fig. 7 is a schematic structural diagram of a video-based micro-expression recognition apparatus according to embodiment 3 of the present invention.
Detailed Description
Hereinafter, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present invention and not all embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present invention are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present invention, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in an embodiment of the invention may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in the present invention merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In the present invention, the character "/" generally indicates that the related objects before and after it are in an "or" relationship.
It should also be understood that the description of the embodiments of the present invention emphasizes the differences between the embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations with electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the foregoing, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Example 1
According to the present embodiment, a method embodiment of a video-based micro-expression recognition method is provided, and it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different from that herein.
The method embodiments provided in this embodiment may be performed in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal (or mobile device) for implementing the video-based micro-expression recognition method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing means such as a GPU, a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Furthermore, the data processing circuitry may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in embodiments of the invention, the data processing circuit acts as a kind of processor control (for example, selection of a variable-resistance terminal path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as a program instruction/data storage device corresponding to the video-based micro-expression recognition method in the embodiment of the present invention, and the processor 102 executes the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, implement the video-based micro-expression recognition method of the application program. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a specific example, and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
In the above-described operating environment, according to a first aspect of the present embodiment, there is provided a video-based micro-expression recognition method. Fig. 2 shows a schematic flow chart of the method, and referring to fig. 2, the method includes:
step 201, obtaining target video frame data containing micro-expressions.
Because a micro-expression typically lasts only about 0.5 seconds, the present invention may specifically select 16 frames as the number of frames per input sample for a video input with a frame rate of 30 fps.
The target video frame data may be video data under different application scenarios, such as a scenario of loan application, insurance, etc.
Step 202, inputting the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector.
Optionally, wherein the method further comprises:
sampling the target video frame data according to a preset step length, and inputting the sampled video frame data into a trained convolutional neural network classification model.
Optionally, the value range of the preset step length N is [1,8].
Optionally, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
Sampling the micro-expression video frame sample data and the macro-expression video frame sample data according to a preset step length to obtain a first sample data set;
training an initial convolutional neural network classification model based on the first sample data set, using expression categories as labels, and using a cross entropy loss function optimization model to obtain a trained convolutional neural network classification model;
wherein, the convolutional neural network classification model comprises: an input layer, a convolutional neural network layer, a flattening layer, a full connection layer and a softmax classification layer which are connected in sequence; the output of the convolutional neural network layer is a feature vector; the convolutional neural network layer includes a plurality of 3D convolutional layers.
Referring to fig. 3, in the present invention, the structure of the convolutional neural network classification model includes: an input layer, a convolutional neural network layer, a flattening layer, a full connection layer and a softmax classification layer which are connected in sequence; the output of the convolutional neural network layer is a feature vector; the convolutional neural network layer adopts stacked 3D convolutional layers, i.e., it includes a plurality of 3D convolutional layers. Stacked 3D convolutions are more sensitive to detail differences between video frames, and because the motion amplitude of micro-expressions is tiny, strong spatial semantic information is an essential feature for micro-expression recognition; this is why the convolutional neural network layer in the invention adopts stacked 3D convolutions. The main function of the convolutional neural network classification model is to preferentially extract the spatial semantic features in the video frames, so a higher resolution (256×256) is selected to facilitate extracting the spatial semantic features; to guarantee speed, the target video frame data are sampled with a step size N, where 1 ≤ N ≤ 8, before being input into the convolutional neural network classification model.
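As an illustration only, a minimal PyTorch sketch of such a classification model is given below (stacked 3D convolutions, a flattening layer, a full connection layer whose output serves as the feature vector, and a classification head). Channel counts and layer sizes are assumptions, not values disclosed by the patent, and the softmax is left to the loss function or to inference.

```python
import torch
import torch.nn as nn

class Cnn3dClassifier(nn.Module):
    """Hypothetical 3D-CNN classifier: stacked 3D conv -> flatten -> FC -> class logits."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # Stacked 3D convolutional layers (the "convolutional neural network layer").
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 4, 4)),       # tolerates different sampled frame counts
        )
        self.flatten = nn.Flatten()                 # flattening layer
        self.fc1 = nn.Linear(128 * 4 * 4, 512)      # FC1: its output is used as the "first feature vector"
        self.fc2 = nn.Linear(512, num_classes)      # classification head (softmax applied in the loss)

    def forward(self, x, return_features: bool = False):
        # x: (batch, 3, T, 256, 256) -- high-resolution, step-N-sampled frames
        feat = self.fc1(self.flatten(self.backbone(x)))
        return feat if return_features else self.fc2(feat)

clip = torch.randn(1, 3, 8, 256, 256)               # 16 frames sampled with N = 2
logits = Cnn3dClassifier()(clip)                    # shape (1, 7)
```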
In the invention, when training the convolutional neural network classification model, sample data needs to be acquired first. This includes: acquiring labeled micro-expression videos and cutting them according to the annotation information, taking 16 frames as one sample; and, because micro-expression data are scarce, also collecting labeled macro-expression video data, which are easy to acquire, again taking every 16 frames as one sample. During training, the micro-expression and macro-expression samples are sampled with a step size of N, the sampled data are used as training samples of the convolutional neural network classification model, the expression category is used as the label, and the model is optimized with a cross entropy loss function, thereby obtaining a trained convolutional neural network classification model. Specifically, the convolutional neural network classification model is first trained on the macro-expression data set, which is easy to acquire and large in quantity, and then fine-tuned on the micro-expression data set, so as to alleviate the problem of few micro-expression samples and improve the training effect.
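The following sketch, which is not part of the patent text, illustrates this training recipe under the assumption of the hypothetical Cnn3dClassifier above: the model is optimised with cross entropy on the macro-expression set first and then fine-tuned on the micro-expression set. Loader names, epoch counts and learning rates are placeholders.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr):
    criterion = nn.CrossEntropyLoss()                # cross entropy loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:                 # clips: (B, 3, T, 256, 256), labels: expression categories
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
    return model

# model = Cnn3dClassifier()
# train(model, macro_expression_loader, epochs=30, lr=1e-3)   # pre-train on macro-expressions
# train(model, micro_expression_loader, epochs=10, lr=1e-4)   # fine-tune on micro-expressions
```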
In the invention, when the target video frame data are to be recognized, they are sampled with the set step size, the sampled data are input into the trained convolutional neural network classification model, and the feature vector output by the FC1 layer of the convolutional neural network classification model shown in fig. 3 is taken as the first feature vector.
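As a minimal illustration of the step-size sampling described above (function name and tensor layout are assumptions, not from the patent):

```python
import torch

def sample_by_step(clip: torch.Tensor, step: int = 2) -> torch.Tensor:
    """clip: (3, 16, H, W); returns every step-th frame along the time axis."""
    assert 1 <= step <= 8, "preset step length N is expected to lie in [1, 8]"
    return clip[:, ::step]

clip = torch.randn(3, 16, 256, 256)        # one 16-frame window at 256x256
sampled = sample_by_step(clip, step=2)     # (3, 8, 256, 256), fed to the CNN branch
```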
Step 203, inputting the target video frame data into the trained feature extraction network model to obtain a second feature vector.
Optionally, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
performing Facial Action Coding System (FACS) labeling on the micro-expression video frame sample data and the macro-expression video frame sample data to obtain a second sample data set;
training an initial feature extraction network model based on the second sample data set to obtain a trained feature extraction network model;
wherein, the feature extraction network model includes: an input layer, an encoding layer, and a Transformer decoding layer.
Referring to fig. 4, in the present invention, the structure of the feature extraction network model includes: an input layer, a ViViT encoding layer, and a Transformer decoding layer. The main function of the feature extraction network model is to focus on extracting temporal semantic features from the video frames. For this model, a low-resolution (128×128), high-frame-rate input is chosen: the number of frames is the original number of video frames, i.e., 16 frames. The main body of the model adopts ViViT, which has a strong capability of extracting temporal features. A micro-expression is a short-duration action with a tiny motion amplitude, so temporal features are an indispensable part of distinguishing micro-expressions with high precision.
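The sketch below is a simplified stand-in for this branch, assuming a ViViT-style tubelet embedding followed by a standard Transformer encoder and decoder; the real ViViT uses a more elaborate factorised spatio-temporal attention, and all dimensions here are illustrative only.

```python
import torch
import torch.nn as nn

class VideoTransformerExtractor(nn.Module):
    """Hypothetical ViViT-like encoder plus Transformer decoder for temporal features."""
    def __init__(self, d_model: int = 256, num_queries: int = 16):
        super().__init__()
        # Tubelet embedding: split the (16, 128, 128) clip into 2x16x16 patches.
        self.embed = nn.Conv3d(3, d_model, kernel_size=(2, 16, 16), stride=(2, 16, 16))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))  # e.g. one query per time step

    def forward(self, x):
        # x: (batch, 3, 16, 128, 128) -- low resolution, original frame rate
        tokens = self.embed(x).flatten(2).transpose(1, 2)    # (batch, num_tokens, d_model)
        memory = self.encoder(tokens)                         # ViViT-style encoding
        queries = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.decoder(queries, memory)                  # "second feature vector", (batch, 16, d_model)

clip = torch.randn(1, 3, 16, 128, 128)
features = VideoTransformerExtractor()(clip)                  # torch.Size([1, 16, 256])
```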
In the invention, FACS (Facial Action Coding System) labeling is carried out on the labeled micro-expression and macro-expression video samples obtained in the above steps, and the FACS labels are used to train the feature extraction network model. Because the labels used during training are FACS sequences, the model becomes more fine-grained, accurate and sensitive to micro-expression action changes, and the model performs better. The feature extraction network model is first trained on the relatively easy-to-obtain labeled macro-expression data; after the model converges, it is fine-tuned on the micro-expression sample data set to obtain the trained feature extraction network model.
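The patent states that FACS sequences serve as labels but does not specify the training objective. The sketch below is one hedged possibility: each decoder output token predicts the active facial action units (AUs) of one time step and the branch is trained with binary cross entropy. The prediction head, the number of action units and the loader format are all assumptions.

```python
import torch
import torch.nn as nn

NUM_AUS = 44                                          # assumed number of FACS action units

class FacsHead(nn.Module):
    """Hypothetical per-step action-unit prediction head on top of the decoder output."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_model, NUM_AUS)       # per-step AU logits

    def forward(self, decoder_out):                   # (B, num_steps, d_model)
        return self.proj(decoder_out)                 # (B, num_steps, NUM_AUS)

# extractor, head = VideoTransformerExtractor(), FacsHead()
def train_extractor(extractor, head, loader, epochs=10, lr=1e-4):
    criterion = nn.BCEWithLogitsLoss()
    params = list(extractor.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for clips, au_targets in loader:              # au_targets: (B, num_steps, NUM_AUS) in {0, 1}
            loss = criterion(head(extractor(clips)), au_targets.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return extractor
```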
In the invention, when the target video frame data are to be recognized, they are directly input into the trained feature extraction network model, and the feature vector output by the ViViT layer of the feature extraction network model shown in fig. 4 is taken as the second feature vector.
In the present invention, the convolutional neural network classification model may also be replaced by other convolutional video classification models, such as 2D CNN + 3D CNN models, I3D, TPN, etc. The encoder portion of the feature extraction network model may also use other models that apply a Transformer to video classification, such as Video Swin Transformer, VidTr, TimeSformer, etc.
And 204, inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression, and obtaining a micro-expression identification result.
Optionally, inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression, and obtaining a micro-expression identification result includes:
flattening and fully connecting the second feature vector, and reducing the feature dimension of the second feature vector to be the same as the feature dimension of the first feature vector so as to obtain a reduced-dimension second feature vector;
performing feature stitching on the first feature vector and the second feature vector subjected to dimension reduction to obtain stitching features;
inputting the spliced features to a full-connection layer, and carrying out micro-expression classification and identification based on a softmax classification layer to obtain a micro-expression identification result.
Optionally, wherein the method further comprises:
determining weights occupied by a convolutional neural network classification model and a feature extraction network model;
inputting the micro-expression video frame sample data into a trained convolutional neural network classification model to obtain a third feature vector;
inputting the micro-expression video frame sample data into a feature extraction network model to obtain a fourth feature vector;
training the initial feature fusion classification network model based on the third feature vector and the fourth feature vector to obtain a trained feature fusion classification network model;
wherein, the feature fusion classification network model comprises: the device comprises a first input layer, a first characteristic output layer, a characteristic splicing layer, a first full-connection layer and a softmax classification layer which are sequentially connected, and a second input layer, a second characteristic output layer, a flattening layer and a second full-connection layer which are sequentially connected, wherein the second full-connection layer is connected with the characteristic splicing layer; the first feature output layer includes: convolutional neural network layers, flattening layers, and fully connected layers in the convolutional neural network classification model.
Referring to fig. 5, the feature fusion classification network model of the present invention combines the convolutional neural network classification model and the feature extraction network model, and its structure includes: a first input layer (input 1), a first feature output layer (namely the convolutional neural network part), a feature splicing layer, a first full-connection layer (comprising FC1 and FC2) and a softmax classification layer which are sequentially connected, and a second input layer (input 2), a second feature output layer (namely the ViViT part), a flattening layer (flatten) and a second full-connection layer (FC) which are sequentially connected, wherein the second full-connection layer is connected with the feature splicing layer; the first feature output layer includes: the convolutional neural network layer, flattening layer and fully connected layer in the convolutional neural network classification model.
In the invention, the model fusion part mainly performs flattening and fully connected operations in sequence on the feature vector output by the feature extraction network model, reducing its feature dimension to be consistent with the dimension output by the convolutional neural network classification model; the features are then spliced, reduced and fused through FC1, and classified through FC2 and softmax to obtain the micro-expression class. The output feature layer of the convolutional neural network classification model is the FC1 layer of the convolutional neural network classification model shown in fig. 3, and the output feature of the feature extraction network model is the feature output by ViViT.
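A minimal sketch of this fusion head follows, assuming the hypothetical feature shapes used in the sketches above (a 512-dimensional CNN feature and a 16×256 ViViT feature); all dimensions are assumptions rather than values given by the patent.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Hypothetical fusion head: flatten + FC on the ViViT feature, concatenate, FC1, FC2."""
    def __init__(self, cnn_dim: int = 512, vivit_tokens: int = 16,
                 vivit_dim: int = 256, num_classes: int = 7):
        super().__init__()
        self.flatten = nn.Flatten()                                  # flattening layer
        self.reduce = nn.Linear(vivit_tokens * vivit_dim, cnn_dim)   # second full-connection layer
        self.fc1 = nn.Linear(2 * cnn_dim, 256)                       # FC1: reduce and fuse the spliced features
        self.fc2 = nn.Linear(256, num_classes)                       # FC2 (softmax applied in the loss / at inference)

    def forward(self, first_feat, second_feat):
        # first_feat: (B, cnn_dim) from the CNN branch; second_feat: (B, 16, vivit_dim) from the ViViT branch
        reduced = self.reduce(self.flatten(second_feat))
        fused = torch.cat([first_feat, reduced], dim=1)              # feature splicing layer
        return self.fc2(torch.relu(self.fc1(fused)))

first = torch.randn(1, 512)
second = torch.randn(1, 16, 256)
logits = FusionClassifier()(first, second)                           # shape (1, 7)
```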
Convolution is a local operation, and a convolution layer generally only models the relationship between neighboring pixels, whereas a Transformer is a global operation: one Transformer layer can model the relationships among all pixels. The two complement each other well. Therefore, in addition to better extracting the spatial semantic relations and the temporal semantic relations, the invention adopts a combined CNN and Transformer (ViViT) scheme to recognize micro-expressions.
In the invention, the weights of the convolutional neural network (the convolutional neural network classification model) and of ViViT (the feature extraction network model) are fixed when the fusion model is trained; neither of the two models is optimized during fusion training. The samples used during training are the micro-expression data set: the ViViT branch receives the original 16 frames of sample data (input 2 in fig. 5), the convolutional neural network branch receives the samples sampled with step size N (for example, N=2), and a cross entropy loss function is used until convergence.
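A hedged sketch of this fusion-stage training, reusing the hypothetical modules from the sketches above: both branches are frozen and only the fusion head is optimised with cross entropy. The loader is assumed to yield a high-resolution clip for the CNN branch and a low-resolution clip for the ViViT branch; names and hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn

def train_fusion(cnn, vivit, fusion, loader, epochs: int = 10, step: int = 2):
    for p in cnn.parameters():
        p.requires_grad = False           # fixed weights, not optimised
    for p in vivit.parameters():
        p.requires_grad = False           # fixed weights, not optimised
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-3)
    for _ in range(epochs):
        for hi_clips, lo_clips, labels in loader:
            # hi_clips: (B, 3, 16, 256, 256) -> sampled with step N for the CNN branch (input 1)
            # lo_clips: (B, 3, 16, 128, 128) -> original 16 frames for the ViViT branch (input 2)
            with torch.no_grad():
                third = cnn(hi_clips[:, :, ::step], return_features=True)   # "third feature vector"
                fourth = vivit(lo_clips)                                     # "fourth feature vector"
            loss = criterion(fusion(third, fourth), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return fusion
```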
During inference, the structure in the fusion model diagram is used. Every 16 frames in the video are used as one model input: the original 16 frames are used as input 2 in fig. 5, the data sampled with step size N are used as input 1 in fig. 5 (for example, N=2), and the model output is the micro-expression recognition result. Specifically, in the invention, the first feature vector and the second feature vector are input into the trained feature fusion classification network model; flattening and fully connected operations are performed on the second feature vector, reducing its feature dimension to be the same as that of the first feature vector so as to obtain a dimension-reduced second feature vector; feature splicing is performed on the first feature vector and the dimension-reduced second feature vector to obtain spliced features; and the spliced features are input to a full-connection layer, and micro-expression classification and recognition are carried out based on the softmax classification layer to obtain the micro-expression recognition result.
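For one 16-frame window, inference could look like the following sketch; the resize sizes, module names and step value are assumptions carried over from the hypothetical sketches above, not details fixed by the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognise_window(frames, cnn, vivit, fusion, step: int = 2):
    """frames: (16, 3, H, W) float tensor for one 16-frame video window."""
    hi = F.interpolate(frames, size=(256, 256)).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 16, 256, 256)
    lo = F.interpolate(frames, size=(128, 128)).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 16, 128, 128)
    first = cnn(hi[:, :, ::step], return_features=True)        # input 1: step-N-sampled, high resolution
    second = vivit(lo)                                          # input 2: all 16 frames, low resolution
    probs = torch.softmax(fusion(first, second), dim=1)         # softmax classification layer
    return int(probs.argmax(dim=1))                             # predicted micro-expression category
```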
As described in the foregoing background, when a CNN is used for recognition and classification in the prior art, it cannot exploit the temporal information of the video image sequence (in the feature input layer of a CNN, the correlation between features is not represented, and the input-layer neurons are treated equivalently). That is, a CNN can only recognize a single image frame in the video and cannot learn the changes or associations between adjacent frames. Yet a micro-expression is a motion presented in a localized area of the face over a short period of time, and temporal information is a very important part of recognizing and distinguishing micro-expressions. Ignoring the temporal information therefore degrades the micro-expression recognition performance of a CNN.
Aiming at the problems in the background art, the present embodiment acquires target video frame data containing micro-expressions; inputs the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector; inputs the target video frame data into a trained feature extraction network model to obtain a second feature vector; and inputs the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression identification result. The invention combines a CNN and a Transformer model and takes both spatial semantic and temporal semantic information into account: high-resolution, low-frame-rate data are input to the CNN model, which is better at extracting spatial semantic information, while low-resolution, high-frame-rate data are input to the Transformer model. On this basis, micro-expression recognition can be performed efficiently and accurately and applied to different application scenarios.
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium 104. The storage medium 104 includes a stored program, wherein the method of any of the above is performed by a processor when the program is run.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Example 2
Fig. 6 shows a video-based micro-expression recognition apparatus according to the present embodiment, which corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 6, the apparatus includes:
a data acquisition module 301, configured to acquire target video frame data including a micro expression;
the first feature vector obtaining module 302 is configured to input the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector;
a second feature vector obtaining module 303, configured to input the target video frame data into a trained feature extraction network model to obtain a second feature vector;
and the recognition module 304 is configured to input the first feature vector and the second feature vector to a trained feature fusion classification network model to perform recognition of the micro-expression, and obtain a micro-expression recognition result.
Optionally, the first feature vector obtaining module 302 further includes:
sampling the target video frame data according to a preset step length, and inputting the sampled video frame data into a trained convolutional neural network classification model.
Optionally, the value range of the preset step length N is [1,8].
Optionally, wherein the apparatus further comprises: the convolutional neural network classification model training module is used for:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
sampling the micro-expression video frame sample data and the macro-expression video frame sample data according to a preset step length to obtain a first sample data set;
training an initial convolutional neural network classification model based on the first sample data set, using expression categories as labels, and using a cross entropy loss function optimization model to obtain a trained convolutional neural network classification model;
wherein, the convolutional neural network classification model comprises: an input layer, a convolutional neural network layer, a flattening layer, a full connection layer and a softmax classification layer which are connected in sequence; the output of the convolutional neural network layer is a feature vector; the convolutional neural network layer includes a plurality of 3D convolutional layers.
Optionally, wherein the apparatus further comprises: the feature extraction network model training module is used for:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
performing Facial Action Coding System (FACS) labeling on the micro-expression video frame sample data and the macro-expression video frame sample data to obtain a second sample data set;
training an initial feature extraction network model based on the second sample data set to obtain a trained feature extraction network model;
wherein, the feature extraction network model includes: an input layer, an encoding layer, and a Transformer decoding layer.
Optionally, the identifying module inputs the first feature vector and the second feature vector to a trained feature fusion classification network model to identify the micro-expression, and obtains a micro-expression identification result, including:
flattening and fully connecting the second feature vector, and reducing the feature dimension of the second feature vector to be the same as the feature dimension of the first feature vector so as to obtain a reduced-dimension second feature vector;
performing feature stitching on the first feature vector and the second feature vector subjected to dimension reduction to obtain stitching features;
inputting the spliced features to a full-connection layer, and carrying out micro-expression classification and identification based on a softmax classification layer to obtain a micro-expression identification result.
Optionally, wherein the apparatus further comprises: the feature fusion classification network model training module is used for:
Determining weights occupied by a convolutional neural network classification model and a feature extraction network model;
inputting the micro-expression video frame sample data into a trained convolutional neural network classification model to obtain a third feature vector;
inputting the micro-expression video frame sample data into a feature extraction network model to obtain a fourth feature vector;
training the initial feature fusion classification network model based on the third feature vector and the fourth feature vector to obtain a trained feature fusion classification network model;
wherein, the feature fusion classification network model comprises: the device comprises a first input layer, a first characteristic output layer, a characteristic splicing layer, a first full-connection layer and a softmax classification layer which are sequentially connected, and a second input layer, a second characteristic output layer, a flattening layer and a second full-connection layer which are sequentially connected, wherein the second full-connection layer is connected with the characteristic splicing layer; the first feature output layer includes: convolutional neural network layers, flattening layers, and fully connected layers in the convolutional neural network classification model.
Thus, according to the present embodiment, target video frame data containing micro-expressions can be acquired; the target video frame data are input into a trained convolutional neural network classification model to obtain a first feature vector; the target video frame data are input into a trained feature extraction network model to obtain a second feature vector; and the first feature vector and the second feature vector are input into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression identification result. The invention combines a CNN and a Transformer model and takes both spatial semantic and temporal semantic information into account: high-resolution, low-frame-rate data are input to the CNN model, which is better at extracting spatial semantic information, while low-resolution, high-frame-rate data are input to the Transformer model. On this basis, micro-expression recognition can be performed efficiently and accurately and applied to different application scenarios.
Example 3
Fig. 7 shows a video-based micro-expression recognition apparatus 400 according to the present embodiment, the apparatus 400 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 7, the apparatus 400 includes: a processor 410; and a memory 420 coupled to the processor 410 for providing instructions to the processor 410 for processing the following processing steps:
acquiring target video frame data containing micro expressions;
inputting the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector;
inputting the target video frame data into a trained feature extraction network model to obtain a second feature vector;
and inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression, and obtaining a micro-expression identification result.
Optionally, wherein the method further comprises:
sampling the target video frame data according to a preset step length, and inputting the sampled video frame data into a trained convolutional neural network classification model.
Optionally, the value range of the preset step length N is [1,8].
Optionally, wherein the method further comprises:
Acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
sampling the micro-expression video frame sample data and the macro-expression video frame sample data according to a preset step length to obtain a first sample data set;
training an initial convolutional neural network classification model based on the first sample data set, using expression categories as labels, and using a cross entropy loss function optimization model to obtain a trained convolutional neural network classification model;
wherein, the convolutional neural network classification model comprises: an input layer, a convolutional neural network layer, a flattening layer, a full connection layer and a softmax classification layer which are connected in sequence; the output of the convolutional neural network layer is a feature vector; the convolutional neural network layer includes a plurality of 3D convolutional layers.
Optionally, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
performing Facial Action Coding System (FACS) labeling on the micro-expression video frame sample data and the macro-expression video frame sample data to obtain a second sample data set;
training an initial feature extraction network model based on the second sample data set to obtain the trained feature extraction network model;
wherein the feature extraction network model includes: an input layer, an encoding layer, and a Transformer decoding layer.
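One way to realize this input layer / encoding layer / Transformer decoding layer structure is sketched below; treating each low-resolution frame as one token, the learnable query and all dimensions are assumptions made for illustration, and the FACS-supervised training head is omitted:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Input layer -> encoding layer -> Transformer decoding layer."""

    def __init__(self, frame_dim=3 * 112 * 112, d_model=256, num_heads=4, num_layers=2):
        super().__init__()
        self.encode = nn.Linear(frame_dim, d_model)               # encoding layer: one token per frame
        self.query = nn.Parameter(torch.zeros(1, 1, d_model))     # learnable query that pools the frame sequence
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)   # Transformer decoding layer

    def forward(self, clip):                                      # clip: (N, T, 3, H, W), low resolution, high frame rate
        tokens = self.encode(clip.flatten(2))                     # (N, T, d_model) token sequence
        query = self.query.expand(clip.size(0), -1, -1)
        return self.decoder(query, tokens)                        # (N, 1, d_model): the second feature vector
```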
Optionally, inputting the first feature vector and the second feature vector into the trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression recognition result includes:
passing the second feature vector through a flattening layer and a fully connected layer to reduce its feature dimension to that of the first feature vector, so as to obtain a dimension-reduced second feature vector;
splicing (concatenating) the first feature vector and the dimension-reduced second feature vector to obtain a spliced feature;
and inputting the spliced feature into a fully connected layer and performing micro-expression classification via a softmax classification layer to obtain the micro-expression recognition result.
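A minimal sketch of this fusion and classification sequence is given below, with feature dimensions chosen to line up with the earlier sketches (128 for the first feature vector, 256 for the second) and an assumed seven expression categories:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Aligns the two feature vectors, splices them, and classifies the micro-expression."""

    def __init__(self, first_dim=128, second_dim=256, num_classes=7):
        super().__init__()
        self.flatten = nn.Flatten()                       # flattening layer applied to the second feature vector
        self.reduce = nn.Linear(second_dim, first_dim)    # fully connected layer matching the first vector's dimension
        self.fc = nn.Linear(first_dim * 2, num_classes)   # fully connected layer over the spliced feature

    def forward(self, first_feature, second_feature):
        reduced = self.reduce(self.flatten(second_feature))       # dimension-reduced second feature vector
        spliced = torch.cat([first_feature, reduced], dim=-1)     # feature splicing (concatenation)
        return self.fc(spliced)                                   # logits; softmax yields the recognition result
```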
Optionally, wherein the method further comprises:
determining the weights assigned to the convolutional neural network classification model and the feature extraction network model;
inputting the micro-expression video frame sample data into the trained convolutional neural network classification model to obtain a third feature vector;
inputting the micro-expression video frame sample data into the feature extraction network model to obtain a fourth feature vector;
training the initial feature fusion classification network model based on the third feature vector and the fourth feature vector to obtain the trained feature fusion classification network model;
wherein the feature fusion classification network model comprises: a first input layer, a first feature output layer, a feature splicing layer, a first fully connected layer and a softmax classification layer which are connected in sequence; and a second input layer, a second feature output layer, a flattening layer and a second fully connected layer which are connected in sequence, the second fully connected layer being connected to the feature splicing layer; the first feature output layer includes the convolutional neural network layer, the flattening layer and the fully connected layer of the convolutional neural network classification model.
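A possible training loop for this fusion stage is sketched below, reusing the hypothetical modules from the earlier sketches. The assumptions that the two pretrained branches stay frozen, that the branch weights are simple 0.5/0.5 scalar factors, and the DataLoader named loader are illustrative choices rather than details stated in the disclosure:

```python
import torch
import torch.nn as nn

cnn_classifier.eval()                      # trained convolutional neural network classification model (frozen)
feature_extractor.eval()                   # trained feature extraction network model (frozen)
fusion = FusionClassifier()                # initial feature fusion classification network model
w_cnn, w_trans = 0.5, 0.5                  # weights assigned to the two branches (assumed values)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-4)

for clips_hi, clips_lo, labels in loader:  # micro-expression sample batches (hypothetical DataLoader)
    with torch.no_grad():
        third = cnn_classifier.extract(clips_hi)    # third feature vector
        fourth = feature_extractor(clips_lo)        # fourth feature vector
    logits = fusion(w_cnn * third, w_trans * fourth)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```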
Thus, according to the present embodiment, target video frame data containing micro-expressions can be acquired; the target video frame data is input into a trained convolutional neural network classification model to obtain a first feature vector; the target video frame data is input into a trained feature extraction network model to obtain a second feature vector; and the first feature vector and the second feature vector are input into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression recognition result. The invention combines a CNN with a Transformer model, so that both spatial semantic information and temporal semantic information are taken into account: the CNN model, which is better suited to extracting spatial semantic information, receives high-resolution, low-frame-rate input, while the Transformer model receives low-resolution, high-frame-rate input. On this basis, micro-expression recognition can be performed efficiently and accurately, and the method can be applied to different application scenarios.
The above embodiment numbers of the present invention are merely for description and do not imply that one embodiment is better or worse than another.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for parts not described in detail in a given embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed between the parts may be implemented through some interfaces, units or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, the software product including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

1. A video-based micro-expression recognition method, the method comprising:
acquiring target video frame data containing micro-expressions;
inputting the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector;
inputting the target video frame data into a trained feature extraction network model to obtain a second feature vector;
and inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression recognition result.
2. The method according to claim 1, wherein the method further comprises:
sampling the target video frame data according to a preset step length, and inputting the sampled video frame data into a trained convolutional neural network classification model.
3. The method according to claim 1, wherein the preset step length N has a value in the range of [1, 8].
4. The method according to claim 1, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
sampling the micro-expression video frame sample data and the macro-expression video frame sample data according to a preset step length to obtain a first sample data set;
training an initial convolutional neural network classification model based on the first sample data set, using the expression categories as labels and optimizing the model with a cross-entropy loss function, to obtain the trained convolutional neural network classification model;
wherein the convolutional neural network classification model comprises: an input layer, a convolutional neural network layer, a flattening layer, a fully connected layer and a softmax classification layer which are connected in sequence; the output of the convolutional neural network layer is a feature vector; and the convolutional neural network layer includes a plurality of 3D convolutional layers.
5. The method according to claim 1, wherein the method further comprises:
acquiring micro-expression video frame sample data marked with expression categories and macro-expression video frame sample data marked with expression categories;
performing Facial Action Coding System (FACS) labeling on the micro-expression video frame sample data and the macro-expression video frame sample data to obtain a second sample data set;
training an initial feature extraction network model based on the second sample data set to obtain the trained feature extraction network model;
wherein the feature extraction network model includes: an input layer, an encoding layer, and a Transformer decoding layer.
6. The method according to claim 1, wherein inputting the first feature vector and the second feature vector into the trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression recognition result comprises:
passing the second feature vector through a flattening layer and a fully connected layer to reduce its feature dimension to that of the first feature vector, so as to obtain a dimension-reduced second feature vector;
splicing the first feature vector and the dimension-reduced second feature vector to obtain a spliced feature;
and inputting the spliced feature into a fully connected layer and performing micro-expression classification via a softmax classification layer to obtain the micro-expression recognition result.
7. The method according to claim 1, wherein the method further comprises:
determining the weights assigned to the convolutional neural network classification model and the feature extraction network model;
inputting the micro-expression video frame sample data into the trained convolutional neural network classification model to obtain a third feature vector;
inputting the micro-expression video frame sample data into the feature extraction network model to obtain a fourth feature vector;
training the initial feature fusion classification network model based on the third feature vector and the fourth feature vector to obtain the trained feature fusion classification network model;
wherein the feature fusion classification network model comprises: a first input layer, a first feature output layer, a feature splicing layer, a first fully connected layer and a softmax classification layer which are connected in sequence; and a second input layer, a second feature output layer, a flattening layer and a second fully connected layer which are connected in sequence, the second fully connected layer being connected to the feature splicing layer; the first feature output layer includes the convolutional neural network layer, the flattening layer and the fully connected layer of the convolutional neural network classification model.
8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.
9. A video-based micro-expression recognition apparatus, the apparatus comprising:
the data acquisition module is used for acquiring target video frame data containing micro-expressions;
the first feature vector acquisition module is used for inputting the target video frame data into a trained convolutional neural network classification model so as to acquire a first feature vector;
the second feature vector acquisition module is used for inputting the target video frame data into the trained feature extraction network model to obtain a second feature vector;
and the identification module is used for inputting the first feature vector and the second feature vector into the trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression recognition result.
10. A video-based micro-expression recognition apparatus, the apparatus comprising:
a processor; and
a memory, coupled to the processor, for providing the processor with instructions for processing the following steps:
acquiring target video frame data containing micro-expressions;
inputting the target video frame data into a trained convolutional neural network classification model to obtain a first feature vector;
inputting the target video frame data into a trained feature extraction network model to obtain a second feature vector;
and inputting the first feature vector and the second feature vector into a trained feature fusion classification network model to identify the micro-expression and obtain a micro-expression recognition result.