CN116824677B - Expression recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116824677B
Authority
CN
China
Prior art keywords
sample
video frame
expression
video
face video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311090087.1A
Other languages
Chinese (zh)
Other versions
CN116824677A (en)
Inventor
吴双
王晗阳
丁守鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311090087.1A priority Critical patent/CN116824677B/en
Publication of CN116824677A publication Critical patent/CN116824677A/en
Application granted granted Critical
Publication of CN116824677B publication Critical patent/CN116824677B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/172 Classification, e.g. identification
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The application discloses an expression recognition method and device, an electronic device, and a storage medium. The method of the embodiments of the application can be applied to scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. The method comprises: acquiring a face video to be recognized; inputting the face video to be recognized into an expression recognition model to obtain the probabilities that the face video to be recognized belongs to different expression categories; and determining, based on these probabilities, the expression category with the highest probability as the expression recognition result of the face video to be recognized. In the application, an adjustment parameter determined from a consistency index and a loss value accurately indicates the learning tendency of the initial model on a sample face video: it strengthens the initial model's learning of difficult samples and weakens its learning of noise samples, so that the trained expression recognition model retains accurate expression recognition on difficult samples and the accuracy of its expression recognition results is guaranteed.

Description

Expression recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an expression recognition method, an expression recognition device, an electronic device, and a storage medium.
Background
Facial expressions play an important role in person-to-person communication. By reading the other person's facial expression, people infer that person's current emotional state. Accurately recognizing facial expressions is therefore a problem that needs to be solved.
With the development of human-computer interaction and artificial intelligence, more and more neural network models have emerged. A face video to be recognized can be processed by an expression recognition model obtained by training a neural network model, yielding an expression recognition result for the face video.
At present, dynamic facial expressions can be captured in real environments to obtain collected images; the collected images are labeled manually to obtain labeled sample images, and an initial model is trained with the labeled sample images to obtain an expression recognition model.
However, the expression recognition model trained in this way performs poorly, so the accuracy of the expression recognition results it determines is low.
Disclosure of Invention
In view of the above, the embodiments of the present application provide an expression recognition method, an expression recognition device, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides an expression recognition method, including: acquiring a face video to be recognized; inputting the face video to be recognized into an expression recognition model to obtain the probabilities that the face video to be recognized belongs to different expression categories, where the expression recognition model is obtained by training an initial model with a target loss value corresponding to a sample face video, the target loss value is obtained by adjusting the loss value of the sample face video with an adjustment parameter corresponding to the sample face video, the loss value indicates the accuracy of the initial model in performing expression recognition on the sample face video, the adjustment parameter is determined from the loss value and a consistency index corresponding to the sample face video, and the consistency index indicates the degree of consistency of the expression categories recognized by the initial model for different sample video frame sequences in the sample face video; and determining, based on the probabilities that the face video to be recognized belongs to different expression categories, the expression category with the highest probability as the expression recognition result of the face video to be recognized.
In a second aspect, an embodiment of the present application provides an expression recognition apparatus, including: an acquisition module, configured to acquire a face video to be recognized; a recognition module, configured to input the face video to be recognized into an expression recognition model to obtain the probabilities that the face video to be recognized belongs to different expression categories, where the expression recognition model is obtained by training an initial model with a target loss value corresponding to a sample face video, the target loss value is obtained by adjusting the loss value of the sample face video with an adjustment parameter corresponding to the sample face video, the loss value indicates the accuracy of the initial model in performing expression recognition on the sample face video, the adjustment parameter is determined from the loss value and a consistency index corresponding to the sample face video, and the consistency index indicates the degree of consistency of the expression categories recognized by the initial model for different sample video frame sequences in the sample face video; and a determining module, configured to determine, based on the probabilities that the face video to be recognized belongs to different expression categories, the expression category with the highest probability as the expression recognition result of the face video to be recognized.
Optionally, the apparatus further comprises a training module, configured to: determine a plurality of sample video frame sequences from the sample face video; determine, through the initial model, the probability that each sample video frame sequence belongs to different expression categories; determine, according to these probabilities, the first expression category to which each sample video frame sequence belongs and the second expression category to which the sample face video belongs; determine the consistency index corresponding to the sample face video according to the proportion of each first expression category among the first expression categories to which the plurality of sample video frame sequences belong; determine the loss value of the sample face video according to the second expression category to which the sample face video belongs; determine the adjustment parameter according to the loss value and the consistency index; adjust the loss value according to the adjustment parameter to obtain the target loss value corresponding to the sample face video; and adjust the parameters of the initial model according to the target loss value to obtain the expression recognition model.
Optionally, the training module is further configured to: obtain a first value as the adjustment parameter if the consistency index is lower than an index threshold; obtain a second value as the adjustment parameter if the consistency index is not lower than the index threshold and the loss value is not higher than a loss threshold, the first value being higher than the second value; and obtain a third value as the adjustment parameter if the consistency index is not lower than the index threshold and the loss value is higher than the loss threshold, the second value being higher than the third value.
Optionally, the training module is further configured to calculate the product of the loss value and the adjustment parameter as the target loss value corresponding to the sample face video.
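As an illustration only, the adjustment rule and the loss weighting described above can be sketched as follows; the concrete thresholds and the three candidate weights are hypothetical values, since the application does not fix them here:

```python
# Hedged sketch of the adjustment-parameter rule; all numbers are illustrative
# assumptions, not values fixed by this application.
def adjustment_parameter(consistency: float, loss: float,
                         index_threshold: float = 0.6,
                         loss_threshold: float = 1.0) -> float:
    first_value, second_value, third_value = 2.0, 1.0, 0.1  # first > second > third
    if consistency < index_threshold:
        return first_value   # difficult sample: strengthen learning
    if loss <= loss_threshold:
        return second_value  # ordinary sample: keep learning as-is
    return third_value       # likely noisy label: weaken learning

def target_loss(loss: float, consistency: float) -> float:
    # Target loss value = loss value x adjustment parameter.
    return loss * adjustment_parameter(consistency, loss)
```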
Optionally, the training module is further configured to: determine, among the first expression categories to which the plurality of sample video frame sequences belong, the first expression category with the highest proportion; and use the proportion of that first expression category as the consistency index corresponding to the sample face video.
Optionally, the training module is further configured to: use the expression category with the highest probability, among the probabilities that a sample video frame sequence belongs to different expression categories, as the first expression category to which that sample video frame sequence belongs; for each expression category, calculate the mean of the probabilities that the plurality of sample video frame sequences belong to that expression category as the probability that the sample face video belongs to that expression category; and use the expression category with the highest probability, among the probabilities that the sample face video belongs to different expression categories, as the second expression category to which the sample face video belongs.
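A minimal sketch of how the first expression categories, the second expression category and the consistency index could be computed from the per-sequence probabilities, assuming the probabilities are given as a (number of sequences) x (number of classes) array; the function name and shapes are illustrative:

```python
import numpy as np

def analyse_sample(probs: np.ndarray):
    """probs: (num_sequences, num_classes) probabilities from the initial model."""
    # First expression category of each sample video frame sequence: per-sequence argmax.
    first_categories = probs.argmax(axis=1)
    # Second expression category of the sample face video: argmax of the mean probability.
    video_probs = probs.mean(axis=0)
    second_category = int(video_probs.argmax())
    # Consistency index: proportion of the most frequent first expression category.
    counts = np.bincount(first_categories, minlength=probs.shape[1])
    consistency_index = counts.max() / probs.shape[0]
    return first_categories, second_category, consistency_index
```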
Optionally, the apparatus further comprises a preprocessing module, configured to sample a plurality of input frames from the face video to be identified and determine a plurality of video frame sequences from the plurality of input frames; the recognition module is further configured to input the plurality of video frame sequences into the expression recognition model to obtain the probabilities that the face video to be identified belongs to different expression categories.
Optionally, the preprocessing module is further configured to: perform feature extraction on the plurality of input frames through a backbone network to obtain input features; determine a position parameter according to the input features, the position parameter indicating the position of key frames in the face video to be identified, a key frame being a video frame presenting a key expression; sample the face video to be identified according to the position parameter to obtain a plurality of target face video frames; and construct a plurality of video frame sequences from the plurality of target face video frames.
Optionally, the position parameter includes a center position and a step size corresponding to the key frames; the preprocessing module is further configured to: determine a central key video frame according to the center position; with the central key video frame as the center, sample the face video to be identified according to the step size to obtain a plurality of sampled video frames; and combine the plurality of sampled video frames with the central key video frame to obtain the plurality of target face video frames.
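A sketch of this center-and-step sampling, assuming 1-based frame numbers and that the center position is given as a fraction of the video length; the boundary clipping is an assumption, and the printed result reproduces the 64-frame, center 0.5, step 4 example used later for the training side:

```python
def sample_target_frames(num_frames: int, center: float, step: int, num_samples: int):
    """Return the central key frame plus frames taken on both sides of it at the
    given step; no padding is attempted when the window is clipped at a boundary."""
    center_idx = max(1, min(num_frames, round(center * num_frames)))
    indices = {center_idx}
    for k in range(1, num_samples // 2 + 1):
        for idx in (center_idx - k * step, center_idx + k * step):
            if 1 <= idx <= num_frames:
                indices.add(idx)
    return sorted(indices)

print(sample_target_frames(64, 0.5, 4, 16))
# [4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]
```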
Optionally, the expression recognition model comprises a feature extraction network and a classification layer; the recognition module is further configured to: determine, through the feature extraction network, the target feature of each video frame sequence based on the plurality of video frame sequences; determine, through the classification layer, the probability that each video frame sequence belongs to different expression categories according to the target feature of that video frame sequence; and determine the probabilities that the face video to be identified belongs to different expression categories according to the probabilities that the plurality of video frame sequences belong to different expression categories.
Optionally, the feature extraction network includes a base feature extraction network, a first feature extraction network, and a second feature extraction network; the target face video frames in each video frame sequence are arranged according to their order in the face video to be identified, and no target face video frame belonging to another video frame sequence lies between any two adjacent target face video frames of the sequence. The recognition module is further configured to: construct a plurality of auxiliary video frame sequences from the plurality of video frame sequences, where different target face video frames in an auxiliary video frame sequence belong to different video frame sequences; extract, through the base feature extraction network, the features of each video frame sequence as the initial features corresponding to that video frame sequence; extract, through the base feature extraction network, the features of each auxiliary video frame sequence as the initial features corresponding to that auxiliary video frame sequence; perform motion coding on the initial features corresponding to each video frame sequence through the first feature extraction network to obtain the motion coding features corresponding to each video frame sequence; perform expression coding through the second feature extraction network based on the motion coding features corresponding to the plurality of video frame sequences to obtain the intermediate features corresponding to each video frame sequence; perform expression coding through the second feature extraction network based on the initial features corresponding to each auxiliary video frame sequence to obtain the expression coding features corresponding to each auxiliary video frame sequence; perform motion coding on the expression coding features corresponding to the plurality of auxiliary video frame sequences through the first feature extraction network to obtain the intermediate features corresponding to each auxiliary video frame sequence; and determine the target feature of each video frame sequence according to the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences.
Optionally, the recognition module is further configured to: concatenate the intermediate features corresponding to the plurality of video frame sequences with the intermediate features corresponding to the plurality of auxiliary video frame sequences to obtain an intermediate concatenation result corresponding to each video frame sequence; acquire a global summary of the face video to be identified; and concatenate the global summary with the intermediate concatenation result corresponding to each video frame sequence to obtain the target feature corresponding to each video frame sequence.
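For illustration, one plausible reading of this feature assembly for a single video frame sequence is sketched below; the exact pairing of sequence and auxiliary features, and treating the global summary as a plain vector, are assumptions:

```python
import numpy as np

def build_target_feature(seq_intermediate: np.ndarray,
                         aux_intermediates: list[np.ndarray],
                         global_summary: np.ndarray) -> np.ndarray:
    """Concatenate one sequence's intermediate feature with the intermediate
    features of the auxiliary video frame sequences, then prepend the global
    summary of the face video, yielding the sequence's target feature."""
    spliced = np.concatenate([seq_intermediate, *aux_intermediates])
    return np.concatenate([global_summary, spliced])
```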
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the methods described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, wherein the program code, when executed by a processor, performs the method described above.
In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the electronic device to perform the method described above.
In the expression recognition method and apparatus, electronic device, and storage medium of the present application, the consistency index indicates the degree of consistency of the expression categories recognized by the initial model for different sample video frame sequences in the sample face video. In general, the expression recognition results of multiple sample video frame sequences taken from the same sample face video should in theory be consistent, i.e. the consistency index should be high; if the consistency of these results is low, the sample face video can be judged to be a difficult sample, so the consistency index reflects how difficult it is to perform expression recognition on the sample face video. The loss value indicates the accuracy of the initial model in performing expression recognition on the sample face video. In general, if the consistency index of the recognition results of multiple sample video frame sequences from the same sample face video is high but the loss value is also high, a likely reason is that the sample face video is labeled incorrectly, i.e. it is a noise sample. Based on this principle, combining the consistency index and the loss value makes it possible to recognize whether the sample face video is a difficult sample or a noise sample and to determine the adjustment parameter accordingly, so that the loss value is adjusted based on the adjustment parameter. This strengthens the initial model's learning of difficult samples and weakens its learning of noise samples, which ensures that the trained expression recognition model recognizes expressions in difficult samples accurately, effectively reduces the impact of noise samples on recognition accuracy, resolves the confusion between noise samples and difficult-sample mining in dynamic expression recognition, and thereby guarantees the accuracy of the recognition model trained with the target loss value. In addition, training the expression recognition model in this way does not rely on extra information beyond the samples themselves and can accurately distinguish ordinary samples, noise samples, and difficult samples when both noise samples and difficult samples are present, so it handles the noise-sample and difficult-sample problems better, is easy to apply to the training of large-scale real dynamic facial expression data, and improves model performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 shows a schematic diagram of an application scenario to which an embodiment of the present application is applicable;
FIG. 2 is a flowchart of a training process of an expression recognition model in an embodiment of the present application;
FIG. 3 is a schematic diagram of a sample video frame sequence and an auxiliary sample video frame sequence in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram showing a loss value adjustment process according to an embodiment of the present application;
FIG. 5 is a flowchart of an expression recognition method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a video frame sequence acquisition process in accordance with an embodiment of the present application;
FIG. 7 is a schematic diagram of an expression recognition process in an embodiment of the present application;
fig. 8 is a block diagram of an expression recognition apparatus according to an embodiment of the present application;
Fig. 9 shows a block diagram of an electronic device for performing an expression recognition method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort fall within the protection scope of the present application.
In the following description, the terms "first", "second", and the like are used only to distinguish similar objects and do not imply a particular ordering of the objects. It should be understood that "first", "second", and the like may be interchanged where permitted, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
It should be noted that "a plurality of" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The application discloses an expression recognition method, an expression recognition device, electronic equipment and a storage medium, and relates to an artificial intelligence technology.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning, and the like.
Currently, in the field of facial expression recognition, facial expressions can also be recognized using manually designed features or shallow learning techniques (such as Local Binary Patterns (LBP), Nonnegative Matrix Factorization (NMF), sparse learning, and the like).
However, as the processing capacity of GPUs (Graphics Processing Units) and other chips has greatly improved and deep neural network technology has developed, an initial model can be trained with large-scale facial expression sample data using deep neural networks and deep learning methods to obtain an expression recognition model, and the efficiency and accuracy of expression recognition can be greatly improved with such a model.
However, when facial expression sample data is collected, facial expression data collected under ideal laboratory conditions is generally used. The amount of facial expression data that can be collected in real environments is larger, which makes the training of the expression recognition model more sufficient; however, labeling large amounts of facial expression data is difficult. Different annotators label the data separately, and their quality varies, so the labels often contain errors, which produces noise samples. At the same time, real environments are complex and changeable, and the facial expression data contain many hard-to-distinguish difficult samples. When an initial model is trained with both noise samples and difficult samples, it struggles to distinguish between them, so the trained expression recognition model performs poorly and the accuracy of the expression recognition results it determines is low.
On this basis, in the present application the consistency index indicates the degree of consistency of the expression categories recognized by the initial model for different sample video frame sequences in the sample face video, and thus accurately indicates how difficult the sample face video is; the loss value indicates the accuracy of the initial model in performing expression recognition on the sample face video, and together the loss value and the consistency index accurately indicate whether the sample face video is a noise sample. The adjustment parameter determined from the consistency index and the loss value therefore accurately indicates the learning tendency of the initial model on the sample face video, and the target loss value obtained after applying the adjustment parameter accurately indicates how strongly the initial model should learn from the sample face video. This resolves the confusion between noise samples and difficult-sample mining in dynamic expression recognition, improves the recognition performance of the expression recognition model trained with the target loss value, and further improves the accuracy of the expression recognition result determined for the face video to be recognized by the expression recognition model.
As shown in fig. 1, an application scenario to which the embodiment of the present application is applicable includes a terminal 20 and a server 10, where the terminal 20 and the server 10 are connected through a wired or wireless network. The terminal 20 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart home appliance, a vehicle-mounted terminal, an aircraft, a wearable device, a virtual reality device, or another terminal device capable of page presentation, or another application capable of invoking a page presentation application (e.g., an instant messaging application, a shopping application, a search application, a game application, a forum application, a map or traffic application, etc.).
The server 10 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms.
The terminal 20 may send the face video to be identified to the server 10, the server 10 may identify the face video to be identified according to the expression identification model to obtain an expression identification result of the face video to be identified, and then the server 10 returns the expression identification result of the face video to be identified to the terminal 20.
The server 10 may train the initial model according to the sample face video to obtain a trained expression recognition model. The expression recognition model is obtained by training an initial model with a target loss value corresponding to the sample face video; the target loss value is obtained by adjusting the loss value of the sample face video with an adjustment parameter corresponding to the sample face video; the loss value indicates the accuracy of the initial model in performing expression recognition on the sample face video; the adjustment parameter is determined from the loss value and a consistency index corresponding to the sample face video; and the consistency index indicates the degree of consistency of the expression categories recognized by the initial model for different sample video frame sequences in the sample face video.
The expression category may refer to an expression presented by a face; for example, expression categories may include happiness, sadness, upset, surprise, and the like.
In another embodiment, the terminal 20 may identify the face video to be identified according to the expression identification model, so as to obtain an expression identification result of the face video to be identified.
It may be appreciated that the server 10 may store the trained expression recognition model in a cloud storage system, and the terminal 20 may obtain the expression recognition model from the cloud storage system; after obtaining the model, the terminal 20 recognizes the face video to be recognized according to the expression recognition model to obtain the expression recognition result of the face video to be recognized.
For convenience of description, in the following embodiments, description will be made in terms of an example in which expression recognition is performed by an electronic device.
Referring to fig. 2, fig. 2 is a flowchart illustrating a training process of an expression recognition model according to an embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
s110, acquiring sample face videos, and determining a plurality of sample video frame sequences according to the sample face videos.
The sample face video refers to a video that contains a dynamic facial expression and is used to train the expression recognition model; for example, the sample face video may be a video of a person crying, or a video of a person laughing.
In general, to keep the amount of data in a single sample face video from being too large, its length should not be too long; for example, a sample face video may last 3 s to 5 s. If a captured video containing dynamic facial expressions is long, it can be cut into several shorter videos, each of which is used as a sample face video. For example, a 20 s video containing dynamic facial expressions can be uniformly cut into five 4 s videos, each of which is taken as one sample face video.
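A sketch of this segmentation step, representing the video as a list of frames; the 24 fps frame rate matches the 1 s / 24-frame example given later, and the helper name is illustrative:

```python
# Sketch: uniformly cut a long face video (given here as a list of frames at a
# known frame rate) into fixed-length sample face videos.
def split_into_samples(frames: list, fps: int = 24, clip_seconds: int = 4) -> list[list]:
    clip_len = fps * clip_seconds
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

samples = split_into_samples(list(range(20 * 24)))  # a 20 s video at 24 fps
print(len(samples))  # 5 sample face videos of 4 s each
```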
After the sample face video is obtained, annotators can recognize the dynamic facial expression in the sample face video to determine its expression category, and determine the sample label of the sample face video according to that expression category.
In some embodiments, for each expression category, if the dynamic facial expression in the sample face video belongs to that expression category, the category may be assigned the value 1; if the dynamic facial expression does not belong to that category, the category may be assigned the value 0. All expression categories are traversed to obtain the assignment of each expression category for the sample face video, and these assignments are collected to obtain the sample label of the sample face video, which can be expressed in vector form.
For example, the sample label of the sample face video a2 is the vector b2 = (1, 0, 0) and the sample label of the sample face video a3 is the vector b3 = (0, 1, 0). The 1 in b2 indicates that the dynamic facial expression in the sample face video belongs to happiness, the first 0 from left to right in b2 indicates that it does not belong to sadness, and the second 0 indicates that it does not belong to anger; the first 0 from left to right in b3 indicates that the dynamic facial expression does not belong to happiness, the 1 in b3 indicates that it belongs to sadness, and the second 0 indicates that it does not belong to anger.
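A sketch of this label encoding for the three-category example (happiness, sadness, anger); the class ordering follows the example above:

```python
EXPRESSION_CLASSES = ["happiness", "sadness", "anger"]  # ordering follows the example above

def sample_label(expression: str) -> list[int]:
    # Assign 1 to the category the face video belongs to and 0 to every other category.
    return [1 if cls == expression else 0 for cls in EXPRESSION_CLASSES]

print(sample_label("happiness"))  # [1, 0, 0]  -> vector b2
print(sample_label("sadness"))    # [0, 1, 0]  -> vector b3
```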
In this embodiment, the sample face video may be divided into a plurality of segments, and the sequence formed by the video frames of each segment is used as one sample video frame sequence; the sample face video may be divided uniformly or non-uniformly to obtain the plurality of sample video frame sequences. For example, a 4 s sample face video is uniformly divided into four 1 s videos, and the 24 video frames in each 1 s video are taken as one sample video frame sequence.
In this embodiment, a plurality of non-overlapping video clips may also be cut from the sample face video, and the video frames of each clip form one sample video frame sequence. For example, the clip from 0 s to 0.5 s is cut from a 4 s sample face video and its video frames form one sample video frame sequence, the clip from 1 s to 1.5 s is cut and its video frames form another sample video frame sequence, and the clip from 3 s to 3.5 s is cut and its video frames form a third sample video frame sequence, giving 3 sample video frame sequences.
In some embodiments, a plurality of sample input frames may also be sampled from the sample face video, and a plurality of sample video frame sequences may be determined from the plurality of sample input frames. Specifically, the sample input frames may be uniformly sampled from the sample face video and divided into groups according to their order in the sample face video, with the sample input frames in each group forming one sample video frame sequence.
For example, the sample face video includes 200 video frames; 50 video frames are sampled from it at intervals of 4 frames and used as sample input frames. According to their order in the sample face video, the 50 sample input frames are uniformly divided into 10 groups of 5 frames each, and the 5 sample input frames in each group form one sample video frame sequence; for example, the sample input frames in the first sample video frame sequence are the first, sixth, twelfth, eighteenth and twenty-fourth video frames of the sample face video.
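A sketch of the uniform sampling and grouping described above, using the 200-frame example; the exact sampling offset is an assumption (the resulting frame numbers therefore differ slightly from those listed in the prose), and the helper name is illustrative:

```python
def sample_and_group(num_frames: int, num_inputs: int, group_size: int):
    """Uniformly sample `num_inputs` frame numbers from a video of `num_frames`
    frames, then split them in order into groups of `group_size`; each group is
    one sample video frame sequence (1-based frame numbers)."""
    stride = num_frames // num_inputs
    sampled = [i * stride + 1 for i in range(num_inputs)]  # 1, 5, 9, ...
    return [sampled[i:i + group_size] for i in range(0, num_inputs, group_size)]

sequences = sample_and_group(200, 50, 5)
print(len(sequences), sequences[0])  # 10 groups; first sequence: [1, 5, 9, 13, 17]
```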
Optionally, determining the plurality of sample video frame sequences from the plurality of sample input frames may include: extracting characteristics of a plurality of sample input frames through a backbone network to obtain sample input characteristics; determining a sample position parameter according to sample input characteristics, wherein the sample position parameter is used for indicating the position of a key sample frame in a sample face video, and the key sample frame is a video frame presenting a key sample expression; sampling in a sample face video according to the sample position parameters to obtain a plurality of target sample face video frames; and constructing a plurality of sample video frame sequences according to the plurality of target sample face video frames.
The backbone network may be a 2D convolutional neural network, a 3D convolutional neural network, or a vision Transformer network for video recognition. The plurality of sample input frames can be input into the backbone network for feature extraction, and the backbone network outputs the sample input features; the sample input features can then be input into a fully connected layer, which processes them to obtain the sample position parameters, and the sample position parameters may include a sample center position and a sample step size.
The key sample expression refers to the expression category with the highest frequency of occurrence in the sample face video, or the expression category with the highest probability over the plurality of sample input frames. The sample position parameter indicates the position, among the plurality of sample input frames, of the key sample frames presenting the key sample expression, which makes it convenient to determine the key sample frames from the plurality of sample input frames according to the sample position parameter and use them as target sample face video frames for training the initial model, thereby improving the training effect of the initial model.
Meanwhile, the key sample expression is in fact the expression category predicted from the plurality of sample input frames through the backbone network and the fully connected layer, and the sample position parameter is likewise predicted from the plurality of sample input frames through the backbone network and the fully connected layer; neither the key sample expression nor the sample position parameter needs to be determined manually, which greatly improves the efficiency of obtaining the sample position parameter of the key sample expression.
After the sample position parameters are obtained, a plurality of video frames in the sample face video are resampled according to the sample position parameters; that is, key sample frames presenting the key sample expression are sampled from the plurality of sample input frames based on the sample position parameters, and each resampled key sample frame serves as a target sample face video frame. The process of resampling a plurality of target sample face video frames from the sample face video according to the sample position parameters may include: determining a central key sample frame according to the sample center position; with the central key sample frame as the center, sampling the sample face video according to the sample step size to obtain a plurality of sample sampling video frames; and combining the plurality of sample sampling video frames with the central key sample frame to obtain the plurality of target sample face video frames.
The sample center position is used for indicating the position of the key sample frame at the center position in the sample face video, so that the product of the total number of input video frames and the sample center position can be used as the frame sequence number of the key sample frame at the center position in the sample input video frame sequence, and the center key sample frame can be determined. And then taking the central key sample frame as a center, and respectively sampling video frames positioned at two sides of the central key sample frame in the sample face video according to the sample step length to obtain a plurality of sample sampling video frames. The sample step size may refer to the difference in frame numbers of two adjacent video frames that are resampled.
For example, the number of video frames in the sample face video is 64, the sample center position is 0.5 and the sample step size is 4, so the frame number of the central key sample frame in the sample face video is 32, and the 32nd video frame of the sample face video is determined as the central key sample frame. The frame numbers of the sample sampling video frames in the sample face video are then 4, 8, …, 28, 36, 40, …, 60 and 64, and combining the central key sample frame with the plurality of sample sampling video frames yields a plurality of target sample face video frames whose frame numbers in the sample face video are 4, 8, …, 28, 32, 36, 40, …, 60 and 64.
After obtaining the plurality of target sample face video frames, the plurality of target sample face video frames can be divided according to the arrangement sequence of the plurality of target sample face video frames in the sample face video, so as to obtain a plurality of sample video frame sequences. The method comprises the steps of uniformly dividing a plurality of target sample face video frames to obtain a plurality of groups, and arranging the plurality of target sample face video frames in each group according to the arrangement sequence of the target sample face video frames in the sample face video to obtain a sample video frame sequence.
For example, the frame numbers of the plurality of target sample face video frames in the sample face video are 4, 8, …, 28, 32, 36, 40, …, 60 and 64, the plurality of target sample face video frames are 16, the plurality of target sample face video frames are divided into 4 groups according to the arrangement order of the plurality of target sample face video frames in the sample face video, each group includes 4 target sample face video frames, the 4 target sample face video frames of each group serve as one sample video frame sequence, the four target sample face video frames of the frame numbers 4, 8, 12 and 16 are arranged in sequence to obtain one sample video frame sequence, the four target sample face video frames of the frame numbers 20, 24, 28 and 32 are arranged in sequence to obtain one sample video frame sequence, the four target sample face video frames of the frame numbers 36, 40, 44 and 48 are arranged in sequence to obtain one sample video frame sequence, and the four target sample face video frames of the frame numbers 52, 56, 60 and 64 are arranged in sequence to obtain one sample video frame sequence.
S120, determining the probability that each sample video frame sequence belongs to different expression categories through an initial model.
Each sample video frame sequence can be input into the initial model, which extracts features from the sample video frame sequence and predicts, from the extracted features, the probability that the sample video frame sequence belongs to each expression category. For example, if the expression categories to be recognized by the initial model are set to happiness, sadness and anger, the sample video frame sequence is input into the initial model, and the initial model predicts and outputs that the sample video frame sequence belongs to happiness with probability p1, to sadness with probability p2, and to anger with probability p3.
The initial model refers to a parameter-initialized model used for training to obtain the expression recognition model. The initial model may include an initial feature extraction network for extracting features from the input sample video frame sequence and an initial classification layer for classifying the features output by the initial feature extraction network to determine the probability that the input sample video frame sequence belongs to each expression category. It should be noted that after training of the initial model is finished, the initial feature extraction network in the initial model correspondingly serves, in the application stage, as the feature extraction network for extracting features from the input video frame sequences, and likewise the initial classification layer serves as the classification layer for expression classification.
The initial feature extraction network may be constructed by a convolutional neural network, a long-short-term memory neural network, or the like, and is not particularly limited herein.
In some embodiments, the target sample feature of each sample video frame sequence may be determined through an initial feature extraction network in the initial model, and then the target sample feature of each sample video frame sequence is fully connected through an initial classification layer, so as to determine the probability that each sample video frame sequence belongs to different expression categories.
In some embodiments, the initial feature extraction network may include an initial base feature extraction network, an initial first feature extraction network, and an initial second feature extraction network, and the initial classification layer may be a parameter-initialized fully connected layer. The initial base feature extraction network may be a parameter-initialized ResNet-50 network, and the initial first feature extraction network and the initial second feature extraction network may each be a parameter-initialized BiLSTM (Bidirectional Long Short-Term Memory) network.
In the case where the initial feature extraction network includes an initial base feature extraction network, an initial first feature extraction network, and an initial second feature extraction network, determining the target sample feature of each sample video frame sequence may include: constructing a plurality of auxiliary sample video frame sequences from the plurality of sample video frame sequences, where different target sample face video frames in an auxiliary sample video frame sequence belong to different sample video frame sequences; extracting the features of each sample video frame sequence through the initial base feature extraction network as the initial sample features corresponding to that sample video frame sequence; extracting the features of each auxiliary sample video frame sequence through the initial base feature extraction network as the initial sample features corresponding to that auxiliary sample video frame sequence; performing motion coding on the initial sample features corresponding to each sample video frame sequence through the initial first feature extraction network to obtain the motion coding features corresponding to each sample video frame sequence; performing expression coding through the initial second feature extraction network based on the motion coding features corresponding to each sample video frame sequence to obtain the intermediate features corresponding to each sample video frame sequence; performing expression coding through the initial second feature extraction network based on the initial sample features corresponding to each auxiliary sample video frame sequence to obtain the expression coding features corresponding to each auxiliary sample video frame sequence; performing motion coding on the expression coding features corresponding to each auxiliary sample video frame sequence through the initial first feature extraction network to obtain the intermediate features corresponding to each auxiliary sample video frame sequence; and determining the target sample feature of each sample video frame sequence according to the intermediate features corresponding to the plurality of sample video frame sequences and the intermediate features corresponding to the plurality of auxiliary sample video frame sequences.
For each sample video frame sequence, each target sample face video frame in the sample video frame sequence is arranged according to the arrangement sequence of each target sample face video frame in the sample face video, and no target sample face video frame in other sample video frame sequences exists between any adjacent target sample face video frames in the sample video frame sequence. For example, four target sample face video frames with frame numbers 4, 8, 12, and 16 are sequentially arranged to obtain a sample video frame sequence L1, four target sample face video frames with frame numbers 20, 24, 28, and 32 are sequentially arranged to obtain a sample video frame sequence L2, four target sample face video frames with frame numbers 36, 40, 44, and 48 are sequentially arranged to obtain a sample video frame sequence L3, and four target sample face video frames with frame numbers 52, 56, 60, and 64 are sequentially arranged to obtain a sample video frame sequence L4.
A plurality of auxiliary sample video frame sequences can be constructed from the plurality of sample video frame sequences, with different target sample face video frames in an auxiliary sample video frame sequence belonging to different sample video frame sequences. An auxiliary sample video frame sequence can be constructed from the target sample face video frames having the same sequence number (the sequence number of the target sample face video frame within its sample video frame sequence) in each sample video frame sequence, or by randomly selecting one target sample face video frame from each sample video frame sequence. For example, an auxiliary sample video frame sequence may be constructed from the first target sample face video frame of each sample video frame sequence; as another example, the target sample face video frame with sequence number 1 in the 1st sample video frame sequence, the one with sequence number 2 in the 2nd sample video frame sequence, the one with sequence number 3 in the 3rd sample video frame sequence, and the one with sequence number 4 in the 4th sample video frame sequence may together form an auxiliary sample video frame sequence.
For example, as shown in fig. 3, four target sample face video frames with frame numbers 4, 8, 12 and 16 are sequentially arranged to obtain a sample video frame sequence L1, four target sample face video frames with frame numbers 20, 24, 28 and 32 are sequentially arranged to obtain a sample video frame sequence L2, four target sample face video frames with frame numbers 36, 40, 44 and 48 are sequentially arranged to obtain a sample video frame sequence L3, and four target sample face video frames with frame numbers 52, 56, 60 and 64 are sequentially arranged to obtain a sample video frame sequence L4. The target sample face video frames with frame numbers 4, 20, 36 and 52 are used to construct an auxiliary sample video frame sequence GL1 corresponding to the sample video frame sequence L1, the target sample face video frames with frame numbers 8, 24, 40 and 56 are used to construct an auxiliary sample video frame sequence GL2 corresponding to the sample video frame sequence L2, the target sample face video frames with frame numbers 12, 28, 44 and 60 are used to construct an auxiliary sample video frame sequence GL3 corresponding to the sample video frame sequence L3, and the target sample face video frames with frame numbers 16, 32, 48 and 64 are used to construct an auxiliary sample video frame sequence GL4 corresponding to the sample video frame sequence L4.
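The construction of the auxiliary sample video frame sequences in the above example amounts to regrouping the frames by their position within each sample video frame sequence. The following is a minimal illustrative sketch in Python, assuming the target sample face video frames are identified only by their frame numbers; the function name and data layout are not part of the original scheme.

def build_auxiliary_sequences(sample_sequences):
    # sample_sequences: list of equally long lists of frame numbers (e.g. L1..L4).
    num_positions = len(sample_sequences[0])
    # Position i of every sample sequence forms the i-th auxiliary sequence.
    return [[seq[i] for seq in sample_sequences] for i in range(num_positions)]

L1 = [4, 8, 12, 16]
L2 = [20, 24, 28, 32]
L3 = [36, 40, 44, 48]
L4 = [52, 56, 60, 64]
GL1, GL2, GL3, GL4 = build_auxiliary_sequences([L1, L2, L3, L4])
print(GL1)  # [4, 20, 36, 52]
print(GL4)  # [16, 32, 48, 64]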
And inputting the sample video frame sequence into an initial basic feature extraction network to obtain initial sample features corresponding to the sample video frame sequence, and inputting the auxiliary sample video frame sequence into the initial basic feature extraction network to obtain initial sample features corresponding to the auxiliary sample video frame sequence.
Then, motion coding is performed on the initial sample features corresponding to the sample video frame sequence through the initial first feature extraction network, and expression coding is performed on the motion coding result through the initial second feature extraction network, so as to obtain the intermediate features corresponding to the sample video frame sequence; meanwhile, expression coding is performed on the initial sample features corresponding to the auxiliary sample video frame sequence through the initial second feature extraction network, and motion coding is then performed on the expression coding features through the initial first feature extraction network, so as to obtain the intermediate features corresponding to the auxiliary sample video frame sequence. All the sample video frame sequences and the auxiliary sample video frame sequences are traversed to obtain the intermediate features corresponding to the plurality of sample video frame sequences and the intermediate features corresponding to the plurality of auxiliary sample video frame sequences, and the target sample feature corresponding to each sample video frame sequence is determined according to these intermediate features.
The initial sample characteristics corresponding to the sample video frame sequence comprise characteristics of each target sample face video frame in the sample video frame sequence, the characteristics of each target sample face video frame in the sample video frame sequence can be input into an initial first characteristic extraction network for performing motion coding to obtain motion coding results of each target sample face video frame in the sample video frame sequence, the motion coding results of each of a plurality of target sample face video frames in the sample video frame sequence are averaged to obtain motion coding characteristics corresponding to the sample video frame sequence, all sample video frame sequences are traversed to obtain motion coding characteristics corresponding to each of a plurality of sample video frame sequences, then the motion coding characteristics corresponding to the sample video frame sequences are input into an initial second characteristic extraction network, and the motion coding characteristics corresponding to the sample video frame sequences are subjected to expression coding by the initial second characteristic extraction network to obtain intermediate characteristics corresponding to the sample video frame sequences. When the initial second feature extraction network encodes the motion coding features corresponding to each sample video frame sequence, the motion coding features corresponding to other sample video frame sequences are referred to, that is, a plurality of motion coding features corresponding to a plurality of sample video frame sequences are input into the initial second feature extraction network at the same time, so as to obtain intermediate features corresponding to the sample video frame sequences.
Similarly, the initial sample features corresponding to the auxiliary sample video frame sequence include respective features of each target sample face video frame in the auxiliary sample video frame sequence, the respective features of each target sample face video frame in the auxiliary sample video frame sequence can be input into an initial second feature extraction network to perform expression coding, respective expression coding results of each target sample face video frame in the auxiliary sample video frame sequence are obtained, respective expression coding results of a plurality of target sample face video frames in the auxiliary sample video frame sequence are averaged to obtain expression coding features corresponding to the auxiliary sample video frame sequence, all the auxiliary sample video frame sequences are traversed to obtain respective expression coding features of a plurality of auxiliary sample video frame sequences, then the respective expression coding features of the plurality of auxiliary sample video frame sequences are input into an initial first feature extraction network, and the initial first feature extraction network performs motion coding on the respective expression coding features of the plurality of auxiliary sample video frame sequences to obtain respective intermediate features of the plurality of auxiliary sample video frame sequences. When the initial second feature extraction network performs expression coding on the respective features of each target sample face video frame in the auxiliary sample video frame sequence, the respective features of the other target sample face video frames in the auxiliary sample video frame sequence are referred to, that is, the features of a plurality of target sample face video frames in each auxiliary sample video frame sequence are input into the initial second feature extraction network at the same time, so that expression coding results corresponding to the target sample face video frames are obtained.
Obtaining intermediate features corresponding to the plurality of sample video frame sequences and intermediate features corresponding to the plurality of auxiliary sample video frame sequences; the intermediate features corresponding to the sample video frame sequences and the intermediate features corresponding to the auxiliary sample video frame sequences can be spliced to obtain spliced features, and the spliced features serve as target sample features corresponding to the sample video frame sequences participating in splicing.
As an embodiment, the intermediate feature corresponding to any one sample video frame sequence and the intermediate feature corresponding to any one auxiliary sample video frame sequence may be spliced.
For example, the sample video frame sequences include s1, s2, s3 and s4, each sample video frame sequence includes 4 target sample face video frames, an auxiliary sample video frame sequence s11 is constructed according to the 1 st target sample face video frame of each sample video frame sequence, an auxiliary sample video frame sequence s21 is constructed according to the 2 nd target sample face video frame of each sample video frame sequence, an auxiliary sample video frame sequence s31 is constructed according to the 3 rd target sample face video frame of each sample video frame sequence, an auxiliary sample video frame sequence s41 is constructed according to the 4 th target sample face video frame of each sample video frame sequence, at this time, the intermediate feature of s1 and the intermediate feature of s11 may be spliced to obtain the target sample feature corresponding to s1, or the intermediate feature of s1 may be spliced with the intermediate feature of s31 to obtain the target sample feature corresponding to s 1.
In the present application, different video frames within a short time period can reflect tiny facial movements, while different video frames over a long time period can reflect the dynamic change of emotion. These two temporal correlations have different characteristics: facial motion within a short time period mainly consists of pixel-level changes that convey the subject's expression, whereas the emotion change over a long time period involves high-level semantic changes. Therefore, two sequence models (the initial first feature extraction network and the initial second feature extraction network) are used to model and decouple the two temporal relations independently, so that better expression features are obtained, the resulting target sample features contain more information, and the accuracy of expression classification based on the target sample features is further ensured.
As an embodiment, determining the target sample feature of each sample video frame sequence from the intermediate features corresponding to the plurality of sample video frame sequences and the intermediate features corresponding to the plurality of auxiliary sample video frame sequences may include: splicing the intermediate features corresponding to the plurality of sample video frame sequences and the intermediate features corresponding to the plurality of auxiliary sample video frame sequences to obtain an intermediate splicing result corresponding to each sample video frame sequence; acquiring a global sample abstract of the sample face video; and splicing the global sample abstract with the intermediate splicing result corresponding to each sample video frame sequence respectively to obtain the target sample feature corresponding to each sample video frame sequence. The intermediate feature corresponding to any one sample video frame sequence and the intermediate feature corresponding to any one auxiliary sample video frame sequence can be spliced to obtain a spliced feature, and the spliced feature is used as the intermediate splicing result corresponding to the sample video frame sequence participating in the splicing.
The feature extraction result is obtained by performing feature extraction on a plurality of sample input frames through the backbone network; the feature extraction result is input into a summary extraction network for summary extraction to obtain the global sample abstract, and the global sample abstract is then spliced with the intermediate splicing result corresponding to each sample video frame sequence to obtain the target sample feature corresponding to each sample video frame sequence.
The abstract extraction network can comprise a full connection layer; the feature extraction result is input into the full connection layer for processing to obtain a processing result, and the processing result is then activated to obtain the global sample abstract. The processing procedure of the abstract extraction network may be expressed as S1 = δ(FC(F)), where F is the feature extraction result, FC(·) denotes the full connection layer, δ(·) is the activation function used for the activation processing, and S1 is the global sample abstract.
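A minimal sketch of such an abstract extraction network is shown below, assuming the feature extraction result has already been pooled into a single vector and assuming a sigmoid activation and illustrative dimensions; none of these choices are mandated by the scheme itself.

import torch
import torch.nn as nn

class SummaryExtractor(nn.Module):
    # Global sample abstract: activation applied to a full connection layer.
    def __init__(self, feat_dim=512, summary_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, summary_dim)   # full connection layer FC
        self.act = nn.Sigmoid()                      # activation function (assumed)

    def forward(self, feat):                         # feat: (batch, feat_dim)
        return self.act(self.fc(feat))               # S1: (batch, summary_dim)

s1 = SummaryExtractor()(torch.randn(1, 512))         # global sample abstract S1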
After the target sample characteristics are obtained, the target sample characteristics of the sample video frame sequence can be subjected to full connection processing through an initial classification layer, so that the probability that the sample video frame sequence belongs to different expression categories is obtained. And traversing all sample video frame sequences according to the process to obtain the probability that each sample video frame sequence belongs to different expression categories.
In some implementations, the initial model includes an initial basic feature extraction network, an initial first feature extraction network, an initial second feature extraction network, and an initial classification layer, and part of the data processing in the initial model can be expressed as follows: f_L = B(L) and f_GL = B(GL), where L is a sample video frame sequence, GL is an auxiliary sample video frame sequence, B(·) is the processing of the initial basic feature extraction network, f_L is the initial sample feature corresponding to the sample video frame sequence, and f_GL is the initial sample feature corresponding to the auxiliary sample video frame sequence; h_L = E(M(f_L)) and h_GL = M(E(f_GL)), where M(·) is the processing of the initial first feature extraction network, E(·) is the processing of the initial second feature extraction network, h_L is the intermediate feature corresponding to the sample video frame sequence, and h_GL is the intermediate feature corresponding to the auxiliary sample video frame sequence; and cat(h_L, h_GL) is the splicing of the intermediate feature corresponding to the sample video frame sequence and the intermediate feature corresponding to the auxiliary sample video frame sequence. After cat(h_L, h_GL) is obtained, it is spliced with the global sample abstract to obtain the target sample feature corresponding to the sample video frame sequence, and the target sample feature corresponding to the sample video frame sequence is processed by the initial classification layer to obtain the probability that the sample video frame sequence belongs to different expression categories.
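A hedged pseudocode sketch of this data flow is given below in Python (the pseudocode referred to above is not reproduced in this text); the module names B, M, E, the per-frame averaging inside each path, and the tensor shapes are illustrative assumptions, and for brevity the sketch processes one sample sequence together with one auxiliary sequence rather than all sequences jointly.

import torch

def forward_one_pair(B, M, E, classifier, L_seq, GL_seq, global_summary):
    # Initial sample features of the sample / auxiliary sequence (one row per frame).
    f_L = B(L_seq)                                   # (num_frames, feat_dim)
    f_GL = B(GL_seq)                                 # (num_frames, feat_dim)

    # Sample-sequence path: motion coding per frame, averaging, then expression coding.
    h_L = E(M(f_L).mean(dim=0, keepdim=True))        # intermediate feature of L

    # Auxiliary-sequence path: expression coding per frame, averaging, then motion coding.
    h_GL = M(E(f_GL).mean(dim=0, keepdim=True))      # intermediate feature of GL

    # Splice the two intermediate features with the global sample abstract
    # to obtain the target sample feature, then classify.
    target_feature = torch.cat([h_L, h_GL, global_summary], dim=-1)
    return classifier(target_feature).softmax(dim=-1)  # per-class probabilities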
S130, determining a first expression category to which each sample video frame sequence belongs and a second expression category to which the sample face video belongs according to the probability that each sample video frame sequence belongs to different expression categories.
For each sample video frame sequence, determining one expression category to which the sample video frame sequence most possibly belongs as a first expression category to which the sample video frame sequence belongs according to the probability that the sample video frame sequence belongs to different expression categories. The expression category with the highest probability in the probabilities of the sample video frame sequence belonging to different expression categories can be used as the first expression category to which the sample video frame sequence belongs.
According to the probability that the plurality of sample video frame sequences belong to different expression categories, one expression category to which the sample face video most possibly belongs can be determined and used as a second expression category to which the sample face video belongs. For each expression category, calculating the average value of probabilities that a plurality of sample video frame sequences belong to the same expression category as the probability that a sample face video belongs to the expression category; and taking the expression category with the highest probability in the probabilities that the sample face video belongs to different expression categories as a second expression category to which the sample face video belongs.
For example, the sample face video includes a sample video frame sequence c1 and a sample video frame sequence c2, where c1 belongs to happy with probability pc11, to sad with probability pc12 and to anger with probability pc13, and c2 belongs to happy with probability pc21, to sad with probability pc22 and to anger with probability pc23; the average value of pc11 and pc21 is calculated as the probability py1 that the sample face video belongs to happy, the average value of pc12 and pc22 is calculated as the probability py2 that the sample face video belongs to sad, the average value of pc13 and pc23 is calculated as the probability py3 that the sample face video belongs to anger, and the expression category with the highest probability value among py1, py2 and py3 is taken as the second expression category to which the sample face video belongs.
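The averaging and selection above can be sketched as follows in Python; the probability values and class names are purely illustrative.

import numpy as np

def second_expression_category(seq_probs, class_names):
    # seq_probs: (num_sequences, num_classes) probabilities per sample video frame sequence.
    video_probs = np.mean(seq_probs, axis=0)          # probability per expression category
    return class_names[int(np.argmax(video_probs))], video_probs

probs = np.array([[0.7, 0.2, 0.1],                    # c1: happy / sad / anger
                  [0.5, 0.3, 0.2]])                   # c2: happy / sad / anger
label, video_probs = second_expression_category(probs, ["happy", "sad", "anger"])
print(label, video_probs)                             # happy [0.6  0.25 0.15]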
And S140, determining a consistency index corresponding to the sample face video according to the duty ratio of each first expression category in the first expression categories to which the plurality of sample video frame sequences belong.
The method comprises the steps of determining the duty ratio of each first expression category according to the first expression category to which a plurality of sample video frame sequences belong, and determining the consistency index corresponding to the sample face video according to the duty ratio of each first expression category. The consistency index is used for indicating the consistency degree of the expression category identified by the initial model aiming at different sample video frame sequences in the sample face video.
In some embodiments, the first expression category with the highest duty ratio may be determined according to the duty ratio of each first expression category among the first expression categories to which the plurality of sample video frame sequences belong; and the duty ratio of the first expression category with the highest duty ratio is taken as the consistency index corresponding to the sample face video.
For example, the sample face video includes a sample video frame sequence c1, a sample video frame sequence c2, a sample video frame sequence c3, and a sample video frame sequence c4, where the first expression category to which c1 belongs is happy, the first expression category to which c2 belongs is happy, the first expression category to which c3 belongs is sad, and the first expression category to which c4 belongs is anger; at this time, happy is determined to be the first expression category with the highest duty ratio, and the duty ratio of happy, 2/4 = 0.5, is taken as the consistency index of the sample face video.
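A minimal sketch of this consistency index computation (the share of the most common first expression category) is shown below; the category labels follow the example above.

from collections import Counter

def consistency_index(first_categories):
    # Duty ratio of the most frequent first expression category.
    counts = Counter(first_categories)
    return counts.most_common(1)[0][1] / len(first_categories)

print(consistency_index(["happy", "happy", "sad", "anger"]))  # 0.5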
In addition, expression recognition can be performed on the sample face video through a plurality of different recognition models to obtain, for each recognition model, the probability that the sample face video belongs to different expression categories under that recognition model; for each recognition model, the target expression category of the recognition model for the sample face video (the expression category with the highest probability among the probabilities that the sample face video belongs to different expression categories under that recognition model) is determined, and the highest duty ratio among the duty ratios of the target expression categories across the recognition models is determined as the consistency index of the sample face video.
For example, the recognition models having a certain recognition capability are sbmx1, sbmx2, sbmx3 and sbmx4; the sample face video is determined to be happy by sbmx1, happy by sbmx2, sad by sbmx3, and happy by sbmx4, and at this time the consistency index of the sample face video is determined to be 3/4.
And S150, determining a loss value of the sample face video according to a second expression category to which the sample face video belongs.
The loss value of the sample face video can be determined according to the probability of the second expression category to which the sample face video belongs. The loss value is used for indicating the accuracy of the initial model for carrying out expression recognition on the sample face video.
For example, a cross entropy loss value or a mean square error may be calculated as a loss value of the sample face video according to a probability of the second expression class to which the sample face video belongs and a sample tag of the sample face video.
S160, determining an adjustment parameter according to the loss value and the consistency index; and adjusting the loss value according to the adjustment parameter to obtain a target loss value corresponding to the sample face video.
The loss value and the consistency index of the sample face video are obtained, and the adjustment parameters can be determined according to the comparison result of the loss value and the loss value index and the comparison result of the consistency index and the index threshold. The loss value index and the index threshold value may be set based on requirements, which is not limited by the present application.
In some embodiments, if the consistency index is below the index threshold, obtaining a first value as the adjustment parameter; if the consistency index is not lower than the index threshold and the loss value is not higher than the loss value index, acquiring a second value as an adjustment parameter; the first value is higher than the second value; if the consistency index is not lower than the index threshold and the loss value is higher than the loss value index, acquiring a third numerical value as an adjustment parameter; the second value is higher than the third value. Specific values of the first value, the second value and the third value are not limited, and only the first value, the second value and the third value are required to be sequentially reduced.
It will be appreciated that the expression categories presented by a plurality of sample video frame sequences extracted from a sample face video with a short duration are similar, and generally the expression categories presented by the plurality of sample video frame sequences are considered to be the same, so that, in theory, the initial model should have a high consistency index for the expression categories identified for the plurality of sample video frame sequences in the sample face video. In practice, however, due to light, image quality, or occlusion in the sample face video, the initial model may find it hard to recognize the expression in the sample face video; such a sample face video can be regarded as a hard sample (also referred to as a difficult sample). It is also possible in practice that the expression category labeling of the sample face video is wrong, in which case the sample face video is regarded as a noise sample. In the present application, the expression recognition results for a plurality of sample video frame sequences derived from the same noise sample appear to be highly consistent (i.e., the consistency index is high) but have a significant loss, i.e., a high loss value.
Based on this principle, in the present application, noise samples and difficult samples can be identified from the consistency index and the loss value. Thereafter, in order to enhance the expression recognition ability of the initial model to the difficult sample, it is necessary to strengthen the learning of the difficult sample, and in order to avoid the influence of the noise sample on the initial model, it is necessary to weaken the learning of the noise sample by the initial model.
In the application, if the consistency index is not lower than the index threshold and the loss value is not higher than the loss value index, the consistency of the sample face video is determined to be better and the loss value is lower, the sample face video is a conventional sample, and a second numerical value is obtained as an adjustment parameter.
If the consistency index is lower than the index threshold, determining that the consistency of the sample face video is poor, at the moment, considering the sample face video as a difficult sample which is difficult to identify by the initial model, and determining a first numerical value which is higher than the adjustment parameter of the conventional sample as the adjustment parameter of the sample video, so that the adjusted target loss value of the difficult sample is higher than the adjusted target loss value corresponding to the conventional sample, and enhancing the learning ability of the initial model on the difficult sample.
A difficult sample, also called a hard sample, is a sample that is difficult to classify correctly. This difficulty is usually caused by the environment, the lighting and the viewing angle, and the complex confusion between expressions and facial actions in dynamic facial expression recognition also makes such samples hard to classify correctly; therefore, the present application needs to strengthen the learning of the model on such samples.
If the consistency index is not lower than the index threshold and the loss value is higher than the loss value index, it is determined that the consistency of the sample face video is good but the loss value is high, indicating that the sample face video is a noise sample with a wrong sample label; a third value lower than the adjustment parameter of the conventional sample is then determined as the adjustment parameter of the sample face video, so that the adjusted target loss value of the noise sample is lower than the adjusted target loss value corresponding to the conventional sample, thereby weakening the learning of the initial model on the noise sample.
Noise samples refer to samples whose labels are wrong, such as a sample that is actually happy is labeled sad, which may cause confusion during model training and reduce performance, and thus, there is a need for reducing model learning for such samples in the present application.
In some embodiments, the first value is typically greater than 1, for example, the first value may be 1.1, since difficult samples require reinforcement learning; the conventional sample is sufficient to maintain the existing degree of learning, and therefore the second value is typically 1; whereas noise samples need reduced learning, the third value is typically less than 1, e.g. the third value may be 0.
For example, as shown in fig. 4, assuming that the initial model is used to recognize 4 different expression categories, the expression categories recognized by the initial model for the 4 sample video frame sequences in the sample face video 401 cover 4 different expression categories (each circle represents one expression category and different filling colors represent different expression categories), so the sample face video 401 corresponds to a low consistency index; the sample face video 401 is therefore determined to be a difficult sample, and its adjustment parameter is determined to be 1.1, so that the initial model learns more from the sample face video 401. The expression categories recognized by the initial model for the sample video frame sequences in the sample face video 402 include only 1 expression category, and the loss value of the sample face video 402 is small (smaller than the set loss value index), so the sample face video 402 is determined to be a conventional sample and its adjustment parameter is determined to be 1, so that the initial model neither increases nor reduces its learning on this sample face video. The expression categories recognized by the initial model for the sample video frame sequences in the sample face video 403 include only 1 expression category, but the loss value of the sample face video 403 is large (larger than the set loss value index), so the sample face video 403 is determined to be a noise sample and its adjustment parameter is determined to be 0.2, so that the initial model learns less from the sample face video 403.
After the adjustment parameter is obtained, the loss value can be calculated according to the adjustment parameter to obtain the target loss value, wherein the calculation can be any calculation function capable of realizing the positive correlation between the adjustment parameter and the target loss value.
For example, the product of the adjustment parameter and the loss value may be calculated as the target loss value. The calculation of the target loss value refers to Formula (1):

loss = λ(x) · ℓ(x)   (1)

where λ(x) = λ_noise if the sample face video x belongs to the noise samples, λ(x) = λ_normal if the sample face video x belongs to the conventional samples, and λ(x) = λ_hard if the sample face video x belongs to the difficult samples. Here, loss is the target loss value, ℓ(x) is the loss value of the sample face video x, λ_noise is the adjustment parameter (third value) corresponding to noise samples, λ_normal is the adjustment parameter (second value) corresponding to conventional samples, and λ_hard is the adjustment parameter (first value) corresponding to difficult samples.
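A small sketch combining the adjustment-parameter rule of S160 with Formula (1) is given below; the index threshold, loss value index, and the concrete first/second/third values (1.1 / 1.0 / 0.2) follow the examples in this text and are illustrative assumptions only.

def adjustment_parameter(consistency_index, loss_value,
                         index_threshold=0.75, loss_value_index=1.0,
                         first_value=1.1, second_value=1.0, third_value=0.2):
    if consistency_index < index_threshold:     # low consistency: difficult sample
        return first_value
    if loss_value > loss_value_index:           # consistent but high loss: noise sample
        return third_value
    return second_value                         # conventional sample

def target_loss(consistency_index, loss_value):
    # Formula (1): target loss value = adjustment parameter * loss value.
    return adjustment_parameter(consistency_index, loss_value) * loss_value

print(target_loss(0.5, 0.8))   # difficult sample: 1.1 * 0.8
print(target_loss(1.0, 2.5))   # noise sample:     0.2 * 2.5
print(target_loss(1.0, 0.3))   # conventional:     1.0 * 0.3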
S170, adjusting parameters of the initial model according to the target loss value to obtain the expression recognition model.
After the target loss value of the sample face video is obtained, parameters of the initial model can be adjusted according to the target loss value of the sample face video, so that a model with the adjusted parameters is obtained and is used as an expression recognition model.
The samples for training the initial model may include a plurality of batches of samples, each batch of samples is used for performing one iteration training, each iteration training process performs parameter adjustment on the initial model before the iteration training process through one batch of samples, an adjusted initial model corresponding to the iteration training process is obtained, all the iteration training processes are traversed, and the adjusted initial model corresponding to the last iteration training process is used as the expression recognition model.
The samples of one batch of each iterative training process may include a plurality of sample face videos, the target loss value of each sample face video may be determined according to the above-mentioned manner of S110-S160, the target loss values of the plurality of sample face videos in one batch are summed to obtain a batch loss value of the samples of the batch, and parameters of the initial model before the iterative training process are adjusted by the batch loss value to obtain an adjusted initial model corresponding to the iterative training process.
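The batch-level parameter adjustment described above can be sketched as follows; the optimizer, the initial model, and the per-video target-loss routine are assumed to exist and are not specified by this text.

import torch

def train_one_batch(initial_model, optimizer, batch_videos, compute_target_loss):
    # compute_target_loss implements S110-S160 for a single sample face video.
    optimizer.zero_grad()
    batch_loss = torch.stack(
        [compute_target_loss(initial_model, video) for video in batch_videos]
    ).sum()                      # batch loss value: sum of target loss values
    batch_loss.backward()        # adjust parameters of the initial model
    optimizer.step()
    return batch_loss.item()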
In other embodiments, the recognition capability of the adjusted initial model corresponding to each iterative training process may be evaluated to obtain an evaluation result, and if the evaluation result satisfies a condition, the adjusted initial model corresponding to that iterative training process is taken as the expression recognition model. The F1 score or the accuracy (the ratio of the number of correctly predicted samples to the total number of samples) of the adjusted initial model corresponding to each iterative training process may be calculated as the evaluation result, and the evaluation result satisfies the condition when it is not smaller than a set evaluation threshold. The evaluation threshold may be a value set based on requirements; for example, when the F1 score is used as the evaluation result, the evaluation threshold is a threshold corresponding to the F1 score, and when the accuracy is used as the evaluation result, the evaluation threshold is a threshold corresponding to the accuracy.
In this embodiment, the consistency index is used to indicate the degree of consistency of the expression categories identified by the initial model for different sample video frame sequences in the sample face video. In general, the expression recognition results of a plurality of sample video frame sequences derived from the same sample face video should, in theory, be consistent, i.e., the consistency index should be high; if the consistency index of the expression recognition results of a plurality of sample video frame sequences derived from the same sample face video is low, it can be determined that the sample face video is a difficult sample, so the consistency index can reflect the difficulty of performing expression recognition on the sample face video. The loss value is used to indicate the accuracy of the expression recognition performed by the initial model on the sample face video. In general, if the consistency index of the expression recognition results of a plurality of sample video frame sequences derived from the same sample face video is high but the loss value is also high, a likely reason is that the sample face video is labeled incorrectly, i.e., the sample face video is a noise sample. Based on this principle, whether the sample face video is a difficult sample or a noise sample is identified by combining the consistency index and the loss value, and the adjustment parameter is determined accordingly so that the loss value can be adjusted based on the adjustment parameter, thereby strengthening the learning of the initial model on difficult samples and weakening the learning of the initial model on noise samples. This ensures the accurate expression recognition capability of the trained expression recognition model on difficult samples, effectively reduces the influence of noise samples on the recognition accuracy of the expression recognition model, alleviates the confusion caused by noise samples and the difficulty of mining hard samples in dynamic expression recognition, and thus ensures the expression recognition accuracy of the recognition model trained according to the target loss value. In addition, when the expression recognition model is obtained through training in this way, no information other than the samples themselves is relied upon, and conventional samples, noise samples and difficult samples can be accurately distinguished even when noise samples and difficult samples are present at the same time, so that the problems of noise samples and difficult samples can be better handled; the method is therefore easy to generalize to the training process on large-scale real dynamic facial expression data, improving the performance of the model.
Meanwhile, in the embodiment, the sample position parameters can accurately indicate the video frame where the key sample expression is located, so that the key sample expression can be displayed more accurately by resampling the target sample face video frame according to the sample position parameters, the accuracy of constructing a plurality of sample video frame sequences according to a plurality of target sample face video frames is improved, and the recognition effect of the expression recognition model obtained through training is further improved.
In addition, in the embodiment, motion encoding-expression encoding is performed in the processing process of the sample video frame sequence to obtain corresponding first sample characteristics, and expression encoding-motion encoding is performed simultaneously to obtain corresponding second sample characteristics, and the target sample characteristics are determined through the first sample characteristics and the second sample characteristics, so that accurate mining of dynamic characteristics of dynamic expressions of the sample video frame sequence is realized, the target sample characteristics more accurately represent dynamic expressions of the sample video frame sequence, meanwhile, the target sample characteristics are added with global sample abstracts corresponding to the sample face video, the accuracy of the target sample characteristics is further increased, and the recognition effect of the expression recognition model obtained through training is further improved.
Referring to fig. 5, fig. 5 shows a flowchart of an expression recognition method according to an embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
s210, acquiring a face video to be identified.
The face video to be recognized may refer to a video for performing expression recognition and including a dynamic expression of a face, for example, the face video to be recognized may include a video that a man is crying, and for example, the face video to be recognized may include a video that a woman is smiling.
Generally, in order to avoid the situation that the face video to be recognized includes too many dynamic expressions of the face, which results in difficulty in accurately recognizing the face video to be recognized, the length of the face video to be recognized should not be too long, for example, the face video to be recognized may be a video with a duration of 3s-5 s.
If the duration of the video to be recognized is longer after the video to be recognized including the facial dynamic expression is acquired, the acquired video to be recognized can be cut into a plurality of videos with shorter duration, and each video after cutting can be respectively used as a video to be recognized.
S220, inputting the face video to be recognized into an expression recognition model to obtain the probability that the face video to be recognized belongs to different expression categories.
The training process of the expression recognition model is described with reference to S110-S170 above, and will not be repeated here.
The face video to be recognized can be input into an expression recognition model, feature extraction is carried out on the face video to be recognized by the expression recognition model, and expression classification is carried out on the basis of the extracted features, so that the probability that the face video to be recognized belongs to different expression categories is obtained. For example, the expression categories aimed at by the expression recognition model include happiness, sadness and anger, and the face video to be recognized is input into the expression recognition model to obtain probability pd1 that the face video to be recognized belongs to happiness, probability pd2 that the face video to be recognized belongs to sadness and probability pd3 that the face video to be recognized belongs to anger.
As described above, the initial model includes an initial feature extraction network and an initial classification layer, and the initial feature extraction network includes an initial basic feature extraction network, an initial first feature extraction network, and an initial second feature extraction network. Therefore, the trained expression recognition model may include a feature extraction network corresponding to the initial feature extraction network and a classification layer corresponding to the initial classification layer, and the feature extraction network includes a basic feature extraction network corresponding to the initial basic feature extraction network, a first feature extraction network corresponding to the initial first feature extraction network, and a second feature extraction network corresponding to the initial second feature extraction network.
As an implementation manner, a plurality of input frames can be sampled from the face video to be identified, and a plurality of video frame sequences are determined according to the plurality of input frames; and then, inputting a plurality of video frame sequences into the expression recognition model to obtain the probability that the face video to be recognized belongs to different expression categories.
It is possible to uniformly sample a plurality of input frames from the face video to be recognized and then uniformly divide the plurality of input frames into a plurality of video frame sequences. For example, 40 input frames are uniformly sampled from the face video to be identified, and then the 40 input frames are uniformly divided into 10 video frame sequences, each comprising 4 input frames.
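The uniform sampling and division into video frame sequences in the example above can be sketched as follows; the frame counts (40 input frames, 4 frames per sequence) are taken from the example and are not fixed by the scheme.

import numpy as np

def sample_and_split(total_frames, num_input_frames=40, frames_per_sequence=4):
    # Uniformly spaced frame indices across the face video to be identified.
    indices = np.linspace(0, total_frames - 1, num_input_frames, dtype=int)
    # Consecutive groups of frames_per_sequence indices form one video frame sequence.
    return indices.reshape(-1, frames_per_sequence)

sequences = sample_and_split(total_frames=200)
print(sequences.shape)   # (10, 4): 10 video frame sequences of 4 input frames each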
As one embodiment, determining a plurality of video frame sequences from a plurality of input frames may include: extracting characteristics of a plurality of input frames through a backbone network to obtain input characteristics; determining a position parameter according to the input characteristics, wherein the position parameter is used for indicating the position of a key frame in the face video to be identified, and the key frame is a video frame presenting a key expression; sampling in the sample face video according to the position parameters to obtain a plurality of target face video frames; and constructing a plurality of video frame sequences according to the plurality of target face video frames.
The backbone network may be a 2-dimensional convolutional neural network, a 3-dimensional convolutional neural network, or a vision Transformer network for video recognition. The plurality of input frames may be input into the backbone network for feature extraction, the backbone network outputs the input features, and the input features may then be input into a fully connected layer, where the fully connected layer processes the input features to obtain the position parameters, and the position parameters may include a center position and a step size.
The key expression refers to the expression category with the highest appearance frequency in the face video to be identified or the expression category with the highest probability in a plurality of input frames, and the position parameter can accurately indicate the video frame presenting the key expression, so that a plurality of target face video frames determined according to the position parameter are more matched with the key frame in the face video to be identified, and the plurality of target face video frames can more accurately reflect the key expression.
Meanwhile, the key expression is actually the expression category predicted according to a plurality of input frames through the trunk network and the full-connection layer, and the position parameter is the parameter predicted according to a plurality of input frames through the trunk network and the full-connection layer, so that the key expression and the position parameter do not need to be manually determined, and the acquisition efficiency of the position parameter of the key expression is greatly improved.
After obtaining the position parameter, resampling a plurality of video frames in the face video to be identified through the position parameter, wherein each video frame obtained through resampling serves as a target face video frame. The process of resampling a plurality of target face video frames in the face video to be identified according to the position parameters may include: determining a central key video frame according to the central position; taking the central key video frame as the center, and sampling in the face video to be identified according to the step length to obtain a plurality of sampling video frames; and summarizing the plurality of sampling video frames and the central key video frame to obtain a plurality of target face video frames.
The center position is used to indicate the position, in the face video to be identified, of the key frame located at the center, so the product of the total number of video frames in the face video to be identified and the center position can be used as the frame number of the central key frame, and the central key video frame can thereby be determined. Then, taking the central key video frame as the center, sampling is performed, according to the step size, among the video frames located on both sides of the central key video frame in the face video to be identified, so as to obtain a plurality of sampled video frames.
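A small sketch of this resampling is given below; whether the central key video frame itself is kept and how many frames are taken on each side are not specified in the text, so the count and boundary handling here are assumptions.

def resample_frames(total_frames, center_position, step, num_frames=16):
    center = int(round(total_frames * center_position))   # central key video frame
    half = num_frames // 2
    # Sample symmetrically around the central key video frame with the given step.
    frames = [min(max(center + k * step, 0), total_frames - 1)
              for k in range(-half, half)]
    return sorted(set(frames))

print(resample_frames(total_frames=128, center_position=0.5, step=4))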
After obtaining the plurality of target face video frames, the plurality of target face video frames can be divided according to the arrangement sequence of the plurality of target face video frames in the face video to be identified, so as to obtain a plurality of video frame sequences. The method comprises the steps of uniformly dividing a plurality of target face video frames to obtain a plurality of groups, and arranging the plurality of target face video frames in each group according to the arrangement sequence of the target face video frames in the face video to obtain a video frame sequence.
After determining a plurality of video frame sequences, determining a target feature of each video frame sequence through a feature extraction network; determining the probability that each video frame sequence belongs to different expression categories according to the target characteristics of each video frame sequence through a classification layer; and determining the probability that the face video to be recognized belongs to different expression categories according to the probabilities that the video frame sequences belong to different expression categories.
The method comprises the steps of extracting features of a video frame sequence by a feature extraction network to obtain target features of the video frame sequence, and carrying out full connection processing on the target features of the video frame sequence by a classification layer to obtain probabilities that the video frame sequence belongs to different expression categories.
As one embodiment, the feature extraction network includes a basic feature extraction network, a first feature extraction network, and a second feature extraction network; the extraction process of the target feature of each video frame sequence may include: constructing a plurality of auxiliary video frame sequences according to the plurality of video frame sequences, wherein different target face video frames in each auxiliary video frame sequence belong to different video frame sequences; extracting the features of each video frame sequence through the basic feature extraction network as the initial features corresponding to each video frame sequence; extracting the features of each auxiliary video frame sequence through the basic feature extraction network as the initial features corresponding to each auxiliary video frame sequence; performing motion coding on the initial features corresponding to each video frame sequence by the first feature extraction network to obtain the motion coding features corresponding to each video frame sequence; performing expression coding by the second feature extraction network based on the motion coding features corresponding to the plurality of video frame sequences to obtain the intermediate features corresponding to each video frame sequence; performing expression coding by the second feature extraction network based on the initial features corresponding to each auxiliary video frame sequence to obtain the expression coding features corresponding to each auxiliary video frame sequence; performing motion coding on the expression coding features corresponding to the plurality of auxiliary video frame sequences by the first feature extraction network to obtain the intermediate features corresponding to each auxiliary video frame sequence; and determining the target feature of each video frame sequence according to the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences.
For each video frame sequence, the target face video frames in the video frame sequence are arranged according to the arrangement sequence of the target face video frames in the face video, and no target face video frames in other video frame sequences exist between any adjacent target face video frames in the video frame sequence. For example, four target face video frames with frame numbers 4, 8, 12, and 16 are sequentially arranged to obtain a video frame sequence L11, four target face video frames with frame numbers 20, 24, 28, and 32 are sequentially arranged to obtain a video frame sequence L12, four target face video frames with frame numbers 36, 40, 44, and 48 are sequentially arranged to obtain a video frame sequence L13, and four target face video frames with frame numbers 52, 56, 60, and 64 are sequentially arranged to obtain a video frame sequence L14.
A plurality of auxiliary video frame sequences can be constructed according to the plurality of video frame sequences, and different target face video frames in each auxiliary video frame sequence belong to different video frame sequences. An auxiliary video frame sequence can be constructed from the target face video frames that have the same sequence number in each video frame sequence (the sequence number being the position of the target face video frame within its video frame sequence), or an auxiliary video frame sequence can be constructed by randomly selecting one target face video frame from each video frame sequence. For example, an auxiliary video frame sequence may be constructed from the first target face video frame of each video frame sequence; as another example, the target face video frame with sequence number 1 in the 1st video frame sequence, the target face video frame with sequence number 2 in the 2nd video frame sequence, the target face video frame with sequence number 3 in the 3rd video frame sequence, and the target face video frame with sequence number 4 in the 4th video frame sequence may together form an auxiliary video frame sequence.
For example, four target face video frames with frame numbers 4, 8, 12 and 16 are sequentially arranged to obtain a video frame sequence L11, four target face video frames with frame numbers 20, 24, 28 and 32 are sequentially arranged to obtain a video frame sequence L12, four target face video frames with frame numbers 36, 40, 44 and 48 are sequentially arranged to obtain a video frame sequence L13, and four target face video frames with frame numbers 52, 56, 60 and 64 are sequentially arranged to obtain a video frame sequence L14. The target face video frames with frame numbers 4, 20, 36 and 52 are used to construct an auxiliary video frame sequence GL11 corresponding to the video frame sequence L11, the target face video frames with frame numbers 8, 24, 40 and 56 are used to construct an auxiliary video frame sequence GL12 corresponding to the video frame sequence L12, the target face video frames with frame numbers 12, 28, 44 and 60 are used to construct an auxiliary video frame sequence GL13 corresponding to the video frame sequence L13, and the target face video frames with frame numbers 16, 32, 48 and 64 are used to construct an auxiliary video frame sequence GL14 corresponding to the video frame sequence L14.
Inputting the video frame sequence into a basic feature extraction network to obtain initial features corresponding to the video frame sequence, and inputting the auxiliary video frame sequence into the basic feature extraction network to obtain initial features corresponding to the auxiliary video frame sequence.
And then, the initial characteristics corresponding to the video frame sequence can be processed through the first characteristic extraction network and the second characteristic extraction network to obtain intermediate characteristics corresponding to the video frame sequence, and the initial characteristics corresponding to the auxiliary video frame sequence can be processed through the first characteristic extraction network and the second characteristic extraction network to obtain intermediate characteristics corresponding to the auxiliary video frame sequence. Traversing all the video frame sequences and the auxiliary video frame sequences to obtain intermediate features corresponding to the plurality of video frame sequences and intermediate features corresponding to the plurality of auxiliary video frame sequences, and determining target features corresponding to each video frame sequence according to the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences.
The initial features corresponding to the video frame sequences comprise the respective features of each target face video frame in the video frame sequences, the respective features of each target face video frame in the video frame sequences can be input into a first feature extraction network for motion coding, respective motion coding results of each target face video frame in the video frame sequences are obtained, the respective motion coding results of a plurality of target face video frames in the video frame sequences are averaged to obtain motion coding features corresponding to the video frame sequences, all the video frame sequences are traversed to obtain the respective motion coding features of the plurality of video frame sequences, then the respective motion coding features corresponding to the plurality of video frame sequences are input into a second feature extraction network, and the motion coding features corresponding to the plurality of video frame sequences are subjected to expression coding by the second feature extraction network to obtain intermediate features corresponding to the plurality of video frame sequences. When the second feature extraction network encodes the motion coding feature corresponding to each video frame sequence, the motion coding features corresponding to other video frame sequences are referred to, that is, a plurality of motion coding features corresponding to a plurality of video frame sequences are input into the second feature extraction network at the same time, so as to obtain intermediate features corresponding to the video frame sequences.
Similarly, the initial features corresponding to the auxiliary video frame sequence include the respective features of each target face video frame in the auxiliary video frame sequence, the respective features of each target face video frame in the auxiliary video frame sequence can be input into a second feature extraction network for performing expression coding to obtain respective expression coding results of each target face video frame in the auxiliary video frame sequence, the respective expression coding results of the plurality of target face video frames in the auxiliary video frame sequence are averaged to obtain expression coding features corresponding to the auxiliary video frame sequence, all the auxiliary video frame sequences are traversed to obtain respective expression coding features of the plurality of auxiliary video frame sequences, then the plurality of expression coding features corresponding to the plurality of auxiliary video frame sequences are input into a first feature extraction network, and the first feature extraction network performs motion coding on the plurality of expression coding features corresponding to the plurality of auxiliary video frame sequences to obtain respective intermediate features of the plurality of auxiliary video frame sequences. When the second feature extraction network performs expression coding on the respective features of each target face video frame in the auxiliary video frame sequence, the respective features of other target face video frames in the auxiliary video frame sequence are referred to, that is, the features of a plurality of target face video frames in each auxiliary video frame sequence are input into the second feature extraction network at the same time, so that expression coding results corresponding to the target face video frames are obtained.
Obtaining intermediate features corresponding to the plurality of video frame sequences and intermediate features corresponding to the plurality of auxiliary video frame sequences; the intermediate features corresponding to the video frame sequences and the intermediate features corresponding to the auxiliary video frame sequences can be spliced to obtain spliced features, and the spliced features serve as target features corresponding to the video frame sequences participating in splicing.
As an embodiment, the intermediate feature corresponding to any one of the video frame sequences and the intermediate feature corresponding to any one of the auxiliary video frame sequences may be spliced.
As yet another embodiment, determining the target feature of each video frame sequence from the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences includes: splicing the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences to obtain an intermediate splicing result corresponding to each video frame sequence; acquiring a global abstract of a face video to be identified; and splicing the global abstract with the intermediate splicing result corresponding to each video frame sequence respectively to obtain the target feature corresponding to each video frame sequence. The intermediate features corresponding to any one video frame sequence and the intermediate features corresponding to any one auxiliary video frame sequence can be spliced to obtain spliced features, and the spliced features are used as intermediate splicing results corresponding to the video frame sequences participating in splicing.
The global abstract of the face video to be identified can be obtained by extracting features of the plurality of input frames through a backbone network to obtain a feature extraction result, and inputting the feature extraction result into an abstract extraction network for abstract extraction. After the intermediate splicing results corresponding to the video frame sequences are obtained, the global abstract is spliced with the intermediate splicing result corresponding to each video frame sequence respectively, so as to obtain the target feature corresponding to each video frame sequence.
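For illustration only, the splicing described above can be implemented as a simple concatenation; pairing the i-th video frame sequence with the i-th auxiliary video frame sequence is an assumption borrowed from the row/column pairing of the scene described later, and the function name build_target_features is hypothetical:

    import torch

    def build_target_features(row_inters, col_inters, global_abstract):
        # row_inters: (N, D) intermediate features of the N video frame sequences
        # col_inters: (N, D) intermediate features of the N auxiliary sequences
        # global_abstract: (Dg,) global abstract of the whole face video
        targets = []
        for row_feat, col_feat in zip(row_inters, col_inters):
            # Intermediate splicing result: row feature joined with column feature.
            inter_concat = torch.cat([row_feat, col_feat], dim=-1)
            # Target feature: global abstract joined with the splicing result.
            targets.append(torch.cat([global_abstract, inter_concat], dim=-1))
        return torch.stack(targets)  # (N, Dg + 2 * D)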
After the target features are obtained, the target features of the video frame sequence can be subjected to full connection processing through a full connection layer in the initial classification layer, so that the probability that the video frame sequence belongs to different expression categories is obtained. And traversing all video frame sequences according to the process to obtain the probability that each video frame sequence belongs to different expression categories.
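A minimal sketch of such a classification step, assuming a single fully connected layer followed by softmax (the class name and dimensions are illustrative, not prescribed by this embodiment):

    import torch.nn as nn

    class SequenceClassifier(nn.Module):
        # Maps the target feature of a video frame sequence to the probabilities
        # that the sequence belongs to each expression category.
        def __init__(self, target_dim, num_classes):
            super().__init__()
            self.fc = nn.Linear(target_dim, num_classes)

        def forward(self, target_feats):          # (N, target_dim)
            logits = self.fc(target_feats)        # (N, num_classes)
            return logits.softmax(dim=-1)         # per-sequence probabilities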
After the probabilities that the video frame sequences belong to different expression categories are obtained, the probabilities that the face video to be recognized belongs to the different expression categories can be determined continuously according to the probabilities that the video frame sequences belong to the different expression categories.
According to the probability that the plurality of video frame sequences belong to different expression categories, one expression category to which the face video to be recognized most possibly belongs can be determined and used as a second expression category to which the face video belongs. For each expression category, an average value of probabilities that a plurality of video frame sequences belong to the expression category can be calculated as the probability that the face video belongs to the expression category.
For example, the face video to be recognized includes a video frame sequence d1 and a video frame sequence d2, where the probability that d1 belongs to happy is pd11, the probability that d1 belongs to sad is pd12, the probability that d1 belongs to angry is pd13, the probability that d2 belongs to happy is pd21, the probability that d2 belongs to sad is pd22, and the probability that d2 belongs to angry is pd23. The average value of pd11 and pd21 is calculated as the probability py1 that the face video belongs to happy, the average value of pd12 and pd22 as the probability py2 that the face video belongs to sad, and the average value of pd13 and pd23 as the probability py3 that the face video belongs to angry.
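The averaging can be written out directly; the concrete probability values below are made up purely for illustration and do not come from this embodiment:

    # Per-sequence probabilities over [happy, sad, angry] (illustrative numbers).
    pd1 = [0.6, 0.3, 0.1]   # pd11, pd12, pd13 for sequence d1
    pd2 = [0.4, 0.5, 0.1]   # pd21, pd22, pd23 for sequence d2
    # Video-level probabilities py1, py2, py3: the mean over the sequences.
    py = [(a + b) / 2 for a, b in zip(pd1, pd2)]
    print(py)   # [0.5, 0.4, 0.1]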
S230, determining the expression category with the highest probability based on the probability that the face video to be identified belongs to different expression categories, and taking the expression category as the expression identification result of the face video to be identified.
For example, the face video to be recognized includes a video frame sequence d1 and a video frame sequence d2, where the probability that d1 belongs to happy is pd11, the probability that d1 belongs to sad is pd12, the probability that d1 belongs to angry is pd13, and the corresponding probabilities for d2 are pd21, pd22 and pd23. The average value of pd11 and pd21 is calculated as the probability py1 that the face video belongs to happy, the average value of pd12 and pd22 as the probability py2 that the face video belongs to sad, and the average value of pd13 and pd23 as the probability py3 that the face video belongs to angry. If py2 is the highest of the three, sad is determined as the expression recognition result of the face video to be recognized.
In this embodiment, the consistency index indicates how consistent the expression categories identified by the initial model are across different sample video frame sequences in the same sample face video. In general, the expression recognition results of multiple sample video frame sequences derived from the same sample face video should theoretically be consistent, that is, the consistency index should be high; if the consistency index of the expression recognition results of multiple sample video frame sequences derived from the same sample face video is low, the sample face video can be judged to be a difficult sample, so the consistency index reflects how difficult it is to perform expression recognition on the sample face video. The loss value indicates the accuracy of the expression recognition performed by the initial model on the sample face video. In general, if the consistency index of the expression recognition results of multiple sample video frame sequences derived from the same sample face video is high but the loss value is also high, a likely reason is that the sample face video is labeled incorrectly, that is, the sample face video is a noise sample. Based on this principle, the consistency index and the loss value are combined to identify whether a sample face video is a difficult sample or a noise sample, and the adjustment parameter is determined accordingly so that the loss value can be adjusted based on the adjustment parameter. This strengthens the learning of the initial model on difficult samples and weakens its learning on noise samples, which ensures that the trained expression recognition model can accurately recognize expressions of difficult samples while effectively reducing the influence of noise samples on its recognition accuracy, thereby addressing the confusion caused by noise samples and the difficulty of hard-sample mining in dynamic expression recognition, and ensuring the expression recognition accuracy of the recognition model trained with the target loss value. In addition, training the expression recognition model in this way does not rely on any information beyond the samples themselves, and ordinary samples, noise samples and difficult samples can be accurately distinguished even when noise samples and difficult samples are present at the same time, so the problems of noise samples and difficult samples are better handled. The method is therefore easy to apply to the training of large-scale real dynamic facial expression data and improves the performance of the model.
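The following sketch illustrates this sample-weighting idea. The threshold values and the three adjustment values are placeholders chosen only to satisfy the ordering described in the optional implementation below (first value > second value > third value); they are not values given by this application:

    from collections import Counter

    def consistency_index(first_categories):
        # Proportion of the most frequent first expression category among the
        # categories predicted for the sample video frame sequences.
        counts = Counter(first_categories)
        return counts.most_common(1)[0][1] / len(first_categories)

    def adjustment_parameter(consistency, loss,
                             index_threshold=0.75, loss_threshold=1.5,
                             first_value=2.0, second_value=1.0, third_value=0.1):
        # Thresholds and the three values are hypothetical placeholders.
        if consistency < index_threshold:
            return first_value      # low consistency: difficult sample, strengthen learning
        if loss <= loss_threshold:
            return second_value     # consistent and low loss: ordinary sample
        return third_value          # consistent but high loss: likely noise sample, weaken learning

    def target_loss(loss, consistency):
        # Target loss value = loss value x adjustment parameter.
        return loss * adjustment_parameter(consistency, loss)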
In order to more clearly explain the technical scheme of the application, the expression recognition method of the application is explained below in combination with an exemplary scene. In the scene, the expression recognition model comprises a basic feature extraction network, a first feature extraction network, a second feature extraction network and a classification layer, wherein the first feature extraction network is used for performing motion coding, and the second feature extraction network is used for performing expression coding.
As shown in fig. 6, the face video to be identified is uniformly sampled to obtain a plurality of input frames, the plurality of input frames are input into a backbone network to perform feature extraction to obtain input features, and then the position parameters and the global abstract are determined according to the input features.
And resampling the face video to be identified through the position parameters to obtain 4 video frame sequences, wherein each video frame sequence comprises 4 video frames.
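As an illustrative sketch of this resampling step (the symmetric offsets around the central key video frame, the clamping to the video boundary, and the example center/step values are assumptions; the embodiment only specifies sampling according to a center position and a step size):

    def sample_target_frames(num_frames, center, step, frames_per_sequence=4):
        # num_frames: total number of frames in the face video to be identified
        # center: index of the central key video frame; step: sampling step size
        half = frames_per_sequence // 2
        offsets = range(-half, frames_per_sequence - half)
        # Clamp to the valid range; near the boundary some indices may coincide.
        indices = [min(max(center + o * step, 0), num_frames - 1) for o in offsets]
        return sorted(indices)

    # Four hypothetical (center, step) pairs would yield the 4 video frame
    # sequences of 4 frames each used in the scene of fig. 7.
    sequences = [sample_target_frames(300, c, s)
                 for c, s in [(40, 5), (120, 5), (200, 5), (280, 5)]]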
As shown in fig. 7, the 4 video frames of each row are taken as one video frame sequence, and a plurality of auxiliary video frame sequences are determined according to the plurality of video frame sequences, wherein the 4 video frames of each column are taken as one auxiliary video frame sequence. The video frame sequence of each row shows the motion change of the same expression category; the video frame intervals of the auxiliary video frame sequence of each column are larger, so the auxiliary video frame sequence of each column shows changes across different expression categories.
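Concretely, if the 16 sampled target face video frames are arranged row by row in temporal order as in fig. 7, the row and column sequences could be formed as follows (a plain illustrative sketch, not the only possible grouping):

    def build_sequences(frames, rows=4, cols=4):
        # frames: the 16 target face video frames, laid out row by row as in fig. 7.
        grid = [frames[r * cols:(r + 1) * cols] for r in range(rows)]
        video_frame_sequences = grid                        # each row: 4 temporally close frames
        auxiliary_sequences = [[grid[r][c] for r in range(rows)]
                               for c in range(cols)]        # each column: one frame per row sequence
        return video_frame_sequences, auxiliary_sequences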
The initial features corresponding to the video frame sequence of the first row are input into the first feature extraction network for motion coding to obtain the motion coding features corresponding to the video frame sequence of the first row; then all the motion coding features corresponding to the video frame sequences of the four rows are input into the second feature extraction network for expression coding to obtain the intermediate features corresponding to the video frame sequence of each row. At the same time, the auxiliary video frame sequence of the first column is input into the second feature extraction network for expression coding to obtain the expression coding features corresponding to the auxiliary video frame sequence of the first column; then all the expression coding features corresponding to the auxiliary video frame sequences of the four columns are input into the first feature extraction network for motion coding to obtain the intermediate features corresponding to the auxiliary video frame sequence of each column. Next, the intermediate features corresponding to the video frame sequence of the first row and the intermediate features corresponding to the auxiliary video frame sequence of the first column are spliced to obtain the intermediate splicing result corresponding to the video frame sequence of the first row; the intermediate features of the second row and of the second column are spliced to obtain the intermediate splicing result corresponding to the video frame sequence of the second row; the intermediate features of the third row and of the third column are spliced to obtain the intermediate splicing result corresponding to the video frame sequence of the third row; and the intermediate features of the fourth row and of the fourth column are spliced to obtain the intermediate splicing result corresponding to the video frame sequence of the fourth row.
And then, splicing the global abstract with the intermediate splicing results corresponding to the video frame sequences of each row respectively to obtain the target features corresponding to the video frame sequences of each row. And inputting the target features corresponding to the video frame sequences of each row into a classification layer of the expression recognition model to obtain the probability that the video frame sequences of each row belong to different expression categories.
Aiming at each expression category, taking the average value of the probabilities that the video frame sequence of the first row, the video frame sequence of the second row, the video frame sequence of the third row and the video frame sequence of the fourth row belong to the expression category as the probability that the face video to be recognized belongs to the expression category, traversing all the expression categories, and obtaining the probabilities that the face video to be recognized belongs to different expression categories. And finally, determining the expression category with the highest probability value to be happy as the expression recognition result of the face video to be recognized according to the probability that the face video to be recognized belongs to different expression categories.
In this scene, dynamic facial expressions can be accurately recognized, which plays a very important role in many systems such as intelligent coaching systems, service robots, automatic driving, and mental health analysis. Meanwhile, the initial model can be built with a multi-layer perceptron, a convolutional neural network, a Vision Transformer and the like, which reduces the difficulty of deploying the initial model; the method is applicable to all dynamic expression recognition products based on artificial intelligence models and can effectively improve the recognition capability of dynamic expression recognition.
Referring to fig. 8, fig. 8 is a block diagram illustrating an expression recognition apparatus according to an embodiment of the present application, an apparatus 700 includes:
an acquisition module 710, configured to acquire a face video to be identified;
the recognition module 720 is configured to input a face video to be recognized into an expression recognition model to obtain probabilities that the face video to be recognized belongs to different expression categories, the expression recognition model is obtained by training an initial model through a target loss value corresponding to a sample face video, the target loss value is obtained after the loss value of the sample face video is adjusted through an adjustment parameter corresponding to the sample face video, the loss value is used for indicating the accuracy of the initial model in carrying out expression recognition on the sample face video, the adjustment parameter is determined through a consistency index corresponding to the sample face video and the loss value, and the consistency index is used for indicating the consistency degree of the initial model in terms of the expression categories recognized by different sample video frame sequences in the sample face video;
the determining module 730 is configured to determine, based on probabilities that the face video to be identified belongs to different expression categories, an expression category with a highest probability as an expression identification result of the face video to be identified.
Optionally, the apparatus further comprises a training module for determining a plurality of sample video frame sequences from the sample face video; determining the probability that each sample video frame sequence belongs to different expression categories through an initial model; according to the probability that each sample video frame sequence belongs to different expression categories, determining a first expression category to which each sample video frame sequence belongs and a second expression category to which a sample face video belongs; determining a consistency index corresponding to the sample face video according to the duty ratio of each first expression category in the first expression categories to which the plurality of sample video frame sequences belong; determining a loss value of the sample face video according to a second expression category to which the sample face video belongs; determining an adjustment parameter according to the loss value and the consistency index; according to the adjustment parameters, the loss values are adjusted to obtain target loss values corresponding to the sample face video; and adjusting parameters of the initial model according to the target loss value to obtain the expression recognition model.
Optionally, the training module is further configured to obtain the first value as the adjustment parameter if the consistency index is lower than the index threshold; if the consistency index is not lower than the index threshold and the loss value is not higher than the loss value index, acquiring a second value as an adjustment parameter; the first value is higher than the second value; if the consistency index is not lower than the index threshold and the loss value is higher than the loss value index, acquiring a third numerical value as an adjustment parameter; the second value is higher than the third value.
Optionally, the training module is further configured to calculate a product of the loss value and the adjustment parameter, as a target loss value corresponding to the sample face video.
Optionally, the training module is further configured to determine a first expression category with the highest duty ratio according to the duty ratio of each first expression category in the first expression categories to which the plurality of sample video frame sequences belong; and take the duty ratio of the first expression category with the highest duty ratio as the consistency index corresponding to the sample face video.
Optionally, the training module is further configured to use an expression class with the highest probability among probabilities that the sample video frame sequence belongs to different expression classes as a first expression class to which the sample video frame sequence belongs; for each expression category, calculating the average value of probabilities that a plurality of sample video frame sequences belong to the expression category as the probability that the sample face video belongs to the expression category; and taking the expression category with the highest probability in the probabilities that the sample face video belongs to different expression categories as a second expression category to which the sample face video belongs.
Optionally, the apparatus further comprises a preprocessing module, configured to sample a plurality of input frames from the face video to be identified; determining a plurality of video frame sequences from a plurality of input frames; the recognition module 720 is further configured to input a plurality of video frame sequences into the expression recognition model, so as to obtain probabilities that the face video to be recognized belongs to different expression categories.
Optionally, the preprocessing module is further configured to perform feature extraction on a plurality of input frames through a backbone network to obtain input features; determining a position parameter according to the input characteristics, wherein the position parameter is used for indicating the position of a key frame in the face video to be identified, and the key frame is a video frame presenting a key expression; sampling in the face video to be identified according to the position parameters to obtain a plurality of target face video frames; and constructing a plurality of video frame sequences according to the plurality of target face video frames.
Optionally, the position parameter includes a center position and a step size corresponding to the key frame; the preprocessing module is also used for determining a central key video frame according to the central position; taking the central key video frame as the center, and sampling in the face video to be identified according to the step length to obtain a plurality of sampling video frames; and summarizing the plurality of sampling video frames and the central key video frame to obtain a plurality of target face video frames.
Optionally, the expression recognition model comprises a feature extraction network and a classification layer; the identifying module 720 is further configured to determine, through the feature extraction network, a target feature of each video frame sequence based on the plurality of video frame sequences; determining the probability that each video frame sequence belongs to different expression categories according to the target characteristics of each video frame sequence through a classification layer; and determining the probability that the face video to be recognized belongs to different expression categories according to the probabilities that the video frame sequences belong to different expression categories.
Optionally, the feature extraction network includes a basic feature extraction network, a first feature extraction network, and a second feature extraction network; the target face video frames in a video frame sequence are arranged according to their arrangement order in the face video to be identified, and no target face video frame of another video frame sequence exists between any adjacent target face video frames in the video frame sequence; the identifying module 720 is further configured to construct a plurality of auxiliary video frame sequences according to the plurality of video frame sequences, where different target face video frames in an auxiliary video frame sequence belong to different video frame sequences; extract the features of each video frame sequence through the basic feature extraction network as the initial features corresponding to each video frame sequence; extract the features of each auxiliary video frame sequence through the basic feature extraction network as the initial features corresponding to each auxiliary video frame sequence; perform motion coding on the initial features corresponding to each video frame sequence by the first feature extraction network to obtain the motion coding features corresponding to each video frame sequence; perform expression coding by the second feature extraction network based on the motion coding features corresponding to the plurality of video frame sequences to obtain the intermediate features corresponding to each video frame sequence; perform expression coding by the second feature extraction network based on the initial features corresponding to each auxiliary video frame sequence to obtain the expression coding features corresponding to each auxiliary video frame sequence; perform motion coding on the expression coding features corresponding to the plurality of auxiliary video frame sequences by the first feature extraction network to obtain the intermediate features corresponding to each auxiliary video frame sequence; and determine the target feature of each video frame sequence according to the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences.
Optionally, the identifying module 720 is further configured to splice the intermediate features corresponding to the multiple video frame sequences and the intermediate features corresponding to the multiple auxiliary video frame sequences to obtain an intermediate splicing result corresponding to each video frame sequence; acquiring a global abstract of a face video to be identified; and splicing the global abstract with the intermediate splicing result corresponding to each video frame sequence respectively to obtain the target feature corresponding to each video frame sequence.
It should be noted that, in the present application, the device embodiment and the foregoing method embodiment correspond to each other, and specific principles in the device embodiment may refer to the content in the foregoing method embodiment, which is not described herein again.
Fig. 9 shows a block diagram of an electronic device for performing an expression recognition method according to an embodiment of the present application. The electronic device may be the terminal 20 or the server 10 in fig. 1, and it should be noted that, the computer system 1200 of the electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 9, the computer system 1200 includes a central processing unit (Central Processing Unit, CPU) 1201 which can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access Memory (Random Access Memory, RAM) 1203. In the RAM 1203, various programs and data required for the system operation are also stored. The CPU1201, ROM1202, and RAM 1203 are connected to each other through a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1209, and/or installed from the removable media 1211. When executed by a Central Processing Unit (CPU) 1201, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be provided in a processor. In some cases, the names of the units do not constitute a limitation of the units themselves.
As another aspect, the present application also provides a computer-readable storage medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the above embodiments.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the method of any of the embodiments described above.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause an electronic device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. An expression recognition method, characterized in that the method comprises:
acquiring a face video to be identified;
inputting the face video to be recognized into an expression recognition model to obtain the probability that the face video to be recognized belongs to different expression categories; the training process of the expression recognition model comprises the following steps: determining a plurality of sample video frame sequences according to the sample face video; determining the probability that each sample video frame sequence belongs to different expression categories through an initial model; determining a first expression category to which each sample video frame sequence belongs and a second expression category to which the sample face video belongs according to the probability that each sample video frame sequence belongs to different expression categories; determining a consistency index corresponding to the sample face video according to the duty ratio of each first expression category in the first expression categories to which the plurality of sample video frame sequences belong; determining a loss value of the sample face video according to the second expression category to which the sample face video belongs; determining an adjustment parameter according to the loss value and the consistency index; adjusting the loss value according to the adjustment parameter to obtain a target loss value corresponding to the sample face video; adjusting parameters of the initial model according to the target loss value to obtain the expression recognition model; the loss value is used for indicating the accuracy of the initial model in carrying out expression recognition on the sample face video; the consistency index is used for indicating the consistency degree of the expression category identified by the initial model aiming at different sample video frame sequences in the sample face video;
And determining the expression category with the highest probability based on the probability that the face video to be identified belongs to different expression categories, and taking the expression category with the highest probability as an expression identification result of the face video to be identified.
2. The method of claim 1, wherein determining an adjustment parameter based on the loss value and the consistency indicator comprises:
if the consistency index is lower than an index threshold, acquiring a first numerical value as the adjustment parameter;
if the consistency index is not lower than the index threshold and the loss value is not higher than the loss value index, acquiring a second value as the adjustment parameter; the first value is higher than the second value;
if the consistency index is not lower than the index threshold and the loss value is higher than the loss value index, acquiring a third numerical value as the adjustment parameter; the second value is higher than the third value.
3. The method of claim 1, wherein the adjusting the loss value according to the adjustment parameter to obtain the target loss value corresponding to the sample face video comprises:
and calculating the product of the loss value and the adjustment parameter to serve as a target loss value corresponding to the sample face video.
4. The method of claim 1, wherein the determining, according to the duty ratio of each of the first expression categories to which the plurality of sample video frame sequences belong, a consistency index corresponding to the sample face video includes:
determining a first expression category with the highest occupation ratio according to the occupation ratio of each first expression category in the first expression categories to which the plurality of sample video frame sequences belong;
and taking the duty ratio of the first expression category with the highest duty ratio as a consistency index corresponding to the sample face video.
5. The method of claim 1, wherein determining the first expression category to which each of the sample video frame sequences belongs and the second expression category to which the sample face video belongs according to the probability that each of the sample video frame sequences belongs to different expression categories comprises:
taking the expression category with the highest probability in the probabilities of the sample video frame sequence belonging to different expression categories as a first expression category to which the sample video frame sequence belongs;
for each expression category, calculating an average value of probabilities that the plurality of sample video frame sequences belong to the expression category as the probability that the sample face video belongs to the expression category;
And taking the expression category with the highest probability in the probabilities of the sample face video belonging to different expression categories as a second expression category to which the sample face video belongs.
6. The method according to claim 1, wherein before the inputting the face video to be recognized into the expression recognition model to obtain probabilities that the face video to be recognized belongs to different expression categories, the method further comprises:
sampling a plurality of input frames from the face video to be identified;
determining a plurality of video frame sequences from the plurality of input frames;
inputting the face video to be recognized into an expression recognition model to obtain the probability that the face video to be recognized belongs to different expression categories, wherein the method comprises the following steps:
and inputting the video frame sequences into an expression recognition model to obtain the probability that the face video to be recognized belongs to different expression categories.
7. The method of claim 6, wherein said determining a plurality of video frame sequences from said plurality of input frames comprises:
extracting features of the plurality of input frames through a backbone network to obtain input features;
determining a position parameter according to the input characteristics, wherein the position parameter is used for indicating the position of a key frame in the face video to be identified, and the key frame is a video frame presenting a key expression;
Sampling in the face video to be identified according to the position parameters to obtain a plurality of target face video frames;
and constructing a plurality of video frame sequences according to the plurality of target face video frames.
8. The method of claim 7, wherein the location parameters include a center location and a step size corresponding to the key frame; sampling in the face video to be identified according to the position parameter to obtain a plurality of target face video frames, including:
determining a central key video frame according to the central position;
taking the central key video frame as a center, and sampling in the face video to be identified according to the step length to obtain a plurality of sampling video frames;
and summarizing the plurality of sampling video frames and the central key video frame to obtain the plurality of target face video frames.
9. The method of claim 6, wherein the expression recognition model includes a feature extraction network and a classification layer;
inputting the plurality of video frame sequences into an expression recognition model to obtain probabilities that the face video to be recognized belongs to different expression categories, wherein the method comprises the following steps:
determining, by the feature extraction network, a target feature for each of the video frame sequences based on the plurality of video frame sequences;
Determining the probability of each video frame sequence belonging to different expression categories through the classification layer according to the target characteristics of each video frame sequence;
and determining the probability that the face video to be recognized belongs to different expression categories according to the probabilities that the video frame sequences belong to different expression categories.
10. The method of claim 9, wherein the feature extraction network comprises a base feature extraction network, a first feature extraction network, and a second feature extraction network; the target face video frames in the video frame sequence are arranged according to the arrangement sequence of the target face video frames in the face video to be identified, and no target face video frames in other video frame sequences exist between any adjacent target face video frames in the video frame sequence;
said determining, by said feature extraction network, a target feature for each of said video frame sequences based on said plurality of video frame sequences, comprising:
constructing a plurality of auxiliary video frame sequences according to the plurality of video frame sequences, wherein different target face video frames in the auxiliary video frame sequences belong to different video frame sequences;
extracting the characteristics of each video frame sequence through the basic characteristic extraction network to serve as initial characteristics corresponding to each video frame sequence;
Extracting the characteristics of each auxiliary video frame sequence through the basic characteristic extraction network to serve as initial characteristics corresponding to each auxiliary video frame sequence;
performing motion coding on initial features corresponding to each video frame sequence by the first feature extraction network to obtain motion coding features corresponding to each video frame sequence;
performing expression coding by the second feature extraction network based on motion coding features corresponding to the video frame sequences to obtain intermediate features corresponding to each video frame sequence;
performing expression coding by the second feature extraction network based on initial features corresponding to each auxiliary video frame sequence to obtain expression coding features corresponding to each auxiliary video frame sequence;
performing motion coding on the expression coding features corresponding to the auxiliary video frame sequences by the first feature extraction network to obtain intermediate features corresponding to each auxiliary video frame sequence;
and determining the target characteristic of each video frame sequence according to the intermediate characteristics corresponding to the video frame sequences and the intermediate characteristics corresponding to the auxiliary video frame sequences.
11. The method of claim 10, wherein the determining the target feature for each of the video frame sequences based on the intermediate features corresponding to the plurality of video frame sequences and the intermediate features corresponding to the plurality of auxiliary video frame sequences comprises:
Splicing the intermediate features corresponding to the video frame sequences and the intermediate features corresponding to the auxiliary video frame sequences to obtain an intermediate splicing result corresponding to each video frame sequence;
acquiring a global abstract of the face video to be identified;
and splicing the global abstract with the intermediate splicing result corresponding to each video frame sequence respectively to obtain the target feature corresponding to each video frame sequence.
12. An expression recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring the face video to be identified;
the recognition module is used for inputting the face video to be recognized into an expression recognition model to obtain the probability that the face video to be recognized belongs to different expression categories; the training process of the expression recognition model comprises the following steps: determining a plurality of sample video frame sequences according to the sample face video; determining the probability that each sample video frame sequence belongs to different expression categories through an initial model; determining a first expression category to which each sample video frame sequence belongs and a second expression category to which the sample face video belongs according to the probability that each sample video frame sequence belongs to different expression categories; determining a consistency index corresponding to the sample face video according to the duty ratio of each first expression category in the first expression categories to which the plurality of sample video frame sequences belong; determining a loss value of the sample face video according to the second expression category to which the sample face video belongs; determining an adjustment parameter according to the loss value and the consistency index; adjusting the loss value according to the adjustment parameter to obtain a target loss value corresponding to the sample face video; adjusting parameters of the initial model according to the target loss value to obtain the expression recognition model; the loss value is used for indicating the accuracy of the initial model in carrying out expression recognition on the sample face video; the consistency index is used for indicating the consistency degree of the expression category identified by the initial model aiming at different sample video frame sequences in the sample face video;
The determining module is used for determining the expression category with the highest probability based on the probability that the face video to be recognized belongs to different expression categories, and the expression category is used as the expression recognition result of the face video to be recognized.
13. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-11.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for performing the method according to any one of claims 1-11.
CN202311090087.1A 2023-08-28 2023-08-28 Expression recognition method and device, electronic equipment and storage medium Active CN116824677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311090087.1A CN116824677B (en) 2023-08-28 2023-08-28 Expression recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311090087.1A CN116824677B (en) 2023-08-28 2023-08-28 Expression recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116824677A CN116824677A (en) 2023-09-29
CN116824677B true CN116824677B (en) 2023-12-12

Family

ID=88120656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311090087.1A Active CN116824677B (en) 2023-08-28 2023-08-28 Expression recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116824677B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487989A (en) * 2020-12-01 2021-03-12 重庆邮电大学 Video expression recognition method based on capsule-long-and-short-term memory neural network
CN113111700A (en) * 2021-02-24 2021-07-13 浙江大华技术股份有限公司 Training method of image generation model, electronic device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9147131B2 (en) * 2012-07-30 2015-09-29 Evernote Corporation Extracting multiple facial photos from a video clip

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487989A (en) * 2020-12-01 2021-03-12 重庆邮电大学 Video expression recognition method based on capsule-long-and-short-term memory neural network
CN113111700A (en) * 2021-02-24 2021-07-13 浙江大华技术股份有限公司 Training method of image generation model, electronic device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic facial expression analysis: A survey; Fasel B et al.; Pattern Recognition; Vol. 36, No. 1; pp. 259-275 *
Video sequence expression recognition based on a hierarchical attention model; Wang Xiaohua et al.; Journal of Computer-Aided Design & Computer Graphics; No. 1; pp. 20-26 *

Also Published As

Publication number Publication date
CN116824677A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN111563502B (en) Image text recognition method and device, electronic equipment and computer storage medium
WO2022001623A1 (en) Image processing method and apparatus based on artificial intelligence, and device and storage medium
CN112949786A (en) Data classification identification method, device, equipment and readable storage medium
CN109034069B (en) Method and apparatus for generating information
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN116824278B (en) Image content analysis method, device, equipment and medium
CN114648638A (en) Training method of semantic segmentation model, semantic segmentation method and device
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN111078940B (en) Image processing method, device, computer storage medium and electronic equipment
CN115131698B (en) Video attribute determining method, device, equipment and storage medium
CN114021582B (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
CN114896067A (en) Automatic generation method and device of task request information, computer equipment and medium
CN114387061A (en) Product pushing method and device, electronic equipment and readable storage medium
CN115222443A (en) Client group division method, device, equipment and storage medium
CN114706985A (en) Text classification method and device, electronic equipment and storage medium
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN114783597B (en) Method and device for diagnosing multi-class diseases, electronic equipment and storage medium
CN116824677B (en) Expression recognition method and device, electronic equipment and storage medium
CN111753618A (en) Image recognition method and device, computer equipment and computer readable storage medium
CN115457329A (en) Training method of image classification model, image classification method and device
CN112312205B (en) Video processing method and device, electronic equipment and computer storage medium
CN113011320A (en) Video processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant