CN113239799A - Training method, recognition method, device, electronic equipment and readable storage medium - Google Patents

Training method, recognition method, device, electronic equipment and readable storage medium

Info

Publication number
CN113239799A
Authority
CN
China
Prior art keywords
branch
emotion
gesture
model
training
Legal status
Pending
Application number
CN202110516757.6A
Other languages
Chinese (zh)
Inventor
陶大程
翟英杰
Current Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202110516757.6A
Publication of CN113239799A

Classifications

    • G06V 40/25: Recognition of walking or running movements, e.g. gait recognition
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The disclosure provides a training method, a recognition method, a device, electronic equipment and a readable storage medium, and relates to the technical field of emotion recognition. The method for training the emotion recognition model comprises the following steps: acquiring a posture branch sample and an action branch sample; training a posture branch model according to the posture branch sample to obtain first prediction information and emotion regression limit information; training an action branch model according to the action branch sample to obtain second prediction information; determining a first branch loss function according to the posture branch sample, the first prediction information and the emotion regression limit information so as to optimize the posture branch model according to the first branch loss function; determining a second branch loss function according to the action branch sample and the second prediction information so as to optimize the action branch model according to the second branch loss function; and generating an emotion recognition model according to the optimized posture branch model and the optimized action branch model. By the technical scheme, the robustness of the emotion recognition model obtained by training is improved.

Description

Training method, recognition method, device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of emotion recognition technologies, and in particular, to a method for training an emotion recognition model, a method for target emotion recognition, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Recognizing the emotion type of a human from information such as voice, facial expression and gait endows machines with the ability to perceive and understand human emotional states.
In the related art, gait-based emotion type recognition is implemented with a deep learning method: joint point position information is extracted from a human gait video, the joint point positions of each frame are fed directly into a deep neural network for deep feature learning, and emotion features are then concatenated with the deep features for subsequent emotion prediction. Because the neural network uses only the position information of the joint points, the source of the input information is insufficient and the robustness is low.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a training method for an emotion recognition model, a target emotion recognition method, a target emotion recognition apparatus, an electronic device and a computer-readable storage medium, so as to overcome, at least to some extent, the problems of low robustness and low reliability of gait-based emotion recognition in various scenarios in the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to one aspect of the disclosure, a method for training an emotion recognition model is provided, which includes: acquiring a posture branch sample and an action branch sample based on a video sequence; training a posture branch model according to the posture branch sample to obtain first prediction information and emotion regression limit information; training an action branch model according to the action branch sample to obtain second prediction information; determining a first branch loss function according to the posture branch sample, the first prediction information and the emotion regression limit information, so as to optimize the posture branch model according to the first branch loss function; determining a second branch loss function according to the action branch sample and the second prediction information, so as to optimize the action branch model according to the second branch loss function; and generating the emotion recognition model according to the optimized posture branch model and the optimized action branch model.
In an embodiment of the present disclosure, the gesture branch samples include a one-hot code representing an emotion category and a multidimensional feature representing a priori emotional feature, and determining a first branch loss function according to the gesture branch samples, the first prediction information, and the emotion regression limit information specifically includes: constructing a first sub-loss function based on cross entropy according to the one-hot coding and the first prediction information; constructing a second sub-loss function based on mean square error according to the emotion regression limit information and the multi-dimensional features; determining the first branch loss function according to the first sub-loss function and the second sub-loss function.
In one embodiment of the present disclosure, further comprising: determining angle characteristics according to an included angle formed by three adjacent joint points in the human skeleton joint points; determining a distance feature from a Euclidean distance between two non-adjacent ones of the human skeletal joint points; determining an area characteristic according to the area of a triangle formed by the three adjacent joint points; and constructing the multidimensional feature representing the prior emotional feature according to the angle feature, the distance feature and the area feature.
In an embodiment of the present disclosure, the determining the second branch loss function according to the action branch sample and the second prediction information includes: constructing the second branch loss function based on cross entropy according to the one-hot coding and the second prediction information.
In an embodiment of the present disclosure, the posture branch sample includes a first step state sequence representing a posture, and the training of the posture branch model according to the posture branch sample to obtain the first prediction information and the emotion regression limit information specifically includes: inputting the first step state sequence into a first hybrid spatio-temporal graph convolutional network, and outputting a first two-dimensional feature; performing global average pooling on the first two-dimensional feature, and outputting a first depth feature; inputting the first depth feature into a first fully-connected layer, and outputting the emotion regression limit information; inputting the first depth feature into a second fully-connected layer, and outputting a first prediction vector; and normalizing the first prediction vector to obtain the first prediction information.
In an embodiment of the present disclosure, the motion branch sample includes a second step state sequence representing a motion, and the training of the motion branch model according to the motion branch sample to obtain the second prediction information specifically includes: inputting the second step state sequence into a second hybrid spatio-temporal graph convolutional network, and outputting a second two-dimensional feature; performing global average pooling on the second two-dimensional feature, and outputting a second depth feature; inputting the second depth feature into a third fully-connected layer, and outputting a second prediction vector; and normalizing the second prediction vector to obtain the second prediction information.
In an embodiment of the present disclosure, the obtaining a gesture branch sample and an action branch sample based on a video sequence specifically includes: extracting joint point coordinates of a target from a video sequence; constructing a first step state sequence representing a gesture and a second step state sequence representing a motion according to the joint point coordinates; generating the posture branch sample based on the first step sequence, the one-hot code representing the emotion category and the multi-dimensional feature representing the prior emotion feature; generating the action branch sample based on the second step sequence and the one-hot encoding representing the emotion category.
According to another aspect of the present disclosure, there is provided a target emotion recognition method, including: extracting joint point coordinates of a target from a video sequence; constructing a first step state sequence representing a gesture and a second step state sequence representing a motion according to the joint point coordinates; inputting the first step state sequence into a posture branch model and outputting a first emotion score; inputting the second step state sequence into an action branch model, and outputting a second emotion score; and determining the emotion recognition result of the target according to the first emotion score and the second emotion score.
In an embodiment of the present disclosure, the extracting joint coordinates of the target in the video sequence specifically includes: extracting human body joint points of the target based on a joint point extraction network; and obtaining the joint point coordinates of the target according to the dimension attribute of each joint point, the time sequence length of the human body joint points and the number of the human body joint points included in each frame in the video sequence.
In an embodiment of the present disclosure, the constructing a first step state sequence representing a gesture according to the joint point coordinates specifically includes: determining a first dimension corresponding to the gesture; extracting the spatial position coordinates of the human body joint points in the dimension attributes according to the first dimension; constructing the first step sequence based on the spatial position coordinates, the time sequence length and the number of the human body joint points.
In an embodiment of the present disclosure, the constructing a second step state sequence representing a motion according to the joint point coordinates specifically includes: determining a second dimension corresponding to the motion; extracting a multi-dimensional motion vector of the human body joint points in the dimension attributes according to the second dimension; and constructing the second step state sequence based on the multi-dimensional motion vector, the time sequence length and the number of the human body joint points.
In an embodiment of the present disclosure, the determining an emotion recognition result of the target according to the first emotion score and the second emotion score specifically includes: calculating a mean of the first emotion score and the second emotion score; and determining the emotion type corresponding to the mean value as the emotion recognition result.
According to another aspect of the present disclosure, there is provided an emotion recognition model training apparatus, including: the acquisition module is used for acquiring a posture branch sample and an action branch sample based on the video sequence; the first training module is used for training the gesture branch model according to the gesture branch sample and obtaining first prediction information and emotion regression limit information; the second training module is used for training the action branch model according to the action branch sample and obtaining second prediction information; a first optimization module to determine a first branch loss function based on the postural branch samples, the first prediction information, and the emotional regression constraint information to optimize the postural branch model based on the first branch loss function; a second optimization module, configured to determine a second branch loss function according to the action branch sample and the second prediction information, so as to optimize the action branch model according to the second branch loss function; and the generating module is used for generating the emotion recognition model according to the optimized gesture branch model and the optimized action branch model.
According to another aspect of the present disclosure, there is provided a target emotion recognition apparatus including: the extraction module is used for extracting the joint point coordinates of the target in the video sequence; the construction module is used for constructing a first step state sequence representing the gesture and a second step state sequence representing the motion according to the joint point coordinates; the first processing module is used for inputting the first step state sequence into a posture branch model and outputting a first emotion score; the second processing module is used for inputting the second step state sequence into an action branch model and outputting a second emotion score; and the determining module is used for determining the emotion recognition result of the target according to the first emotion score and the second emotion score.
According to still another aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to execute any one of the above methods for training the emotion recognition model and/or the target emotion recognition method via executing the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the method for training an emotion recognition model and/or the method for target emotion recognition described in any of the above.
According to the training scheme of the target emotion model, the posture related information and the action related information are respectively extracted from the same data source to construct the posture branch sample and the action branch sample, the posture branch model is trained based on the posture branch sample, and the action branch model is trained based on the action branch sample, so that various emotion-related factors are more fully considered in the trained model, and the robustness of the emotion recognition model obtained by training is improved.
Further, a loss function is constructed based on the branch samples and the output information of the training model, so that the trained model is optimized based on the loss function, and the identification precision of the model is further improved.
Furthermore, emotion regression restriction information is introduced into the loss function of the posture branch model, so that emotion information is distilled into the model, the model learns robust distinguishing characteristics, and the reliability of emotion recognition results output by the model can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a schematic diagram illustrating the structure of a target emotion recognition system in an embodiment of the present disclosure;
FIG. 2 is a flow chart of a target emotion recognition method in an embodiment of the present disclosure;
FIG. 3 is a flow chart of another target emotion recognition method in the disclosed embodiment;
FIG. 4 is a schematic diagram illustrating the numbering of human skeletal joint points in an embodiment of the present disclosure;
FIG. 5 is a flow chart of a further target emotion recognition method in the disclosed embodiment;
FIG. 6 is a flow chart illustrating a process of an emotion recognition model in an embodiment of the present disclosure;
FIG. 7 is a flow chart of a further target emotion recognition method in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a human skeletal joint sequence for emotion recognition in an embodiment of the present disclosure;
FIG. 9 is a flow chart of a further target emotion recognition method in an embodiment of the present disclosure;
FIG. 10 is a flow chart of a further target emotion recognition method in an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a target emotion recognition apparatus in an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of another target emotion recognition apparatus in the disclosed embodiment;
fig. 13 shows a schematic diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
For ease of understanding, the following first explains several terms referred to in this application.
Cross entropy loss: cross entropy describes the distance between two probability distributions; a smaller distance indicates that the two distributions are closer, and a larger distance indicates that they differ more.
Mean square error loss: the mean square error is the average of the squared differences between the n outputs of the n samples in a batch and the desired outputs.
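As a small numerical illustration of the two losses (not part of the disclosure; all values are arbitrary examples):

```python
import torch

# Cross entropy between a one-hot target and a predicted distribution:
# only the probability assigned to the true class contributes.
target = torch.tensor([[0.0, 1.0, 0.0]])        # one-hot label, class 1
predicted = torch.tensor([[0.2, 0.7, 0.1]])     # predicted probabilities
ce = -(target * predicted.log()).sum(dim=1)     # -log(0.7) ≈ 0.357

# Mean square error between the outputs and the desired outputs.
output = torch.tensor([1.0, 2.0, 3.5])
desired = torch.tensor([1.0, 2.5, 3.0])
mse = ((output - desired) ** 2).mean()          # (0 + 0.25 + 0.25) / 3 ≈ 0.167

print(ce.item(), mse.item())
```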
The scheme provided by the embodiment of the application relates to technologies such as databases, and is specifically explained by the following embodiment.
Fig. 1 shows a schematic structural diagram of a target emotion recognition system in an embodiment of the present disclosure, which includes a plurality of terminals 120 and a server cluster 140.
The terminal 120 may be a mobile terminal such as a mobile phone, a game console, a tablet Computer, an e-book reader, smart glasses, an MP4(Moving Picture Experts Group Audio Layer IV) player, an intelligent home device, an AR (Augmented Reality) device, a VR (Virtual Reality) device, or a Personal Computer (PC), such as a laptop Computer and a desktop Computer.
Among them, the terminal 120 may have an application program installed therein for providing target emotion recognition.
The terminals 120 are connected to the server cluster 140 through a communication network. Optionally, the communication network is a wired network or a wireless network.
The server cluster 140 is a server, or is composed of a plurality of servers, or is a virtualization platform, or is a cloud computing service center. The server cluster 140 is used to provide background services for applications that provide targeted emotion recognition. Optionally, the server cluster 140 undertakes primary computational work and the terminal 120 undertakes secondary computational work; alternatively, the server cluster 140 undertakes secondary computing work and the terminal 120 undertakes primary computing work; alternatively, the terminal 120 and the server cluster 140 perform cooperative computing by using a distributed computing architecture.
Alternatively, the clients of the applications installed in different terminals 120 are the same, or the clients of the applications installed on two terminals 120 are clients of the same type of application of different control system platforms. Based on different terminal platforms, the specific form of the client of the application program may also be different, for example, the client of the application program may be a mobile phone client, a PC client, or a World Wide Web (Web) client.
Those skilled in the art will appreciate that the number of terminals 120 described above may be greater or fewer. For example, the number of the terminals may be only one, or several tens or hundreds of the terminals, or more. The number of terminals and the type of the device are not limited in the embodiments of the present application.
Optionally, the system may further include a management device (not shown in fig. 1), and the management device is connected to the server cluster 140 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The Network is typically the Internet, but may be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Mark-up Language (HTML), Extensible markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
Hereinafter, each step of the target emotion recognition method in the present exemplary embodiment will be described in more detail with reference to the drawings and examples.
FIG. 2 shows a flowchart of a target emotion recognition method in an embodiment of the present disclosure. The method provided by the embodiment of the present disclosure may be performed by any electronic device with computing processing capability, for example, the terminal 120 and/or the server cluster 140 in fig. 1. In the following description, the terminal 120 is taken as an execution subject for illustration.
As shown in fig. 2, the terminal 120 performs the method for training the emotion recognition model, which includes the following steps:
in step S202, gesture branch samples and motion branch samples are obtained based on the video sequence.
The method comprises the steps of extracting posture related information and action related information of the same video sequence to obtain a posture branch sample and an action branch sample, and performing joint training of an emotion recognition model based on the posture branch sample and the action branch sample.
Step S204, a posture branch model is trained according to the posture branch sample, and first prediction information and emotion regression limiting information are obtained.
The emotion regression limiting information is generated based on prior expert knowledge; by obtaining it, the expert knowledge can be transmitted into the network, so that the model learns robust features with stronger discriminative power and the emotion recognition effect is improved.
That is, the posture branch model further includes two sub-branches: an emotion category prediction branch that outputs the first prediction information, and an emotion regression restriction branch that outputs the emotion regression restriction information.
And step S206, training an action branch model according to the action branch sample, and obtaining second prediction information.
The gesture branch model and the action branch model are respectively trained, so that the training model of the emotion recognition model is generated based on the gesture branch model and the action branch model, more factors related to emotion recognition can be introduced into the model, and the robustness of the model is further improved.
Step S208, determining a first branch loss function according to the posture branch sample, the first prediction information and the emotion regression limiting information so as to optimize the posture branch model according to the first branch loss function.
The reliability and effectiveness of model optimization are guaranteed by constructing a first branch loss function based on the branch samples, the output first prediction information and the emotion regression limit information.
Step S210, determining a second branch loss function according to the action branch sample and the second prediction information, so as to optimize the action branch model according to the second branch loss function.
And constructing a second branch loss function based on the action branch sample and the second prediction information, so that the reliability and effectiveness of model optimization are ensured.
And step S212, generating an emotion recognition model according to the optimized gesture branch model and the optimized action branch model.
The emotion recognition model is constructed through the posture branch model and the action branch model, so that the obtained emotion recognition model can take account of the influence of the posture and the action on human emotion recognition.
In the embodiment, the posture related information and the action related information are respectively extracted from the same data source to construct the posture branch sample and the action branch sample, the posture branch model is trained based on the posture branch sample, and the action branch model is trained based on the action branch sample, so that the trained model can take various emotion-related factors into consideration more fully, and the robustness of the trained emotion recognition model can be improved.
Further, a loss function is constructed based on the branch samples and the output information of the training model, so that the trained model is optimized based on the loss function, and the identification precision of the model is further improved.
Furthermore, emotion regression restriction information is introduced into the loss function of the posture branch model, so that emotion information is distilled into the model, the model learns robust distinguishing characteristics, and the reliability of emotion recognition results output by the model can be improved.
As shown in fig. 3, in an embodiment of the present disclosure, the gesture branch samples include a one-hot code representing an emotion category and a multidimensional feature representing a priori emotional feature, and in step S208, a specific implementation manner of determining the first branch loss function according to the gesture branch samples, the first prediction information, and the emotion regression restriction information specifically includes:
step S302, a first sub-loss function based on cross entropy is constructed according to the one-hot coding and the first prediction information.
And generating an emotion label based on the emotion type, and representing the emotion label by adopting one-hot coding.
One-Hot coding, i.e., One-Hot coding, is also known as One-bit efficient coding. The method uses an N-bit status register to encode N states, each state having an independent register bit and only one of which is active at any time.
Specifically, let $z_1 \in \mathbb{R}^{C_n}$ denote the output of the last fully-connected layer of the emotion category prediction branch under the posture branch, where $C_n$ is the number of emotion categories. This output is normalized by a softmax function, as shown in formula (1):

$\hat{y}_{1,j} = \dfrac{\exp(z_{1,j})}{\sum_{k=1}^{C_n}\exp(z_{1,k})}, \quad j = 1,\dots,C_n \qquad (1)$

where $\hat{y}_1$ is the first prediction information. The emotion category label prediction is then optimized based on cross entropy loss. Let $y_1$ be the one-hot encoding of the emotion label of the sample; the first sub-loss function is determined from the first prediction information and the one-hot encoding, as shown in formula (2):

$L_{cls} = -\sum_{j=1}^{C_n} y_{1,j}\,\log \hat{y}_{1,j} \qquad (2)$
And step S304, constructing a second sub-loss function based on mean square error according to the emotion regression restriction information and the multi-dimensional features.
Let $\hat{r} \in \mathbb{R}^{C_f}$ denote the emotion regression restriction information and $y_2 \in \mathbb{R}^{C_f}$ denote the multi-dimensional feature representing the prior emotional feature, where $C_f = 31$ is the dimension of the prior emotional feature. The emotion regression restriction branch that outputs the emotion regression restriction information is optimized with a mean square error loss, and the second sub-loss function is shown as formula (3):

$L_{reg} = \dfrac{1}{C_f}\sum_{j=1}^{C_f}\left(\hat{r}_j - y_{2,j}\right)^2 \qquad (3)$

This sub-loss distills the prior emotion knowledge into the network, which encourages the posture branch to learn discriminative features related to emotional expression and improves the robustness of the model for emotion recognition.
Step S306, determining a first branch loss function according to the first sub-loss function and the second sub-loss function.
The first branch loss function of the posture branch is shown in formula (4):

$L_{pose} = L_{cls} + \alpha\, L_{reg} \qquad (4)$

where $\alpha$ is used to control the ratio of the two losses and is set to 0.5 in the present method.
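The posture-branch loss of formulas (1)-(4) can be sketched in PyTorch as below; the function and tensor names are illustrative assumptions, not names used in the disclosure.

```python
import torch
import torch.nn.functional as F

def posture_branch_loss(z1, r_hat, y1_onehot, y2_prior, alpha=0.5):
    """z1:        (B, C_n) output of the emotion category prediction head,
    r_hat:     (B, 31)  output of the emotion regression restriction head,
    y1_onehot: (B, C_n) one-hot emotion labels,
    y2_prior:  (B, 31)  prior emotional features."""
    y1_hat = F.softmax(z1, dim=1)                                      # formula (1)
    l_cls = -(y1_onehot * torch.log(y1_hat + 1e-12)).sum(1).mean()     # formula (2)
    l_reg = F.mse_loss(r_hat, y2_prior)                                # formula (3)
    return l_cls + alpha * l_reg                                       # formula (4)
```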
In one embodiment of the present disclosure, further comprising: determining angle characteristics according to an included angle formed by three adjacent joint points in the human skeleton joint points; determining distance characteristics according to Euclidean distances between two non-adjacent joint points in human skeleton joint points; determining an area characteristic according to the area of a triangle formed by three adjacent joint points; and constructing a multi-dimensional feature representing the prior emotional feature according to the angle feature, the distance feature and the area feature.
In this embodiment, the multi-dimensional feature is constructed from prior emotional features. The prior emotional features describe the spatial relationship of the human skeletal joint points and are closely related to emotional expression; distilling them into the network encourages the model to extract more robust and discriminative depth features.
Specifically, the prior emotional features include three features, namely angles, distances and areas, the human skeletal joint sequences are labeled from 0 to 15, and the labeling result is shown in fig. 4.
For the angle features, the angles formed by three adjacent skeletal joint points as shown in fig. 4 are calculated; for the distance features, the Euclidean distances between two non-adjacent joint points are calculated; and for the area features, the areas of triangles formed by three joint points are calculated. The specific definition of each emotional feature used in the present disclosure is shown in Table 1 below; the 31-dimensional feature in total includes 14 angle features, 9 distance features, and 8 area features.
TABLE 1 (definitions of the 14 angle features, 9 distance features and 8 area features; the table is not reproduced here)
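A sketch of assembling the 31-dimensional prior emotional feature from the joint coordinates is given below. Since Table 1 is not reproduced, the particular joint triples and pairs are passed in as placeholder index lists; only the angle, distance and area computations follow the description above.

```python
import numpy as np

def angle_feature(a, b, c):
    """Angle at joint b formed by three adjacent joints a-b-c (radians)."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def distance_feature(a, b):
    """Euclidean distance between two non-adjacent joints."""
    return np.linalg.norm(a - b)

def area_feature(a, b, c):
    """Area of the triangle spanned by three joints."""
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

def prior_emotional_feature(joints, angle_triples, distance_pairs, area_triples):
    """joints: (16, 3) array of 3D joint positions (numbered 0-15 as in fig. 4).
    The index lists select the 14 angle, 9 distance and 8 area features of Table 1."""
    feats  = [angle_feature(joints[i], joints[j], joints[k]) for i, j, k in angle_triples]
    feats += [distance_feature(joints[i], joints[j]) for i, j in distance_pairs]
    feats += [area_feature(joints[i], joints[j], joints[k]) for i, j, k in area_triples]
    return np.asarray(feats)          # 14 + 9 + 8 = 31-dimensional feature
```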
As shown in fig. 5, in an embodiment of the present disclosure, the action branch samples include a one-hot code representing an emotion category, and in step S210, an implementation manner of determining the second branch loss function according to the action branch samples and the second prediction information specifically includes:
step S502, a second branch loss function based on cross entropy is constructed according to the one-hot coding and the second prediction information.
Specifically, let $z_2 \in \mathbb{R}^{C_n}$ denote the output of the fully-connected layer of the motion branch network. It is normalized by a softmax function, as shown in formula (5):

$\hat{y}_{2,j} = \dfrac{\exp(z_{2,j})}{\sum_{k=1}^{C_n}\exp(z_{2,k})}, \quad j = 1,\dots,C_n \qquad (5)$

where $\hat{y}_2$ is the second prediction information. Similarly, the motion branch model is optimized using a cross entropy loss function, and the second branch loss function is shown in formula (6):

$L_{motion} = -\sum_{j=1}^{C_n} y_{1,j}\,\log \hat{y}_{2,j} \qquad (6)$

Overall, the posture branch and the motion branch, which represent two different views of emotion recognition, are optimized simultaneously, and the overall loss function is shown in formula (7):

$L = L_{pose} + \lambda\, L_{motion} \qquad (7)$

where $\lambda$ is used to control the weight of the two branches and can be set to 0.5 in this disclosure.
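Putting formulas (5)-(7) together with the posture-branch loss, a joint optimization step could look like the sketch below. It assumes the posture branch model returns both the emotion logits and the regression restriction output (see the model sketch after the description of fig. 6 below); all object names are placeholders.

```python
import torch.nn.functional as F

def joint_train_step(pose_model, motion_model, optimizer,
                     pose_seq, motion_seq, y1_onehot, y2_prior,
                     alpha=0.5, lam=0.5):
    """pose_seq: (B, 3, T, N) gait tensor, motion_seq: (B, 8, T, N) gait tensor."""
    z1, r_hat = pose_model(pose_seq)          # emotion logits + regression restriction output
    z2 = motion_model(motion_seq)             # motion-branch emotion logits

    l_pose = -(y1_onehot * F.log_softmax(z1, dim=1)).sum(1).mean() \
             + alpha * F.mse_loss(r_hat, y2_prior)                    # formulas (1)-(4)
    l_motion = -(y1_onehot * F.log_softmax(z2, dim=1)).sum(1).mean()  # formulas (5)-(6)
    loss = l_pose + lam * l_motion                                    # formula (7)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```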
As shown in fig. 6, in an embodiment of the present disclosure, the gesture branch samples include a first step sequence representing a gesture, the action branch samples include a second step sequence representing a motion, the first step sequence 604 and the second step sequence 620 are generated based on the video sequence samples 602, the gesture branch model is trained according to the gesture branch samples, and the first prediction information and the emotion regression restriction information are obtained, which specifically includes: inputting the first step sequence 604 into a first hybrid space-time graph convolutional network 606, and outputting a first two-dimensional feature; performing global average pooling on the first two-dimensional feature by a first global average pooling processing module 608, and outputting a first depth feature; inputting the first depth feature into the first fully-connected layer 610, and outputting emotion regression restriction information 612; inputting the first depth feature into the second fully-connected layer 614, and outputting a first prediction vector; the first prediction vector is normalized by the first normalization module 616 to obtain the first prediction information 618.
As shown in fig. 6, in an embodiment of the present disclosure, the motion branch sample includes a second step sequence representing a motion, the training of the motion branch model according to the motion branch sample, and obtaining second prediction information specifically includes: inputting the second step sequence 620 into a second hybrid space-time graph convolutional network 622, and outputting a second two-dimensional feature; the second two-dimensional feature is subjected to global average pooling processing by a second global average pooling processing module 624, and a second depth feature is output; inputting the second depth feature into the third fully-connected layer 626 and outputting a second prediction vector; the second prediction vector is normalized by the second normalization module 628 to obtain the second prediction information 630.
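The two branch networks of fig. 6 can be sketched as follows. The hybrid spatio-temporal graph convolutional backbone is passed in as an opaque module and is assumed to output a (B, feat_dim, T', N') feature map; feat_dim and num_classes are illustrative values, while prior_dim = 31 follows the disclosure.

```python
import torch.nn as nn

class PostureBranch(nn.Module):
    """First gait sequence (B, 3, T, N) -> (emotion logits, regression restriction output)."""
    def __init__(self, backbone, feat_dim=256, num_classes=4, prior_dim=31):
        super().__init__()
        self.backbone = backbone                        # hybrid spatio-temporal graph conv network (606)
        self.gap = nn.AdaptiveAvgPool2d(1)              # global average pooling (608)
        self.fc_reg = nn.Linear(feat_dim, prior_dim)    # first fully-connected layer (610)
        self.fc_cls = nn.Linear(feat_dim, num_classes)  # second fully-connected layer (614)

    def forward(self, x):
        feat = self.gap(self.backbone(x)).flatten(1)    # first depth feature
        return self.fc_cls(feat), self.fc_reg(feat)

class MotionBranch(nn.Module):
    """Second gait sequence (B, 8, T, N) -> emotion logits."""
    def __init__(self, backbone, feat_dim=256, num_classes=4):
        super().__init__()
        self.backbone = backbone                        # hybrid spatio-temporal graph conv network (622)
        self.gap = nn.AdaptiveAvgPool2d(1)              # global average pooling (624)
        self.fc_cls = nn.Linear(feat_dim, num_classes)  # third fully-connected layer (626)

    def forward(self, x):
        return self.fc_cls(self.gap(self.backbone(x)).flatten(1))
```

The softmax normalization of modules 616 and 628 is applied outside these classes, as in the loss and inference sketches.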
In an embodiment of the present disclosure, obtaining a gesture branch sample and an action branch sample based on a video sequence specifically includes: extracting joint point coordinates of a target from a video sequence; constructing a first step state sequence representing the gesture and a second step state sequence representing the motion according to the joint point coordinates; generating a posture branch sample based on the first step sequence, the single-hot code representing the emotion types and the multi-dimensional feature representing the prior emotion feature; and generating action branch samples based on the second step state sequence and the one-hot codes representing the emotion types.
Specifically, let the posture branch sample be $(G_1, y_1, y_2)$, where $G_1 = (V_1, E)$ is the gait sequence (first step state sequence) input to the posture branch, $y_1 \in \{0,1\}^{C_n}$ is the one-hot encoding of the emotion label of this input sample ($y_{1,j} = 1$ indicates that the label is $j$, and $C_n$ is the number of prediction classes), and $y_2 \in \mathbb{R}^{C_f}$ represents the prior emotional feature, where $C_f = 31$ is the dimension of the prior emotional feature. Each node $v \in V_1$ is represented by its 3D spatial position coordinates $(c_x, c_y, c_z)$.
Let the motion branch input sample be $(G_2, y_1)$, where $G_2 = (V_2, E)$ is the gait sequence (second step state sequence) input to the motion branch, and each node $v \in V_2$ is represented by the 8-dimensional motion vector $m$ mentioned above. In this way, emotion-related clues are extracted from the motion information, and the robustness of the model is further improved by increasing the sources of information and the associated network modeling.
The emotion recognition model obtained with the above-mentioned training method is applied to emotion recognition of a target. As shown in fig. 7, the terminal 120 executes the target emotion recognition method, which includes the following steps:
step S702, the joint coordinates of the target are extracted from the video sequence.
Step S704, a first step sequence representing a posture and a second step sequence representing a motion are constructed from the joint coordinates.
Step S706, inputting the first step sequence into the posture branch model, and outputting a first emotion score.
Step S708, the second step sequence is input to the action branch model, and the second emotion score is output.
And step S710, determining the emotion recognition result of the target according to the first emotion score and the second emotion score.
In this embodiment, the joint point coordinates of the target are extracted from the video sequence, a first step state sequence representing the posture and a second step state sequence representing the motion are respectively constructed based on the joint point coordinates, emotion recognition based on the posture of the target is performed on the basis of the first step state sequence, emotion recognition based on the motion is performed on the basis of the second step state sequence, and the emotion recognition result of the target in the video sequence is obtained from the recognition results of the two branches. By comprehensively considering the factors that influence emotion in gait, the accuracy of emotion recognition is improved.
In an embodiment of the present disclosure, extracting joint coordinates of a target in a video sequence specifically includes: extracting a human body joint point of a target based on a joint point extraction network; and obtaining the coordinates of the joint points of the target according to the dimension attribute of each joint point, the time sequence length of the human body joint points and the number of the human body joint points in each frame in the video sequence.
As shown in fig. 8, the video sequence is converted into a time-based human joint point sequence through the joint point extraction network, and this sequence can be represented as an array of joint point coordinates of size C × T × N, where C is the attribute dimension of each joint point, T = 48 is the length of the time sequence, and N = 16 is the number of human joint points in a single frame.
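As a small sketch (names and the random placeholder data are assumptions), the extracted joint positions can be arranged into the C × T × N layout as follows:

```python
import numpy as np

T, N = 48, 16                              # time-sequence length and joints per frame
# T frames, each an (N, 3) array of 3D joint coordinates from the joint point extractor
positions = np.random.randn(T, N, 3)       # placeholder for the extractor output

first_gait_sequence = positions.transpose(2, 0, 1)   # (C, T, N) with C = 3
assert first_gait_sequence.shape == (3, T, N)
```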
As shown in fig. 9, in an embodiment of the present disclosure, in step S704, an implementation manner of constructing a first step sequence representing a gesture according to joint coordinates specifically includes:
in step S902, a first dimension corresponding to the gesture is determined.
And step S904, extracting the space position coordinates of the human body joint points in the dimension attributes according to the first dimension.
Step S906, constructing the first step sequence based on the space position coordinates, the time sequence length and the number of the human body joint points.
In this embodiment, for the first step state sequence corresponding to the posture branch model, each node is represented by its 3D spatial position coordinates $(c_x, c_y, c_z)$, that is, C = 3, so that the first emotion score is obtained based on the first step state sequence.
As shown in fig. 10, in an embodiment of the present disclosure, in step S704, an implementation of constructing a second step state sequence representing a motion according to joint coordinates specifically includes:
step S1002, a second dimension corresponding to the action is determined.
Step S1004, extracting a multi-dimensional motion vector of the human body joint point in the dimension attribute according to the second dimension.
Step S1006, constructing the second step state sequence based on the multi-dimensional motion vector, the time sequence length and the number of the human body joint points.
In this embodiment, for the second step state sequence of the motion branch model, C = 8, and the information of each node is represented by an 8-dimensional motion vector $m = (v_x, v_y, v_z, |v|, a_x, a_y, a_z, |a|)$, which contains the components of the velocity $v$ of the node along the three spatial directions and its overall modulus, and the components of the acceleration $a$ along the three directions and its overall modulus; the velocity and the acceleration are calculated from the position differences and the velocity differences between frames, respectively. The relationship of the input nodes may be represented by a graph $G = (V, E)$, where $V = \{v_{ti} \mid t = 1,\dots,T;\ i = 1,\dots,N\}$ is the set of nodes, $E$ is the set of edges, and $v_{ti}$ denotes the i-th node of the t-th frame.
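A sketch of building the second gait sequence from the 3D position sequence: velocities come from frame-to-frame position differences and accelerations from velocity differences, giving each joint the 8-dimensional vector above. The function and variable names are illustrative.

```python
import numpy as np

def motion_gait_sequence(positions):
    """positions: (T, N, 3) array of 3D joint coordinates.
    Returns the second gait sequence of shape (8, T, N)."""
    v = np.zeros_like(positions)
    v[1:] = positions[1:] - positions[:-1]          # per-frame velocity (position difference)
    a = np.zeros_like(positions)
    a[1:] = v[1:] - v[:-1]                          # per-frame acceleration (velocity difference)

    v_norm = np.linalg.norm(v, axis=-1, keepdims=True)    # |v|
    a_norm = np.linalg.norm(a, axis=-1, keepdims=True)    # |a|
    m = np.concatenate([v, v_norm, a, a_norm], axis=-1)   # (T, N, 8) motion vectors
    return m.transpose(2, 0, 1)                           # (C, T, N) with C = 8
```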
In an embodiment of the present disclosure, determining an emotion recognition result of a target according to a first emotion score and a second emotion score specifically includes: calculating the mean of the first emotion score and the second emotion score; and determining the emotion type corresponding to the mean value, and determining the emotion type as an emotion recognition result.
Specifically, the emotion recognition scheme mainly includes the execution flow of two branches, namely the posture branch model and the motion branch model, as shown in fig. 6. For an original video sequence, an open-source joint point extraction network (e.g., OpenPose) is first used to extract the 3D position coordinates of the joint points. The first step state sequence of the posture network branch and the second step state sequence of the motion network branch are then constructed based on the extracted 3D position coordinates and fed into the subsequent hybrid spatio-temporal Graph Convolutional Network (STGCN) to extract features; the features are globally average-pooled, and the resulting depth features are passed through fully-connected layers to obtain the different outputs.
For the posture branch, the depth feature passes through two different fully-connected layers and is thus divided into two sub-branches: one sub-branch imposes the emotion regression restriction on its output using the prior emotional feature knowledge, and the other is used for emotion label prediction. For the motion branch, the depth feature is only used for emotion label prediction after the fully-connected layer. In the test stage, the final output of the network is the weighted average of the emotion prediction score of the posture branch and the emotion prediction score of the motion branch.
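An inference sketch tying the two branches together is given below; the joint extraction step and the trained branch models are assumed to be available, and the equal-weight average corresponds to taking the mean of the two emotion scores as described above. All names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize_emotion(pose_model, motion_model, pose_seq, motion_seq, emotion_names):
    """pose_seq: (1, 3, T, N) and motion_seq: (1, 8, T, N) tensors built from one video."""
    score1 = F.softmax(pose_model(pose_seq)[0], dim=1)    # first emotion score (posture branch)
    score2 = F.softmax(motion_model(motion_seq), dim=1)   # second emotion score (motion branch)
    fused = (score1 + score2) / 2                         # mean of the two scores
    return emotion_names[fused.argmax(dim=1).item()]
```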
It is to be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to an exemplary embodiment of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
The emotion recognition model training apparatus 1100 according to this embodiment of the present invention is described below with reference to fig. 11. The emotion recognition model training apparatus 1100 shown in fig. 11 is only an example, and should not bring any limitation to the function and the scope of use of the embodiment of the present invention.
The emotion recognition model training apparatus 1100 is represented in the form of a hardware module. The components of the emotion recognition model training apparatus 1100 may include, but are not limited to: an obtaining module 1102 for obtaining a gesture branch sample and an action branch sample based on the video sequence; a first training module 1104, configured to train the gesture branch model according to the gesture branch sample, and obtain first prediction information and emotion regression restriction information; a second training module 1106, configured to train the action branch model according to the action branch sample, and obtain second prediction information; a first optimization module 1108 for determining a first branch loss function based on the postural branch samples, the first prediction information, and the emotional regression constraint information, to optimize the postural branch model based on the first branch loss function; a second optimization module 1110, configured to determine a second branch loss function according to the action branch sample and the second prediction information, so as to optimize the action branch model according to the second branch loss function; a generating module 1112, configured to generate the emotion recognition model according to the optimized gesture branch model and the optimized action branch model.
The target emotion recognition apparatus 1200 according to this embodiment of the present invention is described below with reference to fig. 12. The target emotion recognition apparatus 1200 shown in fig. 12 is only an example, and should not bring any limitation to the function and the range of use of the embodiment of the present invention.
The target emotion recognition apparatus 1200 is represented in the form of a hardware module. The components of the target emotion recognition device 1200 may include, but are not limited to: an extraction module 1202, configured to extract joint coordinates of a target in a video sequence; a construction module 1204, configured to construct a first step sequence representing a gesture and a second step sequence representing a motion according to the joint coordinates; a first processing module 1206, configured to input the first step sequence into a gesture branch model, and output a first emotion score; the second processing module 1208, configured to input the second step sequence into an action branch model, and output a second emotion score; a determining module 1210, configured to determine an emotion recognition result of the target according to the first emotion score and the second emotion score.
An electronic device 1300 according to this embodiment of the invention is described below with reference to fig. 13. The electronic device 1300 shown in fig. 13 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 13, the electronic device 1300 is in the form of a general purpose computing device. The components of the electronic device 1300 may include, but are not limited to: the at least one processing unit 1310, the at least one memory unit 1320, and the bus 1330 connecting the various system components including the memory unit 1320 and the processing unit 1310.
Where the memory unit stores program code, the program code may be executed by the processing unit 1310 to cause the processing unit 1310 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of this specification. For example, the processing unit 1310 may perform steps S202, S204 and S206 to S212 as shown in fig. 2, and other steps defined in the target emotion recognition method of the present disclosure.
The storage 1320 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)13201 and/or a cache memory unit 13202, and may further include a read-only memory unit (ROM) 13203.
Storage unit 1320 may also include a program/utility 13204 having a set (at least one) of program modules 13205, such program modules 13205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1330 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1300 may also communicate with one or more external devices 1400 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1300 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 1350. Also, the electronic device 1300 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through the network adapter 1360. As shown, the network adapter 1360 communicates with other modules of the electronic device 1300 via the bus 1330. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when the program product is run on the terminal device.
According to the program product for realizing the method, the portable compact disc read only memory (CD-ROM) can be adopted, the program code is included, and the program product can be operated on terminal equipment, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (16)

1. A method for training an emotion recognition model, wherein the emotion recognition model comprises a gesture branch model and an action branch model, and the method comprises the following steps:
acquiring a gesture branch sample and an action branch sample based on a video sequence;
training the gesture branch model according to the gesture branch sample, and obtaining first prediction information and emotion regression limit information;
training the action branch model according to the action branch sample, and obtaining second prediction information;
determining a first branch loss function according to the gesture branch sample, the first prediction information and the emotion regression limit information, so as to optimize the gesture branch model according to the first branch loss function;
determining a second branch loss function according to the action branch sample and the second prediction information to optimize the action branch model according to the second branch loss function;
and generating the emotion recognition model according to the optimized gesture branch model and the optimized action branch model.
2. The method for training an emotion recognition model according to claim 1, wherein the gesture branch sample includes a one-hot code representing an emotion category and a multi-dimensional feature representing prior emotional features, and the determining a first branch loss function according to the gesture branch sample, the first prediction information and the emotion regression limit information specifically includes:
constructing a first sub-loss function based on cross entropy according to the one-hot code and the first prediction information;
constructing a second sub-loss function based on mean square error according to the emotion regression limit information and the multi-dimensional feature;
determining the first branch loss function according to the first sub-loss function and the second sub-loss function.
3. The method for training the emotion recognition model according to claim 2, further comprising:
determining an angle feature according to an included angle formed by three adjacent joint points among the human skeletal joint points;
determining a distance feature according to a Euclidean distance between two non-adjacent ones of the human skeletal joint points;
determining an area feature according to the area of the triangle formed by the three adjacent joint points;
and constructing the multi-dimensional feature representing the prior emotional features according to the angle feature, the distance feature and the area feature.
4. The method for training an emotion recognition model according to claim 1, wherein the action branch sample includes a one-hot code representing an emotion category, and the determining a second branch loss function according to the action branch sample and the second prediction information specifically includes:
constructing the second branch loss function based on cross entropy according to the one-hot code and the second prediction information.
5. The method for training an emotion recognition model according to claim 1, wherein the gesture branch sample includes a first gait sequence representing a gesture, and the training of the gesture branch model according to the gesture branch sample to obtain the first prediction information and the emotion regression limit information specifically includes:
inputting the first gait sequence into a first hybrid spatio-temporal graph convolutional network, and outputting a first two-dimensional feature;
performing global average pooling on the first two-dimensional feature, and outputting a first depth feature;
inputting the first depth feature into a first fully-connected layer, and outputting the emotion regression limit information;
inputting the first depth feature into a second fully-connected layer, and outputting a first prediction vector;
and carrying out normalization processing on the first prediction vector to obtain the first prediction information.
6. The method for training an emotion recognition model according to claim 1, wherein the action branch sample includes a second gait sequence representing motion, and the training of the action branch model according to the action branch sample to obtain the second prediction information specifically includes:
inputting the second gait sequence into a second hybrid spatio-temporal graph convolutional network, and outputting a second two-dimensional feature;
performing global average pooling on the second two-dimensional feature, and outputting a second depth feature;
inputting the second depth feature into a third fully-connected layer, and outputting a second prediction vector;
and carrying out normalization processing on the second prediction vector to obtain the second prediction information.
7. The method for training the emotion recognition model according to any one of claims 1 to 6, wherein the acquiring of the gesture branch sample and the action branch sample based on the video sequence specifically includes:
extracting joint point coordinates of a target in the video sequence;
constructing a first gait sequence representing a gesture and a second gait sequence representing motion according to the joint point coordinates;
generating the gesture branch sample based on the first gait sequence, the one-hot code representing the emotion category and the multi-dimensional feature representing the prior emotional features;
and generating the action branch sample based on the second gait sequence and the one-hot code representing the emotion category.
8. A target emotion recognition method, comprising the following steps:
extracting joint point coordinates of a target from a video sequence;
constructing a first gait sequence representing a gesture and a second gait sequence representing motion according to the joint point coordinates;
inputting the first gait sequence into a gesture branch model, and outputting a first emotion score;
inputting the second gait sequence into an action branch model, and outputting a second emotion score;
and determining the emotion recognition result of the target according to the first emotion score and the second emotion score.
9. The method for target emotion recognition according to claim 8, wherein the extracting of the joint point coordinates of the target from the video sequence specifically comprises:
extracting human body joint points of the target based on a joint point extraction network;
and obtaining the joint point coordinates of the target according to the dimension attribute of each joint point, the time sequence length of the human body joint points and the number of the human body joint points included in each frame in the video sequence.
10. The method for target emotion recognition according to claim 9, wherein the constructing a first gait sequence representing a gesture according to the joint point coordinates specifically includes:
determining a first dimension corresponding to the gesture;
extracting the spatial position coordinates of the human body joint points in the dimension attributes according to the first dimension;
constructing the first gait sequence based on the spatial position coordinates, the time sequence length and the number of the human body joint points.
11. The method for target emotion recognition according to claim 9, wherein the constructing a second gait sequence representing motion according to the joint point coordinates specifically includes:
determining a second dimension corresponding to the action;
extracting a multi-dimensional motion vector of the human body joint point in the dimension attribute according to the second dimension;
constructing the second gait sequence based on the multi-dimensional motion vector, the time sequence length and the number of the human body joint points.
12. The method for target emotion recognition according to any one of claims 8 to 11, wherein the determining the emotion recognition result of the target according to the first emotion score and the second emotion score specifically comprises:
calculating a mean of the first emotion score and the second emotion score;
and determining the emotion type corresponding to the mean value, and determining the emotion type as the emotion recognition result.
13. An emotion recognition model training apparatus, wherein the emotion recognition model includes a gesture branch model and an action branch model, the training apparatus comprising:
the acquisition module is used for acquiring a gesture branch sample and an action branch sample based on a video sequence;
the first training module is used for training the gesture branch model according to the gesture branch sample and obtaining first prediction information and emotion regression limit information;
the second training module is used for training the action branch model according to the action branch sample and obtaining second prediction information;
the first optimization module is used for determining a first branch loss function according to the gesture branch sample, the first prediction information and the emotion regression limit information, so as to optimize the gesture branch model according to the first branch loss function;
the second optimization module is used for determining a second branch loss function according to the action branch sample and the second prediction information, so as to optimize the action branch model according to the second branch loss function;
and the generating module is used for generating the emotion recognition model according to the optimized gesture branch model and the optimized action branch model.
14. A target emotion recognition apparatus, comprising:
the extraction module is used for extracting the joint point coordinates of the target in the video sequence;
the construction module is used for constructing a first gait sequence representing a gesture and a second gait sequence representing motion according to the joint point coordinates;
the first processing module is used for inputting the first gait sequence into a gesture branch model and outputting a first emotion score;
the second processing module is used for inputting the second gait sequence into an action branch model and outputting a second emotion score;
and the determining module is used for determining the emotion recognition result of the target according to the first emotion score and the second emotion score.
15. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the method for training the emotion recognition model according to any one of claims 1 to 7 and/or the method for recognizing the target emotion according to any one of claims 8 to 12 by executing the executable instructions.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a method for training an emotion recognition model as claimed in any one of claims 1 to 7 and/or a method for target emotion recognition as claimed in any one of claims 8 to 12.
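As an illustration of claims 1, 2 and 4, the following is a minimal PyTorch sketch of the two branch losses: a cross-entropy term between the one-hot emotion label and the normalized prediction, plus (for the gesture branch) a mean-square-error term between the emotion regression limit information and the multi-dimensional prior emotional features. The function names and the balancing weight reg_weight are illustrative assumptions; the claims do not state how the two sub-losses are combined.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_onehot(probs, onehot, eps=1e-8):
    # Cross entropy between a normalized prediction vector and a one-hot label.
    return -(onehot * torch.log(probs + eps)).sum(dim=1).mean()

def first_branch_loss(pred_probs, onehot, regressed_limits, prior_features, reg_weight=1.0):
    # Claim 2: first sub-loss = cross entropy on the emotion category,
    # second sub-loss = MSE between the regressed emotion limits and the
    # multi-dimensional prior emotional features. reg_weight is an assumed
    # balancing coefficient.
    cls_loss = cross_entropy_with_onehot(pred_probs, onehot)
    reg_loss = F.mse_loss(regressed_limits, prior_features)
    return cls_loss + reg_weight * reg_loss

def second_branch_loss(pred_probs, onehot):
    # Claim 4: the action branch is trained with cross entropy alone.
    return cross_entropy_with_onehot(pred_probs, onehot)
```

In training, each branch would be optimized against its own loss, and the two optimized branches then assembled into the emotion recognition model as recited in claim 1.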
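For claim 3, a sketch of how the multi-dimensional prior emotional features could be assembled from angle, distance and area terms, assuming 3-D joint coordinates. The index lists (which joint triplets and pairs are used) are assumptions, since the claim does not enumerate them.

```python
import numpy as np

def angle_feature(a, b, c):
    # Angle at joint b formed by the segments b->a and b->c (three adjacent joints).
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def distance_feature(p, q):
    # Euclidean distance between two non-adjacent joints.
    return float(np.linalg.norm(p - q))

def area_feature(a, b, c):
    # Area of the triangle spanned by three adjacent joints.
    return float(0.5 * np.linalg.norm(np.cross(b - a, c - a)))

def prior_emotion_features(joints, angle_triplets, distance_pairs, area_triplets):
    # joints: (V, 3) array of joint point coordinates for one frame.
    feats = [angle_feature(*joints[list(t)]) for t in angle_triplets]
    feats += [distance_feature(*joints[list(p)]) for p in distance_pairs]
    feats += [area_feature(*joints[list(t)]) for t in area_triplets]
    return np.asarray(feats, dtype=np.float32)
```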
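The branch structure of claims 5 and 6 can be sketched as follows, again under stated assumptions: `backbone` stands in for the hybrid spatio-temporal graph convolutional network, which the claims do not specify in detail, and is assumed to map a gait sequence of shape (N, C, T, V) to a 2-D feature map of shape (N, channels, T', V').

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureBranch(nn.Module):
    # Sketch of claim 5: backbone -> global average pooling -> two heads.
    def __init__(self, backbone, channels, num_classes, limit_dim):
        super().__init__()
        self.backbone = backbone
        self.fc_limit = nn.Linear(channels, limit_dim)    # first fully-connected layer
        self.fc_class = nn.Linear(channels, num_classes)  # second fully-connected layer

    def forward(self, gait_seq):
        feat_2d = self.backbone(gait_seq)                     # first two-dimensional feature
        depth = F.adaptive_avg_pool2d(feat_2d, 1).flatten(1)  # global average pooling -> first depth feature
        limits = self.fc_limit(depth)                         # emotion regression limit information
        probs = F.softmax(self.fc_class(depth), dim=1)        # normalized first prediction information
        return probs, limits

class ActionBranch(nn.Module):
    # Sketch of claim 6: same structure, classification head only.
    def __init__(self, backbone, channels, num_classes):
        super().__init__()
        self.backbone = backbone
        self.fc_class = nn.Linear(channels, num_classes)  # third fully-connected layer

    def forward(self, gait_seq):
        feat_2d = self.backbone(gait_seq)                     # second two-dimensional feature
        depth = F.adaptive_avg_pool2d(feat_2d, 1).flatten(1)  # second depth feature
        return F.softmax(self.fc_class(depth), dim=1)         # second prediction information
```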
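Claims 7 and 9 to 11 construct the two gait sequences from the extracted joint point coordinates. A minimal sketch is given below; the (T, V, 3) layout and the use of frame-to-frame displacement as the multi-dimensional motion vector are assumptions, since the claims only require a pose part and a motion part derived from the dimension attributes of the joint points.

```python
import numpy as np

def build_gait_sequences(joint_coords):
    # joint_coords: (T, V, 3) spatial positions of V human joint points over a
    # time sequence of length T, as produced by a joint point extraction network.
    first_gait = joint_coords                                          # positions -> gesture branch input
    motion = np.diff(joint_coords, axis=0, prepend=joint_coords[:1])   # per-frame displacement
    second_gait = motion                                               # motion vectors -> action branch input
    return first_gait, second_gait
```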
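Finally, claims 8 and 12 fuse the two branch outputs at recognition time by averaging the emotion scores and mapping the mean back to a category. A sketch under assumptions: the category set EMOTION_LABELS is illustrative, batch size 1 is assumed, and the gait sequences are assumed to have already been converted to the (N, C, T, V) tensors the branch models expect.

```python
import torch

EMOTION_LABELS = ["happy", "sad", "angry", "neutral"]  # assumed category set

@torch.no_grad()
def recognize_emotion(gesture_branch, action_branch, first_gait, second_gait):
    # Run both branches, average the two emotion scores, and return the
    # emotion category corresponding to the mean (claim 12).
    score_1, _ = gesture_branch(first_gait)   # first emotion score
    score_2 = action_branch(second_gait)      # second emotion score
    mean_score = (score_1 + score_2) / 2
    return EMOTION_LABELS[int(mean_score.argmax(dim=1))]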
CN202110516757.6A 2021-05-12 2021-05-12 Training method, recognition method, device, electronic equipment and readable storage medium Pending CN113239799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516757.6A CN113239799A (en) 2021-05-12 2021-05-12 Training method, recognition method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113239799A true CN113239799A (en) 2021-08-10

Family

ID=77133681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516757.6A Pending CN113239799A (en) 2021-05-12 2021-05-12 Training method, recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113239799A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019062414A1 (en) * 2017-09-30 2019-04-04 Oppo广东移动通信有限公司 Method and apparatus for managing and controlling application program, storage medium, and electronic device
CN112200165A (en) * 2020-12-04 2021-01-08 北京软通智慧城市科技有限公司 Model training method, human body posture estimation method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHOU Yiqiao; XU Yulin: "Real-time human posture recognition in complex environments based on bidirectional LSTM", Chinese Journal of Scientific Instrument, no. 03 *
WANG Xiaohua; HOU Dengyong; HU Min; REN Fuji: "Bimodal emotion recognition based on composite spatio-temporal features", Journal of Image and Graphics, no. 01 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780091A (en) * 2021-08-12 2021-12-10 西安交通大学 Video emotion recognition method based on body posture change expression
CN113780091B (en) * 2021-08-12 2023-08-22 西安交通大学 Video emotion recognition method based on body posture change representation
CN116089602A (en) * 2021-11-04 2023-05-09 腾讯科技(深圳)有限公司 Information processing method, apparatus, electronic device, storage medium, and program product


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination