CN112241715A - Model training method, expression recognition method, device, equipment and storage medium - Google Patents

Model training method, expression recognition method, device, equipment and storage medium

Info

Publication number
CN112241715A
Authority
CN
China
Prior art keywords
sample
expression
training
sample images
expression recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011146349.8A
Other languages
Chinese (zh)
Inventor
王珂尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011146349.8A priority Critical patent/CN112241715A/en
Publication of CN112241715A publication Critical patent/CN112241715A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method, an expression recognition method, an apparatus, a device and a storage medium, and relates to the field of artificial intelligence. The specific implementation scheme is as follows: acquiring a plurality of sample images, wherein the sample images comprise face areas; determining expression characteristics of the facial regions contained in the sample images, and classifying the sample images based on the expression characteristics to obtain at least two sample sets, wherein the expression characteristics of the sample images in different sample sets are different; determining a sampling probability for each sample set based at least on the number of sample images in the sample set; and extracting sample images from the sample sets as training samples based on the sampling probabilities, and training an expression recognition model using at least the training samples. Therefore, the problem of sample imbalance is solved, and a foundation is laid for improving the recognition rate of the expression recognition model.

Description

Model training method, expression recognition method, device, equipment and storage medium
Technical Field
The application relates to the field of data processing, in particular to the field of artificial intelligence, and specifically relates to the technical field of computer vision and deep learning.
Background
Facial expression data from real scenes are very limited; in particular, expressions such as fear and disgust are difficult to collect, while calm and happy expressions are relatively easy to obtain. The resulting samples are therefore imbalanced, and when an expression recognition model is trained on such samples, the imbalance inevitably reduces the model's recognition rate.
Disclosure of Invention
The application provides a model training method, an expression recognition method, an apparatus, a device and a storage medium.
According to an aspect of the present application, there is provided a model training method, including:
acquiring a plurality of sample images, wherein the sample images comprise face areas;
determining expression characteristics of facial regions contained in the sample images, and classifying the sample images based on the expression characteristics to obtain at least two sample sets, wherein the expression characteristics of the sample images in different sample sets are different;
determining a sampling probability for the sample set based at least on a number of sample images in the sample set;
and extracting sample images from the sample set as training samples based on the sampling probability, and training an expression recognition model by using at least the training samples.
According to another aspect of the present application, there is provided an expression recognition method including:
acquiring a facial image to be subjected to expression recognition;
inputting the facial image into an expression recognition model, and outputting expression features matched with the facial image; wherein the expression recognition model is obtained by training with the model training method described above.
According to still another aspect of the present application, there is provided a model training apparatus including:
the system comprises a sample image acquisition unit, a processing unit and a display unit, wherein the sample image acquisition unit is used for acquiring a plurality of sample images, and the sample images contain face areas;
the classification processing unit is used for determining expression characteristics of facial regions contained in the sample images, classifying the sample images based on the expression characteristics to obtain at least two sample sets, wherein the expression characteristics of the sample images in different sample sets are different;
a sampling probability determination unit for determining a sampling probability for the sample set based on at least the number of sample images in the sample set;
and the model training unit is used for extracting sample images from the sample set as training samples based on the sampling probability and training the expression recognition model by using at least the training samples.
According to still another aspect of the present application, there is provided an expression recognition apparatus including:
the image processing device comprises a to-be-processed image determining unit, a face recognition unit and a face recognition unit, wherein the to-be-processed image determining unit is used for acquiring a face image to be subjected to expression recognition;
the expression recognition unit is used for outputting expression characteristics matched with the facial image after the facial image is input into an expression recognition model; wherein the expression recognition model is obtained by training with the model training method described above.
According to still another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method described above, or to perform the expression recognition method described above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the model training method described above, or to perform the expression recognition method described above.
According to the technology of the application, the problem of sample imbalance is solved, laying a foundation for improving the recognition rate of the expression recognition model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart of an implementation of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an implementation of a model training method in a specific example according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an implementation of an expression recognition method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an expression recognition apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of an electronic device for implementing a model training method or an expression recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present application provides a model training method, and specifically, fig. 1 is a schematic flow chart of an implementation of the model training method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
step S101: a plurality of sample images are obtained, wherein the sample images comprise face areas.
Step S102: determining expression characteristics of facial regions contained in the sample images, and classifying the sample images based on the expression characteristics to obtain at least two sample sets, wherein the expression characteristics of the sample images in different sample sets are different.
Step S103: determining a sampling probability for the sample set based at least on a number of sample images in the sample set.
Step S104: extracting sample images from the sample set as training samples based on the sampling probability, and training an expression recognition model using at least the training samples.
Therefore, the sampling probability of the sample set can be determined based on the number of the sample images in the sample set, so that the problem of sample imbalance is solved from the sampling dimension, and a foundation is laid for subsequently improving the recognition rate of the expression recognition model.
In the scheme of the application, because the number of the sample images in different sample sets is different, the sampling probabilities corresponding to different sample sets can be different, so that the problem of sample imbalance is solved from the sampling dimension.
In an example, the facial area may be a human face area. In this case, the expression features may be facial expressions such as Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral. The sample images are then classified based on these expression features: images with the same expression are placed into the same sample set, and images with different expressions into different sample sets, so that the expression features of the sample images differ between sample sets. This yields a plurality of sample sets, each containing at least one sample image.
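As a concrete illustration of this classification step, the sketch below groups labeled sample images into per-expression sample sets. It is a minimal sketch assuming each sample is an (image, label) pair; the names `build_sample_sets` and `EXPRESSIONS` are illustrative, not from the patent.

```python
from collections import defaultdict

# Seven basic expression classes, as listed above.
EXPRESSIONS = ["Anger", "Disgust", "Fear", "Happiness",
               "Sadness", "Surprise", "Neutral"]

def build_sample_sets(samples):
    """Partition (image_path, expression_label) pairs into one sample set
    per expression, so that images with the same expression share a set."""
    sample_sets = defaultdict(list)
    for image_path, label in samples:
        sample_sets[label].append(image_path)
    return sample_sets
```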
It should be noted that the facial region may also be the face of another living body, such as an animal; the scheme of the present application is not limited to human facial expression recognition, and it suffices to acquire different sample images for different recognition objects.
In a specific example of the scheme of the application, a first loss function is set behind a full connection layer of the expression recognition model, and is used for determining a loss value of feature information output by the full connection layer; the weight of the parameter for characterizing each type of expressive feature in the first loss function is associated with the sampling probability. Therefore, the problem of sample imbalance is solved from the loss function dimension, and a foundation is laid for further improving the recognition rate of the expression recognition model.
In practical applications, the first loss function may specifically be a focal loss function, which is disposed after the fully connected layer of the expression recognition model and used for calculating a loss value of the feature information output by the fully connected layer. The weight of the parameter corresponding to an expression feature in the focal loss function is related to the sampling probability; for example, the weight for an expression feature in the focal loss function equals the sampling probability of the sample set corresponding to that expression feature. This increases the loss weight of classes with fewer samples in the training set, thereby realizing sample equalization and laying a foundation for further improving the recognition rate of the expression recognition model.
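A minimal PyTorch sketch of such a weighted focal loss follows. The coupling of the per-class weight `alpha` to the sampling probabilities mirrors the description above; the patent only states the two are related, so `alpha` and `gamma` here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class WeightedFocalLoss(torch.nn.Module):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with
    per-class weights alpha set from the sampling probabilities (an
    illustrative choice; the patent only says they are related)."""
    def __init__(self, alpha, gamma=2.0):
        super().__init__()
        # alpha: one weight per expression class, e.g. the sampling
        # probability of the corresponding sample set.
        self.register_buffer("alpha", torch.as_tensor(alpha, dtype=torch.float))
        self.gamma = gamma

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=-1)              # (N, C)
        log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()                                      # p_t per sample
        alpha_t = self.alpha[targets]                          # class weight
        return (-alpha_t * (1.0 - pt) ** self.gamma * log_pt).mean()
```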
In a specific example of the solution of the present application, a second loss function is set after the last convolutional layer of the expression recognition model, and is used for determining a loss value of the feature information output by the last convolutional layer. Therefore, the problem of sample imbalance is solved from the loss function dimension, and a foundation is laid for further improving the recognition rate of the expression recognition model.
In practical applications, the second loss function may specifically be a center loss function. The center loss function maintains a center for each category (i.e., each sample set) and minimizes the distance between the sample images within a category and that category's center, reducing the intra-category distance and thereby increasing the feature difference between categories. Specifically, the center loss function is provided after the last convolutional layer of the expression recognition model and is used for calculating a loss value of the feature information output by the last convolutional layer. This lays a foundation for realizing sample balance and further improving the recognition rate of the expression recognition model.
It should be noted that, in the actual training process, the two loss functions may be used alternatively, that is, only the first loss function is used, or only the second loss function is used; alternatively, both of the first and second loss functions may be used, and in this case, the loss values of the two loss functions may be added as the final loss value, and thus, model training may be performed based on the added loss values until the model converges.
In a specific example of the scheme of the application, weights are correspondingly set for parameters characterizing various expression features in the second loss function, and the weights of the parameters characterizing various expression features in the second loss function are associated with the sampling probability. That is, similar to the first loss function, the second loss function may also set a weight for the parameter of the expressive feature, for example, a weight for the parameter of the expressive feature is set in the center loss function, in this case, the weight for the parameter corresponding to the expressive feature in the center loss function is related to the sampling probability, such as equal to the sampling probability of the sample set corresponding to the kind of expressive feature. Therefore, the problem of sample imbalance is solved from the loss function dimension, and a foundation is laid for further improving the recognition rate of the expression recognition model.
It should be noted that, in the actual training process, the example method can be used to set the weight for the second loss function, regardless of whether the second loss function is used alone or in a scenario in which the first loss function and the second loss function are used together.
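For concreteness, a sketch of a center loss with such per-class weights is given below. The coupling of the weights to the sampling probabilities is an assumption consistent with the text, and `feat_dim` stands for the dimensionality of the features after the last convolutional layer.

```python
import torch

class WeightedCenterLoss(torch.nn.Module):
    """Center loss with one learnable center per class and a per-class
    weight, e.g. the sampling probability of that class's sample set
    (an illustrative coupling, as described above)."""
    def __init__(self, num_classes, feat_dim, class_weights):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.register_buffer("w", torch.as_tensor(class_weights, dtype=torch.float))

    def forward(self, feats, targets):
        diff = feats - self.centers[targets]                  # (N, feat_dim)
        per_sample = 0.5 * (diff ** 2).sum(dim=1)             # squared distance
        return (self.w[targets] * per_sample).mean()
```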
In a specific example of the scheme of the present application, the determining, in step S103, the sampling probability for the sample set based on at least the number of sample images in the sample set may specifically include:
obtaining the ratio of the sample set to all the sample images based on the ratio of the number of the sample images in the sample set to the total number of all the sample images; determining a sampling probability for the sample set based on the ratio. Therefore, the problem of sample imbalance is solved, the sample balance rate in the training process is improved, and a foundation is laid for improving the recognition rate of the expression recognition model.
Based on this example scheme, a larger weight can be set for a sample set with a smaller number of samples, i.e., data enhancement (data augmentation) is performed, so that the problem of sample imbalance is alleviated. For example, taking facial expressions as an example, facial expressions can be classified into 7 types of basic expressions according to changes of facial muscles: Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral. Based on this, the acquired sample images are classified according to the 7 types of expressions to obtain seven sample sets. Suppose the numbers of sample images in the 7 sample sets are n_1, n_2, n_3, n_4, n_5, n_6, n_7 respectively, and the total number of samples (the number of all sample images) is n. The probability (i.e., sampling probability) of performing random data enhancement processing on each sample set may then be set so that it grows as the class proportion n_i/n shrinks, for example

p_i = (1 - n_i/n) / Σ_{j=1..7} (1 - n_j/n), i = 1, ..., 7.

Random sampling is then carried out according to these data-enhancement probabilities: if the proportion of samples in a certain sample set is small, the probability of that set being randomly sampled during training is larger, and vice versa, so that classes with a smaller sample proportion are expanded and sample balance is achieved. For example, sample images of expressions such as fear and disgust are fewer; with the sampling probabilities set this way, such samples are extracted with higher probability, which increases the chance that they serve as training images during mini-batch random sampling, thereby equalizing the samples.
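The following sketch turns per-class counts into sampling probabilities and feeds them to PyTorch's WeightedRandomSampler for mini-batch sampling. The inverse-proportion formula is the reconstruction used above, not necessarily the patent's exact formula; counts and labels are toy values.

```python
from torch.utils.data import WeightedRandomSampler

def class_sampling_probs(counts):
    """Sampling probability per class; classes with fewer samples get a
    higher probability (inverse-proportion form, an assumption here)."""
    n = sum(counts)
    raw = [1.0 - c / n for c in counts]
    s = sum(raw)
    return [r / s for r in raw]

# Toy counts for the seven expression classes.
counts = [5000, 300, 250, 8000, 1200, 900, 7000]
probs = class_sampling_probs(counts)

# Each image inherits the probability of its class, so rare expressions
# (e.g. fear, disgust) are drawn more often during mini-batch sampling.
labels = [i for i, c in enumerate(counts) for _ in range(c)]
weights = [probs[y] for y in labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```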
In a specific example of the scheme of the application, the training of the expression recognition model using at least the training samples in step S104 may specifically include: labeling the expression characteristics of the training samples; and inputting at least the labeled training samples into the expression recognition model for model training. In this way, training of the expression recognition model is completed and a trained expression recognition model is obtained. The trained model can then recognize facial expressions with a high recognition rate and high accuracy, enriching practical application scenarios and improving user experience.
Therefore, the sampling probability of the sample set can be determined based on the number of the sample images in the sample set, so that the problem of sample imbalance is solved from the sampling dimension, and a foundation is laid for subsequently improving the recognition rate of the expression recognition model.
The scheme of the present application is further described in detail with reference to specific examples. To address the low expression recognition accuracy caused by the imbalanced distribution of facial expression samples in real scenes (human facial expression recognition is taken as the example below), a sample-balanced convolutional neural network training method is provided. During training, the number of samples of each class in a mini-batch is balanced (the classes, obtained by classifying on expression features, correspond to the sample sets above), and a larger weight is set for classes with fewer samples in the training set, i.e., data augmentation is performed, which alleviates the sample imbalance in the training set. Meanwhile, a focal loss function (i.e., the first loss function) and a center loss function (i.e., the second loss function) are introduced during training: on the one hand, the focal loss increases the loss weight of hard samples in the training set (the classes with fewer samples in this example can be regarded as hard samples); on the other hand, the center loss reduces the intra-class distance and increases the inter-class distance. The sample imbalance problem is thus alleviated from two dimensions, so that the expression recognition model learns expression information more easily and converges more easily, which can greatly improve the accuracy and robustness of the expression recognition model in complex environments.
In practical application, the improvement of the accuracy of the expression recognition model is also beneficial to improving the service quality of various applications, for example, in the aspect of media information delivery, the method is beneficial to assisting in recommending search results which meet the requirements of users better, and realizing accurate delivery; in the aspect of distance education, the emotion recognition of students is facilitated to improve teaching contents and improve the quality of distance education; in the monitoring scene of the driver, the emotion of the driver can be recognized, and the driver is prompted correspondingly, so that the safety of the driver is guaranteed.
Here, the expression recognition model in this example may specifically employ a convolutional neural network. Specifically, fig. 2 is a schematic flow chart of an implementation of a model training method in a specific example according to an embodiment of the present application; as shown in fig. 2, the steps of the method are as follows:
a series of images (i.e., sample images) containing facial expressions are acquired, wherein in an example, the facial expressions can be classified into 7 types of basic expressions, namely anger (anger), Disgust (dispost), Fear (Fear), happy (happy), Sadness (Sadness), Surprise (surrise) and Neutral (Neutral) according to changes of facial muscles, and the acquired images are classified according to the 7 types of expressions. Defining 72 key points contained in the face at the same time, and respectively recording as (x)1,y1),……,(x72,y72)。
Image preprocessing is performed on each image. Specifically, first, the face region in the image is detected through a detection model to obtain the approximate position region where the face is located; the image corresponding to this region is referred to as the target face image. Here, an existing face detection model may be adopted to detect the face position. Second, face key points of the target face image are detected through a face key point detection model to obtain their coordinate values: an existing face key point detection model is called with the target face image as input, yielding 72 face key point coordinates, namely (x_1, y_1), ..., (x_72, y_72). Then, the face is aligned according to the key point coordinate values and adjusted to a preset size, for example 128×128, obtaining new coordinate values. Finally, normalization processing is performed on the target face image: each pixel is normalized in turn by subtracting 128 from its pixel value and dividing by 256, so that each pixel value lies in [-0.5, 0.5]. It should be noted that, in practical applications, other normalization methods may also be used, and the present application is not limited in this respect.
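A sketch of this preprocessing pipeline is shown below. The detection and key point models are abstracted as callables (`detect_face` and `detect_keypoints` are illustrative stand-ins for the existing models the text mentions), and the alignment step is elided.

```python
import cv2
import numpy as np

def preprocess_face(image, detect_face, detect_keypoints, size=128):
    """Detect, crop, resize and normalize one face image as described
    above. The interfaces of detect_face / detect_keypoints are
    assumptions of this sketch."""
    x, y, w, h = detect_face(image)             # approximate face region
    face = image[y:y + h, x:x + w]
    keypoints = detect_keypoints(face)          # 72 (x_i, y_i) coordinates
    # Alignment from the key points is elided here; after alignment the
    # crop is resized to the fixed training resolution.
    face = cv2.resize(face, (size, size))
    # Normalize each pixel to [-0.5, 0.5]: (p - 128) / 256.
    return (face.astype(np.float32) - 128.0) / 256.0
```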
Data enhancement processing is performed on the normalized target face image. Specifically, the intensity of random data enhancement is set according to the proportion of each type of sample among all samples. For example, if the numbers of samples of the 7 types of expressions in all images are n_1, n_2, ..., n_7 respectively, and the total number of samples is n, the probability (i.e., sampling probability) of performing random data enhancement processing on each class can be set as described above, for example p_i = (1 - n_i/n) / Σ_{j=1..7} (1 - n_j/n). Random sampling is then carried out according to these probabilities: if the proportion of samples of a certain class is small, the probability of that class being randomly sampled during training is larger, and vice versa, so that classes with a smaller sample proportion are expanded and sample balance is achieved. For example, sample images of expressions such as fear and disgust are fewer; with the sampling probabilities set this way, such samples are extracted with higher probability, which increases the chance that they serve as training images during mini-batch random sampling, thereby equalizing the samples.
The convolutional neural network is then trained based on the images after data enhancement processing. In the training process of this example, the VGG11 structure may be used as the feature extraction network, and focal loss supervision is performed after the last fully connected layer to increase the loss weight of hard samples in the training data, which greatly improves the model's recognition capability on classes with few samples (i.e., hard samples).
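A sketch of such a network follows: a VGG11 feature extractor with a small fully connected head over the 7 expression classes. The head's layer sizes are assumptions; the text specifies only the VGG11 backbone.

```python
import torch
import torchvision

class ExpressionNet(torch.nn.Module):
    """VGG11 features + fully connected classifier over 7 expressions.
    Returns both the conv features (for the center loss) and the logits
    of the last fully connected layer (for the focal loss)."""
    def __init__(self, num_classes=7):
        super().__init__()
        backbone = torchvision.models.vgg11(weights=None)
        self.features = backbone.features               # conv layers
        self.pool = torch.nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(512 * 7 * 7, 256),          # sizes are assumptions
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(256, num_classes),
        )

    def forward(self, x):
        f = self.features(x)
        f = torch.flatten(self.pool(f), 1)              # last-conv features
        return f, self.classifier(f)
```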
In order to further realize sample equalization, this example also uses a center loss function, i.e., the center loss function is added on the basis of the focal loss function. The center loss function maintains a class center for each class and minimizes the distance between the samples within a class and that class center, so as to reduce the intra-class distance and thereby increase the feature difference between classes. The center loss formula is shown below, where c_{y_i} denotes the feature center of the y_i-th class, x_i denotes the feature of the i-th sample, and m denotes the size of the mini-batch:

L_C = (1/2) Σ_{i=1}^{m} ||x_i - c_{y_i}||_2^2
In this example, the center loss function is provided after the last convolutional layer of the expression recognition model, and is used to calculate a loss value for the feature information output by the last convolutional layer. The focal loss function is arranged behind a full connection layer of the expression recognition model and used for calculating loss values of characteristic information output by the full connection layer. The two loss values are then added as the final loss value, and thus model training is performed based on the loss values until the model converges.
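Putting the pieces together, the sketch below defines the unweighted center loss from the formula above and one training step that simply adds the two loss values, as the text describes. The optimizer is assumed to also cover the center-loss centers; the model is assumed to return (features, logits), as in the ExpressionNet sketch above.

```python
import torch

class CenterLoss(torch.nn.Module):
    """L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2, per the formula above."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, targets):
        diff = feats - self.centers[targets]
        return 0.5 * (diff ** 2).sum(dim=1).mean()

def train_step(model, focal_loss, center_loss, optimizer, images, targets):
    """One step: final loss = focal(logits) + center(features), added
    directly; training repeats until the model converges."""
    feats, logits = model(images)
    loss = focal_loss(logits, targets) + center_loss(feats, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```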
It should be noted that, first, the weight of the parameter corresponding to an expression in the focal loss function is related to the sampling probability; for example, the weight for an expression in the focal loss function may equal the sampling probability of that expression's sample set. Second, the center loss formula above does not include per-expression weights; in practical application, such weights may also be set similarly to the focal loss, in which case the weight for an expression in the center loss function is related to the sampling probability, e.g., equal to the sampling probability of that expression's sample set.
In this way, compared with existing training that does not account for sample imbalance, the scheme of the present application achieves sample balance without manually adding sample images, thereby improving the recognition accuracy of the expression recognition model.
The scheme of the application also provides an expression recognition method, as shown in fig. 3, the method includes:
step S301: and acquiring a facial image to be subjected to expression recognition.
Step S302: inputting the facial image into an expression recognition model, and outputting expression features matched with the facial image; wherein the expression recognition model is obtained by training with the model training method described above.
Therefore, the expression recognition model is obtained after sample equalization processing, so that the expression characteristics obtained by the expression recognition model are more accurate, and a foundation is laid for enriching the use scenes of the expression recognition model and improving the user experience.
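For illustration, a minimal inference sketch follows, under the same assumptions as the earlier sketches: the model returns (features, logits), and `preprocess_face`, `detect_face` and `detect_keypoints` are the illustrative helpers introduced above.

```python
import torch

EXPRESSIONS = ["Anger", "Disgust", "Fear", "Happiness",
               "Sadness", "Surprise", "Neutral"]

@torch.no_grad()
def recognize_expression(model, face_image, detect_face, detect_keypoints):
    """Preprocess a face image as during training, run the trained
    expression recognition model, and return the matched expression."""
    x = preprocess_face(face_image, detect_face, detect_keypoints)  # HWC float32
    x = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)           # NCHW batch
    _, logits = model(x)
    return EXPRESSIONS[logits.argmax(dim=1).item()]
```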
It should be noted that the scheme of the application can also be applied to the fields of visual interaction, intelligent control, driving assistance, remote education, accurate advertisement putting and the like.
The present application further provides a model training apparatus, as shown in fig. 4, including:
a sample image acquiring unit 401 configured to acquire a plurality of sample images, wherein the sample images include face regions;
a classification processing unit 402, configured to determine expression features of facial regions included in the sample images, and classify the multiple sample images based on the expression features to obtain at least two sample sets, where the expression features of the sample images in different sample sets are different;
a sampling probability determination unit 403 for determining a sampling probability for the sample set based on at least the number of sample images in the sample set;
a model training unit 404, configured to extract sample images from the sample set based on the sampling probability as training samples, and train the expression recognition model using at least the training samples.
In a specific example of the scheme of the application, a first loss function is set behind a full connection layer of the expression recognition model, and is used for determining a loss value of feature information output by the full connection layer; the weight of the parameter for characterizing each type of expressive feature in the first loss function is associated with the sampling probability.
In a specific example of the solution of the present application, a second loss function is set after the last convolutional layer of the expression recognition model, and is used for determining a loss value of the feature information output by the last convolutional layer.
In a specific example of the scheme of the application, weights are correspondingly set for parameters characterizing various expression features in the second loss function, and the weights of the parameters characterizing various expression features in the second loss function are associated with the sampling probability.
In a specific example of the scheme of the present application, the sampling probability determination unit includes:
the calculating subunit is used for obtaining the ratio of the sample set to all the sample images based on the ratio of the number of the sample images in the sample set to the total number of all the sample images;
a determination subunit to determine a sampling probability for the sample set based on the ratio.
In a specific example of the solution of the present application, the model training unit includes:
the labeling subunit is used for labeling the expression characteristics of the training sample;
and the model training subunit is used for inputting the training sample subjected to the labeling processing into the expression recognition model at least so as to perform model training.
The present application further provides an expression recognition apparatus, as shown in fig. 5, including:
a to-be-processed image determining unit 501, configured to acquire a facial image to be subjected to expression recognition;
an expression recognition unit 502, configured to output expression features matched with the facial image after the facial image is input to an expression recognition model; wherein the expression recognition model is obtained by training with the model training method described above.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided. It should be noted that, in practical applications, the electronic device applied to the model training method and the electronic device applied to the expression recognition method may have similar structures, and similarly, the readable storage medium may have a similar structure, so to avoid repetition, the electronic device or the readable storage medium applied to the two methods are not separately described, that is, the electronic device and the readable storage medium described below may be applied to the model training method or the expression recognition method.
Fig. 6 is a block diagram of an electronic device according to the model training method or the expression recognition method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform a model training method or an expression recognition method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform a model training method or an expression recognition method provided herein.
The memory 602 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the model training method or the expression recognition method in the embodiments of the present application (for example, the sample image acquisition unit 401, the classification processing unit 402, the sampling probability determination unit 403, and the model training unit 404 shown in fig. 4, or the to-be-processed image determination unit 501 and the expression recognition unit 502 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, namely, implements the model training method or the expression recognition method in the above method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device by the model training method or the expression recognition method, or the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 may optionally include a memory remotely located from the processor 601, and these remote memories may be connected to the electronic device of the model training method or the expression recognition method through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the model training method or the expression recognition method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the model training method or the expression recognition method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and Virtual Private Server (VPS) service. The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to the technical scheme of the embodiment of the application, the sampling probability of the sample set can be determined based on the number of the sample images in the sample set, so that the problem of sample imbalance is solved from the sampling dimension, and a foundation is laid for subsequently improving the recognition rate of the expression recognition model.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and the present invention is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. A model training method, comprising:
acquiring a plurality of sample images, wherein the sample images comprise face areas;
determining expression characteristics of facial regions contained in the sample images, and classifying the sample images based on the expression characteristics to obtain at least two sample sets, wherein the expression characteristics of the sample images in different sample sets are different;
determining a sampling probability for the sample set based at least on a number of sample images in the sample set;
and extracting sample images from the sample set as training samples based on the sampling probability, and training an expression recognition model by using at least the training samples.
2. The method according to claim 1, wherein a first loss function is arranged after a full connection layer of the expression recognition model and used for determining a loss value of feature information output by the full connection layer; the weight of the parameter for characterizing each type of expressive feature in the first loss function is associated with the sampling probability.
3. The method according to claim 1 or 2, wherein a last convolutional layer of the expression recognition model is followed by a second loss function for determining a loss value of the characteristic information output by the last convolutional layer.
4. The method of claim 3, wherein the parameters characterizing the various types of expressive features in the second loss function are provided with weights, and the weights of the parameters characterizing the various types of expressive features in the second loss function are associated with the sampling probability.
5. The method of claim 1, wherein the determining a sampling probability for the sample set based at least on a number of sample images in the sample set comprises:
obtaining the ratio of the sample set to all the sample images based on the ratio of the number of the sample images in the sample set to the total number of all the sample images;
determining a sampling probability for the sample set based on the ratio.
6. The method of claim 1, wherein the training of the expression recognition model with at least the training samples comprises:
labeling the expression characteristics of the training sample;
and at least inputting the training sample subjected to labeling processing into the expression recognition model to perform model training.
7. An expression recognition method, comprising:
acquiring a facial image to be subjected to expression recognition;
inputting the facial image into an expression recognition model, and outputting expression features matched with the facial image; wherein the expression recognition model is obtained after being trained by the method of any one of claims 1 to 6.
8. A model training apparatus comprising:
the system comprises a sample image acquisition unit, a processing unit and a display unit, wherein the sample image acquisition unit is used for acquiring a plurality of sample images, and the sample images contain face areas;
the classification processing unit is used for determining expression characteristics of facial regions contained in the sample images, classifying the sample images based on the expression characteristics to obtain at least two sample sets, wherein the expression characteristics of the sample images in different sample sets are different;
a sampling probability determination unit for determining a sampling probability for the sample set based on at least the number of sample images in the sample set;
and the model training unit is used for extracting sample images from the sample set as training samples based on the sampling probability and training the expression recognition model by using at least the training samples.
9. The device of claim 8, wherein a first loss function is arranged after a full connection layer of the expression recognition model and used for determining a loss value of feature information output by the full connection layer; the weight of the parameter for characterizing each type of expressive feature in the first loss function is associated with the sampling probability.
10. The apparatus according to claim 8 or 9, wherein a second loss function is provided after a last convolutional layer of the expression recognition model, and is used for determining a loss value of the characteristic information output by the last convolutional layer.
11. The apparatus of claim 10, wherein the parameters characterizing the various types of expressive features in the second loss function are provided with weights, and the weights of the parameters characterizing the various types of expressive features in the second loss function are associated with the sampling probability.
12. The apparatus of claim 8, wherein the sampling probability determination unit comprises:
the calculating subunit is used for obtaining the ratio of the sample set to all the sample images based on the ratio of the number of the sample images in the sample set to the total number of all the sample images;
a determination subunit to determine a sampling probability for the sample set based on the ratio.
13. The apparatus of claim 8, wherein the model training unit comprises:
the labeling subunit is used for labeling the expression characteristics of the training sample;
and the model training subunit is used for inputting the training sample subjected to the labeling processing into the expression recognition model at least so as to perform model training.
14. An expression recognition apparatus comprising:
the image processing device comprises a to-be-processed image determining unit, a face recognition unit and a face recognition unit, wherein the to-be-processed image determining unit is used for acquiring a face image to be subjected to expression recognition;
the expression recognition unit is used for outputting expression characteristics matched with the facial image after the facial image is input into an expression recognition model; wherein the expression recognition model is obtained after being trained by the method of any one of claims 1 to 6.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6; or, performing the method of claim 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6; or, performing the method of claim 7.
CN202011146349.8A 2020-10-23 2020-10-23 Model training method, expression recognition method, device, equipment and storage medium Pending CN112241715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011146349.8A CN112241715A (en) 2020-10-23 2020-10-23 Model training method, expression recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011146349.8A CN112241715A (en) 2020-10-23 2020-10-23 Model training method, expression recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112241715A true CN112241715A (en) 2021-01-19

Family

ID=74169388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011146349.8A Pending CN112241715A (en) 2020-10-23 2020-10-23 Model training method, expression recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112241715A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128588A (en) * 2021-04-16 2021-07-16 深圳市腾讯网域计算机网络有限公司 Model training method and device, computer equipment and computer storage medium
CN113159101A (en) * 2021-02-23 2021-07-23 北京三快在线科技有限公司 Model obtaining method, user processing method, device and electronic equipment
CN113269678A (en) * 2021-06-25 2021-08-17 石家庄铁道大学 Fault point positioning method for contact network transmission line
CN113420792A (en) * 2021-06-03 2021-09-21 阿波罗智联(北京)科技有限公司 Training method of image model, electronic equipment, road side equipment and cloud control platform
CN113887325A (en) * 2021-09-10 2022-01-04 北京三快在线科技有限公司 Model training method, expression recognition method and device
CN114724226A (en) * 2022-04-25 2022-07-08 中国平安人寿保险股份有限公司 Expression recognition model training method, electronic device and storage medium
CN116912920A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Expression recognition method and device
CN117079256A (en) * 2023-10-18 2023-11-17 南昌航空大学 Fatigue driving detection algorithm based on target detection and key frame rapid positioning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800807A (en) * 2019-01-18 2019-05-24 北京市商汤科技开发有限公司 The training method and classification method and device of sorter network, electronic equipment
CN109902660A (en) * 2019-03-18 2019-06-18 腾讯科技(深圳)有限公司 A kind of expression recognition method and device
CN110852396A (en) * 2019-11-15 2020-02-28 苏州中科华影健康科技有限公司 Sample data processing method for cervical image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800807A (en) * 2019-01-18 2019-05-24 北京市商汤科技开发有限公司 The training method and classification method and device of sorter network, electronic equipment
CN109902660A (en) * 2019-03-18 2019-06-18 腾讯科技(深圳)有限公司 A kind of expression recognition method and device
CN110852396A (en) * 2019-11-15 2020-02-28 苏州中科华影健康科技有限公司 Sample data processing method for cervical image

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159101A (en) * 2021-02-23 2021-07-23 北京三快在线科技有限公司 Model obtaining method, user processing method, device and electronic equipment
CN113128588A (en) * 2021-04-16 2021-07-16 深圳市腾讯网域计算机网络有限公司 Model training method and device, computer equipment and computer storage medium
CN113128588B (en) * 2021-04-16 2024-03-26 深圳市腾讯网域计算机网络有限公司 Model training method, device, computer equipment and computer storage medium
CN113420792A (en) * 2021-06-03 2021-09-21 阿波罗智联(北京)科技有限公司 Training method of image model, electronic equipment, road side equipment and cloud control platform
CN113269678A (en) * 2021-06-25 2021-08-17 石家庄铁道大学 Fault point positioning method for contact network transmission line
CN113887325A (en) * 2021-09-10 2022-01-04 北京三快在线科技有限公司 Model training method, expression recognition method and device
CN114724226A (en) * 2022-04-25 2022-07-08 中国平安人寿保险股份有限公司 Expression recognition model training method, electronic device and storage medium
CN114724226B (en) * 2022-04-25 2024-05-21 中国平安人寿保险股份有限公司 Expression recognition model training method, electronic equipment and storage medium
CN116912920A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Expression recognition method and device
CN116912920B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Expression recognition method and device
CN117079256A (en) * 2023-10-18 2023-11-17 南昌航空大学 Fatigue driving detection algorithm based on target detection and key frame rapid positioning
CN117079256B (en) * 2023-10-18 2024-01-05 南昌航空大学 Fatigue driving detection algorithm based on target detection and key frame rapid positioning

Similar Documents

Publication Publication Date Title
CN112241715A (en) Model training method, expression recognition method, device, equipment and storage medium
CN111639710A (en) Image recognition model training method, device, equipment and storage medium
CN110795569B (en) Method, device and equipment for generating vector representation of knowledge graph
CN111414482A (en) Event argument extraction method and device and electronic equipment
CN111860362A (en) Method and device for generating human face image correction model and correcting human face image
CN111931591A (en) Method and device for constructing key point learning model, electronic equipment and readable storage medium
CN111553428A (en) Method, device, equipment and readable storage medium for training discriminant model
CN112507090B (en) Method, apparatus, device and storage medium for outputting information
CN112560505A (en) Recognition method and device of conversation intention, electronic equipment and storage medium
CN111259671A (en) Semantic description processing method, device and equipment for text entity
CN111539897A (en) Method and apparatus for generating image conversion model
CN112561056A (en) Neural network model training method and device, electronic equipment and storage medium
CN111862031A (en) Face synthetic image detection method and device, electronic equipment and storage medium
CN110418163B (en) Video frame sampling method and device, electronic equipment and storage medium
CN111523467A (en) Face tracking method and device
CN112559715B (en) Attitude identification method, device, equipment and storage medium
CN109933793A (en) Text polarity identification method, apparatus, equipment and readable storage medium storing program for executing
CN112910761A (en) Instant messaging method, device, equipment, storage medium and program product
CN112837466A (en) Bill recognition method, device, equipment and storage medium
CN111768005A (en) Training method and device for lightweight detection model, electronic equipment and storage medium
CN111582139A (en) Sewage outlet identification method and device, electronic equipment and storage medium
CN115292467B (en) Information processing and model training method, device, equipment, medium and program product
CN112200169B (en) Method, apparatus, device and storage medium for training a model
CN112560678A (en) Expression recognition method, device, equipment and computer storage medium
CN112529181A (en) Method and apparatus for model distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination