WO2021139316A1 - Method and apparatus for establishing expression recognition model, and computer device and storage medium - Google Patents

Method and apparatus for establishing expression recognition model, and computer device and storage medium Download PDF

Info

Publication number
WO2021139316A1
WO2021139316A1 (PCT/CN2020/122822; CN2020122822W)
Authority
WO
WIPO (PCT)
Prior art keywords
image data
training
neural network
residual neural
emtionnet
Prior art date
Application number
PCT/CN2020/122822
Other languages
French (fr)
Chinese (zh)
Inventor
张展望 (Zhang Zhanwang)
田笑 (Tian Xiao)
周超勇 (Zhou Chaoyong)
刘玉宇 (Liu Yuyu)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021139316A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, device, computer equipment and storage medium for establishing an expression recognition model.
  • Facial expression recognition is an important field of artificial intelligence, with extremely broad application prospects in visual tasks.
  • in intelligent education, the emotions of students in the classroom are analyzed by applying expression recognition. On this basis, educators can gauge student enthusiasm and classroom effectiveness, respond in a timely manner to both the overall situation and the status of individual students, and flexibly adjust educational interaction and other methods to increase the conversion rate of educational outcomes. Expression recognition is also used in security, smart cities, online education, human-computer interaction, and crime analysis.
  • through cross-cultural research, experts have put forward seven types of basic expressions, namely anger, fear, disgust, happiness, sadness, surprise, and neutral, and have analyzed current deep learning-based expression recognition methods.
  • facial expression recognition requires face detection, face alignment, face normalization, deep feature learning, and facial expression classification.
  • the probabilities of the seven facial expressions are obtained through logistic regression (softmax), and the expression with the highest probability is taken as the current expression.
  • network ensembling such as AdaBoost can be used so that diverse network models complement one another, which brings an obvious improvement; different training functions can also be tried.
  • however, facial expression data is too difficult to obtain, and data annotators are highly subjective; for example, fear and surprise are easily confused, which harms the model's classification ability. Moreover, the more advanced the network structure used, the easier it is to overfit, and the required training skills are demanding.
  • the purpose of the embodiments of the present application is to propose a method, apparatus, computer device, and storage medium for establishing an expression recognition model, so as to solve the problems of overfitting and low accuracy in expression recognition.
  • an embodiment of the present application provides a method for establishing an expression recognition model, which adopts the following technical solutions:
  • for each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
  • the EmtionNet is trained through the ternary loss function to obtain a trained EmtionNet.
  • an embodiment of the present application also provides a device for establishing an expression recognition model, which adopts the following technical solutions:
  • the training data acquisition module is used to acquire multiple first training image data and multiple second training image data
  • the residual neural network training module is used to train the residual neural network through the multiple pieces of first training image data and the multiple pieces of second training image data, to obtain the target residual neural network and the feature values corresponding to the outputs of the multiple first training images;
  • a reference image acquisition module configured to acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data according to the feature value;
  • the clustering module is used to randomly extract the same cluster center for each piece of the target image data, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
  • the extraction module is configured to randomly extract, for each pair of the first input image data of the target image data, at least one reference image corresponding to a different cluster center, to obtain the second input image data corresponding to the first input image data;
  • An input module configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data to EmtionNet;
  • the EmtionNet training module is used to train the EmtionNet through a ternary loss function to obtain a trained EmtionNet.
  • the embodiments of the present application also provide a computer device, which adopts the following technical solutions:
  • a computer device, comprising at least one processor, a memory, and an input/output unit connected to one another, wherein the memory is used to store computer-readable instructions, and the processor is used to call the computer-readable instructions in the memory to execute
  • the steps of establishing an expression recognition model method are as follows:
  • for each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
  • the EmtionNet is trained through the ternary loss function to obtain a trained EmtionNet.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • a computer-readable storage medium having computer-readable instructions stored thereon, and when the computer-readable instructions are executed by a processor, the steps of the method for establishing an expression recognition model as described below are realized:
  • for each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
  • the EmtionNet is trained through the ternary loss function to obtain a trained EmtionNet.
  • This application proposes a new reference-based expression recognition method that differs from previous classification training methods: it first trains a classification model on face recognition training data, and then fine-tunes that model on expression data; in this way, a classification model with good accuracy is trained.
  • This application uses the reference image as the base image and the expression as the comparison input, so that same-expression features and different-expression features can be compared. This overcomes the classification drift and errors caused by the subjectivity of the annotated data, and also avoids the training difficulty and accuracy loss caused by the random-base-image approach.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Figure 2 is a flowchart of an embodiment of the method for establishing an expression recognition model according to the present application;
  • Figure 3 is a schematic structural diagram of an embodiment of an apparatus for establishing an expression recognition model according to the present application;
  • Figure 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software, may be installed on the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, and 103 may be various electronic devices with display screens that support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for establishing an expression recognition model provided by the embodiments of the present application is generally executed by a server/terminal device. Accordingly, the apparatus for establishing an expression recognition model is generally set in the server/terminal device.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • the method for establishing an expression recognition model includes the following steps:
  • Step 201 Acquire multiple pieces of first training image data and multiple pieces of second training image data.
  • the electronic device (such as the server/terminal device shown in FIG. 1) on which the method for establishing an expression recognition model runs can receive user requests through a wired connection or a wireless connection.
  • the above-mentioned wireless connection methods can include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra-wideband) connections, and other wireless connection methods currently known or developed in the future.
  • the first training image data can use MS+VGGface data
  • the second training image can use seven types of expression data on the emotion network (EmotionNet).
  • VGGFace was published by the Visual Geometry Group of Oxford University in 2015.
  • VGGNet was also proposed by the same group.
  • face recognition based on VGGNet is used here.
  • a data set containing millions of images has appeared: EmotioNet.
  • methods such as deep learning can be used to estimate the intensity of expressions and the intensity of action units.
  • although the scale of this expression data set is very large, it is not entirely manually annotated but labeled in a semi-automatic way, so it may contain a lot of noise; how to make good use of such data is also worthy of attention.
  • Step 202 Train a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values corresponding to the outputs of the plurality of first training images.
  • an initial residual neural network (Residual Network, ResNet50) is trained on the first training image data and fine-tuned on the second training image data to obtain the target ResNet50; the logistic regression (SoftMax) layer of the target ResNet50 is then removed, and the plurality of first training image data are input into the target ResNet50 to obtain the corresponding output feature values of the plurality of first training images.
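  • The step above (train with a classification head, then strip the softmax head and read the penultimate features) can be sketched as follows. This is a schematic with a tiny stand-in network rather than the actual ResNet50; the layer sizes and input shape are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Schematic stand-in for the target ResNet50: a feature backbone followed by
# a 7-way classification head (softmax is applied inside the loss at train time).
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128),  # toy embedding size; ResNet50 outputs 2048
    nn.ReLU(),
)
classifier = nn.Linear(128, 7)  # 7 expression classes
model = nn.Sequential(backbone, classifier)

# After training and fine-tuning, drop the classification (softmax) head and
# keep only the backbone, so each image maps to a feature vector.
feature_extractor = model[0]

images = torch.randn(4, 3, 32, 32)  # a batch of 4 toy "face" images
with torch.no_grad():
    features = feature_extractor(images)
print(tuple(features.shape))  # (4, 128): one feature vector per image
```

The same idea applies to the real network: keep every layer except the final classification layer, and treat the remaining output as the image's feature value.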
  • Step 203 Acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data, according to the feature value.
  • the multiple pieces of target image data are feature values output by the target residual neural network, and the feature values are converted into image features used to describe the target image data.
  • the expression image is used as the reference expression image. In the end, 56 reference images are found, 8 reference expression images for each type of expression, denoted as A(i, j).
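  • One plausible way to obtain the 56 reference images is to keep, for each of the 7 clusters, the 8 images whose feature vectors lie closest to the cluster center. The selection rule, feature dimension, and toy data below are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 210 images with 16-dim feature vectors, assigned to
# 7 expression clusters (30 images per cluster, for determinism).
features = rng.normal(size=(210, 16))
labels = np.arange(210) % 7
centers = np.stack([features[labels == k].mean(axis=0) for k in range(7)])

# For each of the 7 clusters, keep the 8 images whose features lie closest
# to the cluster center; these act as the reference images A(i, j).
references = {}
for k in range(7):
    idx = np.where(labels == k)[0]
    dists = np.linalg.norm(features[idx] - centers[k], axis=1)
    references[k] = idx[np.argsort(dists)[:8]].tolist()

total = sum(len(v) for v in references.values())
print(total)  # 56 reference images in all, 8 per expression
```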
  • Step 204 For each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a set of paired first input image data corresponding to the cluster center.
  • during training, an expression image A(i, j) is randomly drawn from the reference set as the reference image; for example, if A(i, j) is a happy face, then A(i, j) corresponds to a positive expression in EmtionNet. Another image that belongs to the happy cluster center but is not that reference image is then found, and the two are input together as the first input image.
  • one cluster center corresponds to one expression
  • one expression has a set of paired first input images.
  • the paired first input image refers to two reference images of the same cluster center.
  • the same cluster center can also be randomly selected, and three or more pieces of the target image data of different reference images are used as the first input image data.
  • the paired first input images are multiple reference images of the same cluster center.
  • Step 205 For each paired first input image data of the target image data, randomly extract at least one reference image corresponding to a different cluster center, to obtain the second input image data corresponding to the first input image data.
  • an expression from another cluster center is randomly selected.
  • for example, anger is used as the negative-feedback expression
  • the corresponding unhappy expression in EmtionNet serves as the negative-feedback input.
  • the number of reference graphs corresponding to different cluster centers may be randomly selected as one, or two or more.
  • At least three reference images are randomly selected as input data and input to EmtionNet for training.
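  • The sampling procedure of steps 204-205 (a positive pair drawn from one cluster, plus a negative drawn from a different cluster) could be sketched as follows; the reference-set layout and image names are hypothetical:

```python
import random

random.seed(0)

# Hypothetical reference set: 7 expression clusters, 8 reference image
# identifiers per cluster (the names are made up for illustration).
refs = {c: [f"{c}_{j}" for j in range(8)] for c in range(7)}

def sample_triplet(refs):
    # First input: two distinct reference images from one randomly chosen
    # cluster (the anchor and the positive example).
    pos_cluster = random.choice(sorted(refs))
    anchor, positive = random.sample(refs[pos_cluster], 2)
    # Second input: one reference image from a different, randomly chosen
    # cluster (the negative example).
    neg_cluster = random.choice([c for c in sorted(refs) if c != pos_cluster])
    negative = random.choice(refs[neg_cluster])
    return anchor, positive, negative

anchor, positive, negative = sample_triplet(refs)
print(anchor, positive, negative)
```

Each sampled triplet is then fed to EmtionNet for ternary-loss training.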
  • Step 206 Input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into EmtionNet.
  • this information is input to the neural network for training.
  • Step 207 Train the EmtionNet through a ternary loss function to obtain a trained EmtionNet.
  • the reference images are a fixed set of 56 reference images, which solves the problems of training instability and sample contamination.
  • This application proposes a new reference-based facial expression recognition method that differs from previous classification training methods: a loss function is used to train a model on face recognition training data, and a linear regression function is then used to fine-tune on the facial expression data; in this way, a classification model with good accuracy is trained.
  • This model is used to perform 7-way clustering on the expression data, and the class radius is calculated from the clustering results to obtain 56 reference expression images, 8 for each expression; the reference images are used as the base images of the ternary loss function. Unlike previous ternary-loss training, which sets the base image randomly, this application uses the reference images as base images, overcoming the classification drift and errors caused by the subjectivity of the labeled data and avoiding the training difficulty and accuracy loss caused by the random-base-image approach.
  • the step of training the residual neural network through the plurality of first training image data and the plurality of second training image data, to obtain the target residual neural network and the feature values corresponding to the outputs of the plurality of first training images, specifically includes:
  • the logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
  • the network used in training includes a softmax layer; after the first input image is input, removing the softmax layer yields the feature value of each first input image. The feature value of every image can be obtained in this manner, so that feature values can be used to describe each image.
  • the step of training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network specifically includes:
  • the purpose of clustering is also to categorize data, but how to distinguish the categories is not known in advance; by judging the similarity between pieces of data, similar ones are grouped together.
  • Clustering is an unsupervised problem: the data carries no label values, so the algorithm must discover the regularities by itself and divide similar data into one category accordingly.
  • the K-Means algorithm is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Simply put, K-Means divides data into K parts without any supervision signal.
  • Clustering algorithms are the most common in unsupervised learning. Given a set of data, a clustering algorithm is needed to mine the hidden information in it. Through clustering, images with similar feature values can be grouped together, achieving the goal of a preliminary distinction.
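  • A minimal K-means implementation along these lines (assign each point to its nearest center, then recompute centers as cluster means) might look like this; it is a bare-bones sketch with toy 2-D data, not a production clustering routine:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Minimal K-means: repeatedly assign each point to its nearest center,
    # then move each center to the mean of the points assigned to it.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # avoid emptying a center
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy blobs; K-means with k=2 partitions them.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, centers = kmeans(x, k=2)
print(labels.shape, centers.shape)
```

In the method described here, the same procedure would run with k = 7 on the feature values output by the target residual neural network.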
  • the step of training the EmtionNet through a ternary loss function specifically includes:
  • the input image contains three images, one is the image of the basic cluster center, the other is the image of the same cluster center, and the last is the image of different cluster centers.
  • a is the image of the basic cluster center
  • p is the image of the same cluster center
  • n is the image of a different cluster center. The training objective is optimized so that the distance between a and p is shortened and the distance between a and n is lengthened.
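  • The ternary (triplet) loss described above can be written as L = max(d(a, p) - d(a, n) + margin, 0), where d is a squared distance. A minimal numpy sketch, with an illustrative margin value:

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    # Ternary (triplet) loss: pull anchor a toward positive p (same cluster),
    # push it away from negative n (different cluster), up to a margin.
    d_ap = float(np.sum((a - p) ** 2))
    d_an = float(np.sum((a - n) ** 2))
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])  # anchor: basic cluster-center image feature
p = np.array([0.1, 0.0])  # positive: same cluster, already close
n = np.array([1.0, 1.0])  # negative: different cluster, already far
print(triplet_loss(a, p, n))  # 0.0: the margin constraint is satisfied
```

Minimizing this loss over many triplets draws same-cluster features together while pushing different-cluster features apart.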
  • the method further includes:
  • if the recognition result matches the expression label, the recognition result corresponding to the test set image is recorded as correct;
  • the number of correct recognition results is counted, and the percentage of correct recognition results relative to the number of expression labels is calculated as the accuracy of the EmtionNet.
  • otherwise, the recognition result corresponding to the test set image is recorded as an error. Each test set image is marked with a corresponding expression label and a corresponding reference image. Taking happy as an example: a happy image is selected as the input image, then a different reference image plus the same happy image are selected as the first input image, and an unhappy image is selected as the second input image; these are input to the model for testing. If the result is happy, the recognition is correct; if not, the recognition is wrong. By recognizing all images of the test set, the accuracy of the model is preliminarily estimated.
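  • The accuracy computation described here (correct recognitions as a percentage of all labeled test images) reduces to a simple ratio; the predictions and labels below are made-up examples:

```python
def accuracy(predictions, labels):
    # Accuracy = correct recognitions as a percentage of all labeled images.
    correct = sum(1 for pred, lab in zip(predictions, labels) if pred == lab)
    return 100.0 * correct / len(labels)

# Made-up predictions vs. ground-truth expression labels for 5 test images.
preds = ["happy", "sad", "happy", "angry", "happy"]
truth = ["happy", "sad", "happy", "happy", "happy"]
print(accuracy(preds, truth))  # 80.0: four of the five images are correct
```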
  • after the number of correct recognition results is counted and the percentage of correct results relative to the number of expression labels is calculated as the accuracy of the EmtionNet, the method further includes:
  • the neural network parameters are adjusted and retrained to obtain new neuron weights to improve the accuracy of recognition.
  • the multiple pieces of first training image data and the multiple pieces of second training image data can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • this application provides an embodiment of a device for establishing an expression recognition model.
  • the device embodiment corresponds to the foregoing method embodiment, and specifically, the device can be applied to various electronic devices.
  • the apparatus 300 for establishing an expression recognition model in this embodiment includes: a training data acquisition module 301, a residual neural network training module 302, a reference image acquisition module 303, a clustering module 304, an extraction module 305, an input module 306, and an EmtionNet training module 307. Among them:
  • the training data acquisition module 301 is used to acquire multiple pieces of first training image data and multiple pieces of second training image data;
  • the residual neural network training module 302 is configured to train a residual neural network through the plurality of first training image data and the plurality of second training image data, to obtain a target residual neural network and the feature values corresponding to the outputs of the plurality of first training images;
  • the reference image acquisition module 303 is configured to acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data according to the feature value;
  • the clustering module 304 is configured to randomly extract the same cluster center for each piece of the target image data, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
  • the extraction module 305 is configured to randomly extract, for each pair of the first input image data of the target image data, at least one reference image corresponding to a different cluster center, to obtain the second input image data corresponding to the first input image data;
  • the input module 306 is configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data to EmtionNet;
  • the EmtionNet training module 307 is used to train the EmtionNet through a ternary loss function to obtain a trained EmtionNet.
  • the above-mentioned residual neural network training module is further used for:
  • the logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
  • the above-mentioned residual neural network training module is further used for:
  • the above-mentioned apparatus 300 further includes a clustering module configured to:
  • EmtionNet training module is further used for:
  • the above-mentioned apparatus 300 further includes: a test module for:
  • if the recognition result matches the expression label, the recognition result corresponding to the test set image is recorded as correct;
  • the number of correct recognition results is counted, and the percentage of correct recognition results relative to the number of expression labels is calculated as the accuracy of the EmtionNet.
  • the above-mentioned apparatus 300 further includes: a debugging module for:
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are connected to each other in communication via a system bus. It should be pointed out that the figure only shows the computer device 4 with components 41-43, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and so on.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 41 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and so on; the computer-readable storage medium can be non-volatile or volatile.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 4.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store an operating system and various application software installed in the computer device 4, such as computer-readable instructions for establishing an expression recognition model method.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • the processor 42 is configured to run computer-readable instructions or processed data stored in the memory 41, for example, run the computer-readable instructions of the method for establishing an expression recognition model.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the present application also provides another implementation, that is, a computer-readable storage medium having computer-readable instructions stored thereon, where the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the method for establishing an expression recognition model as described above.
  • the technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present application.


Abstract

Disclosed are a method and apparatus for establishing an expression recognition model, and a computer device and a storage medium, belonging to the field of artificial intelligence. The method comprises: acquiring data of a plurality of first training images and data of a plurality of second training images (201); according to a feature value, acquiring clustering centers corresponding to data of a plurality of target images, and reference maps corresponding to the data of the plurality of target images (203); randomly extracting data of two target images of different reference maps as first input image data to obtain data of a plurality of first input images corresponding to the clustering centers; randomly extracting second input images corresponding to different clustering centers to obtain data of a plurality of second input images; and inputting the first input images, the second input images and the clustering centers corresponding to the first input images into an EmtionNet. In addition, the method also relates to blockchain technology. The data of the first training images and the data of the second training images can be stored in a blockchain, thereby improving the expression recognition accuracy.

Description

Method, device, computer equipment and storage medium for establishing an expression recognition model
This application is based on, and claims priority to, the Chinese invention patent application No. 202010761705.0, filed on July 31, 2020 and entitled "Method, device, computer equipment and storage medium for establishing facial expression recognition model".
Technical Field
This application relates to the field of artificial intelligence, and in particular to a method, device, computer equipment, and storage medium for establishing an expression recognition model.
Background
Facial expression recognition is an important area of artificial intelligence with extremely broad application prospects in visual tasks. For example, in intelligent education, expression recognition can be used to analyze students' emotions in the classroom; on that basis, educators can assess classroom engagement and effectiveness, grasp both the overall situation and the state of individual students, and respond in time, which guides them to flexibly adjust teaching interactions and improve educational outcomes. Expression recognition is likewise applied in security, smart cities, online education, human-computer interaction, crime analysis, and other fields. In the 20th century, experts proposed seven basic expression categories through cross-cultural research: anger, fear, disgust, happiness, sadness, surprise, and neutral; current expression recognition methods are mostly based on deep learning. A typical pipeline comprises face detection, face alignment, face normalization, deep feature learning, and expression classification; the probabilities of the seven facial expressions are finally obtained through logistic regression (softmax), and the expression with the highest probability is taken as the current expression. However, the inventors realized that the accuracy is unsatisfactory. Network ensembles such as AdaBoost, which exploit the complementary diversity of multiple network models, bring obvious improvement, as does trying different training functions. On the data side, however, uncommon expression data is very difficult to obtain, and annotation is highly subjective (for example, fear and surprise are easily confused), which impairs the model's classification ability; moreover, the more advanced the network structure, the more easily it overfits, and the more demanding the training techniques become.
Summary
The purpose of the embodiments of the present application is to propose a method, device, computer equipment, and storage medium for establishing an expression recognition model, so as to solve the problems of overfitting and low accuracy in expression recognition.
In order to solve the above technical problems, an embodiment of the present application provides a method for establishing an expression recognition model, which adopts the following technical solution:
acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
acquiring, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
for each piece of target image data, randomly extracting at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
for the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
inputting the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
training the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In order to solve the above technical problems, an embodiment of the present application also provides an apparatus for establishing an expression recognition model, which adopts the following technical solution:
a training data acquisition module, configured to acquire a plurality of pieces of first training image data and a plurality of pieces of second training image data;
a residual neural network training module, configured to train a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
a reference image acquisition module, configured to acquire, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
a clustering module, configured to, for each piece of target image data, randomly extract at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
an extraction module, configured to, for the paired first input image data of each piece of target image data, randomly extract at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
an input module, configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
an EmtionNet training module, configured to train the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In order to solve the above technical problems, an embodiment of the present application also provides a computer device, which adopts the following technical solution:
a computer device, comprising at least one processor, a memory, and an input/output unit connected to one another, wherein the memory is configured to store computer-readable instructions, and the processor is configured to invoke the computer-readable instructions in the memory to execute the following steps of the method for establishing an expression recognition model:
acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
acquiring, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
for each piece of target image data, randomly extracting at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
for the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
inputting the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
training the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In order to solve the above technical problems, an embodiment of the present application also provides a computer-readable storage medium, which adopts the following technical solution:
a computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps of the method for establishing an expression recognition model:
acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
acquiring, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
for each piece of target image data, randomly extracting at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
for the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
inputting the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
training the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
This application proposes a new benchmark-based expression recognition method. Unlike previous classification training methods, a classification model is first trained on face recognition training data and then fine-tuned with the expression data, which yields a classification model with good accuracy. By using the reference images as base images and pairs of expressions as comparison inputs, the same and different expression features can be compared, which overcomes the classification drift and errors caused by the subjectivity of annotated data, and also avoids the training difficulty and accuracy degradation caused by random base-image methods.
Brief Description of the Drawings
In order to explain the solutions in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;
Fig. 2 is a flowchart of an embodiment of the method for establishing an expression recognition model according to the present application;
Fig. 3 is a schematic structural diagram of an embodiment of an apparatus for establishing an expression recognition model according to the present application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit the application. The terms "including" and "having" in the specification, claims, and drawing descriptions, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or drawings are used to distinguish different objects, not to describe a specific order.
Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages. Various communication client applications, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software, may be installed on the terminal devices 101, 102, and 103.
The terminal devices 101, 102, and 103 may be various electronic devices with a display screen that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example, a background server that supports the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the method for establishing an expression recognition model provided by the embodiments of the present application is generally executed by the server/terminal device; accordingly, the apparatus for establishing an expression recognition model is generally set in the server/terminal device.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to Fig. 2, a flowchart of an embodiment of the method for establishing an expression recognition model according to the present application is shown. The method includes the following steps.
Step 201: Acquire a plurality of pieces of first training image data and a plurality of pieces of second training image data.
In this embodiment, the electronic device on which the method for establishing an expression recognition model runs (for example, the server/terminal device shown in Fig. 1) can receive user requests through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, Wi-Fi, Bluetooth, WiMAX, ZigBee, UWB (ultra-wideband) connections, and other wireless connection methods currently known or developed in the future.
In this embodiment, the first training image data may use the MS+VGGface data, and the second training image data may use the seven categories of expression data from EmotioNet. VGGFace was published by the Visual Geometry Group of Oxford University in 2015; VGGNet was also proposed by that group, and face recognition based on VGGNet is commonly applied. In 2016, a data set containing millions of images, EmotioNet, appeared; on it, deep learning methods can be used to estimate expression intensity and action-unit intensity. It should be noted, however, that although this expression data set is very large, it was not entirely manually annotated but labeled in a semi-automatic way, so it may contain a lot of noise; how to make good use of such data is also worthy of attention.
Step 202: Train a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain the target residual neural network and the feature values output for the plurality of first training images.
In this embodiment, an initial residual neural network (Residual Network, ResNet50) is trained with the first training image data and then fine-tuned with the second training image data to obtain the target ResNet50; the logistic regression (softmax) layer of the target ResNet50 is removed, and the plurality of pieces of first training image data are input into the target ResNet50 to obtain the feature values output for the plurality of first training images.
Since EmotionNet and MS+VGGface are both million-scale image data sets, an accurate target residual neural network and accurate feature values for the first training images can be obtained.
Step 203: Acquire, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data.
In this embodiment, the plurality of pieces of target image data correspond to the feature values output by the target residual neural network, and the feature values are converted into image features that describe the target image data; the target image data may come from MS+VGGface or EmotionNet. K-means clustering with k=7 is performed to obtain 7 cluster centers. For each cluster center P_i, a non-overlapping radius is computed, denoted R_i (i=1,…,7); each R_i is divided into 8 parts, denoted R_i,j (j=1,…,8). For each cluster center P_i and radius R_i,j, one facial expression image is searched for in the EmotionNet data set as a reference expression image. In the end, 56 reference images are found, 8 reference expression images per expression category, denoted A_i,j.
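The clustering and reference-image selection described above can be sketched roughly as follows. This is a minimal illustration under assumed details: synthetic 16-d vectors stand in for the ResNet50 feature values, the cluster radius R_i is taken as the maximum member distance to the center, and the image nearest each radius fraction R_i,j is picked as that slot's reference, yielding 7 × 8 = 56 references.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(700, 16))     # stand-in for face-expression embeddings

# k=7 clustering, one cluster per expression category.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(features)
centers = kmeans.cluster_centers_

reference_ids = []                        # indices of the 56 reference images
for i in range(7):
    members = np.where(kmeans.labels_ == i)[0]
    dists = np.linalg.norm(features[members] - centers[i], axis=1)
    R_i = dists.max()                     # cluster radius R_i
    for j in range(1, 9):                 # 8 radii R_{i,j} = (j/8) * R_i
        target = R_i * j / 8.0
        # pick the member image whose distance is closest to this radius
        ref = members[np.argmin(np.abs(dists - target))]
        reference_ids.append(int(ref))

print(len(reference_ids))                 # 56
```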
Step 204: For each piece of target image data, randomly extract at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
In this embodiment, during training, one image A_i,j is randomly drawn from the reference expression set as the base image. For example, when the expression of A_i,j is "happy", an image in EmtionNet corresponding to A_i,j serves as the positive expression; then another image that belongs to the "happy" cluster center but not to the same reference image is found, and the two are input together as the first input images. For one piece of target image data, one cluster center corresponds to one expression, and one expression has a group of paired first input images; the paired first input images refer to two reference images of the same cluster center.
In other implementations of this application, for one piece of target image data, three or more pieces of target image data that belong to the same cluster center but to different reference images may also be randomly extracted as the first input image data; in that case the paired first input images are multiple reference images of the same cluster center.
Step 205: For the paired first input image data of each piece of target image data, randomly extract at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data.
In this embodiment, an expression of another cluster center is chosen at random. Continuing the example above, "angry" may serve as the contrasting expression, and a corresponding unhappy image in EmtionNet is used as the negative-feedback input.
In other implementations of this application, the number of randomly extracted reference images corresponding to different cluster centers may be one, or two or more.
Therefore, for each piece of target image data, at least three reference images are randomly extracted as input data and input into EmtionNet for training.
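Steps 204-205 amount to assembling triplets: two images from the same cluster center but different reference images (the paired first input), plus one image from a different cluster center (the second input). A minimal sketch with made-up bookkeeping follows; the `references` table and `A_i_j` ids are hypothetical placeholders, not identifiers from the patent.

```python
import random

random.seed(0)
# references[i][j] = image id of reference j (0..7) for expression cluster i (0..6)
references = {i: {j: f"A_{i}_{j}" for j in range(8)} for i in range(7)}

def sample_triplet(cluster_id):
    # paired first input: two different reference slots of the SAME cluster
    j1, j2 = random.sample(range(8), 2)
    anchor = references[cluster_id][j1]
    positive = references[cluster_id][j2]
    # second input: one reference image from a DIFFERENT cluster center
    other = random.choice([c for c in range(7) if c != cluster_id])
    negative = references[other][random.randrange(8)]
    return anchor, positive, negative

a, p, n = sample_triplet(cluster_id=3)
print(a, p, n)
```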
Step 206: Input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into EmtionNet.
In this embodiment, this information is input into the neural network for training.
Step 207: Train the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In this embodiment, unlike the usual triplet-loss training method, the base images are the 56 fixed reference images, which solves the problems of training instability and sample contamination. The triplet loss function is L = max(d(a,p) - d(a,n) + margin, 0), where d(a,p) is the distance between input images of the same cluster center, d(a,n) is the distance between input images of different cluster centers, and margin is a hyperparameter.
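The triplet loss can be checked numerically. This is a toy sketch: d is taken as Euclidean distance and the two-dimensional feature vectors are made up, but the formula L = max(d(a,p) - d(a,n) + margin, 0) is the one stated above.

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    d_ap = np.linalg.norm(a - p)   # distance to the positive (same cluster center)
    d_an = np.linalg.norm(a - n)   # distance to the negative (other cluster center)
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])           # anchor (base/reference image feature)
p = np.array([0.1, 0.0])           # close to the anchor: same expression
n = np.array([1.0, 0.0])           # far from the anchor: different expression
print(triplet_loss(a, p, n))       # 0.1 - 1.0 + 0.2 < 0  ->  0.0
```

When the positive is already closer than the negative by more than the margin, the loss is zero and the triplet contributes no gradient, which is why a well-chosen fixed set of base images matters for stable training.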
This application proposes a new benchmark-based expression recognition method. Unlike previous classification training methods, a model is first trained with a loss function on the face recognition training data and then fine-tuned on the expression data with a linear regression function; a classification model with good accuracy is trained in this way. This model is used to cluster the expression data into 7 classes, and the class radii are computed from the clustering results to obtain 56 reference expression images, 8 per expression; the reference images serve as the base images of the triplet loss function. Unlike previous triplet-loss training that sets the base images at random, this application uses the reference images as base images, overcoming the classification drift and errors caused by the subjectivity of annotated data and avoiding the training difficulty and accuracy degradation caused by random base-image methods.
在一些可选的实现方式中,所述通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值的步骤具体包括:In some optional implementations, the step of training a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values output for the plurality of first training images specifically includes:
通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络;Training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network;
获取第二训练图像数据,通过所述第二训练图像数据微调所述训练好的残差神经网络,得到目标残差神经网络;Acquiring second training image data, fine-tuning the trained residual neural network through the second training image data, to obtain a target residual neural network;
去除所述目标残差神经网络的逻辑回归层,将所述多张第一训练图像数据输入至所述目标残差神经网络,得到所述多张第一训练图像对应输出的特征值。The logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
上述实施方式中,使用人脸识别MS+VGGface数据,通过损失函数训练ResNet50,然后表情数据上EmotionNet表情数据进行迁移学习训练,训练包含了softmax层,当第一输入图像输入进去以后,通过去除softmax层可以得到每个第一输入图像的特征值,通过上述方式可以获得每个图像的特征值,从而可以用特征值去描述每张图像。In the above implementation, ResNet50 is trained with the loss function on the MS+VGGface face recognition data, and transfer learning is then performed on the EmotionNet expression data. The training includes a softmax layer; once a first input image has been fed in, the feature value of that image can be obtained by removing the softmax layer. In this way a feature value is obtained for every image, so that each image can be described by its feature value.
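A toy sketch of the "train with a classification head, then remove the softmax/logistic-regression layer to read out feature values" procedure. The two-layer network below merely stands in for ResNet50; all layer sizes and the random weights are illustrative assumptions:

```python
import numpy as np

class ToyBackboneWithHead:
    """Stand-in for the trained residual network: a feature backbone
    followed by a softmax classification head (7 expression classes)."""
    def __init__(self, in_dim=32, feat_dim=8, n_classes=7, seed=0):
        rng = np.random.default_rng(seed)
        self.W_backbone = rng.normal(size=(in_dim, feat_dim))
        self.W_head = rng.normal(size=(feat_dim, n_classes))

    def features(self, x):
        # the network with its softmax (logistic regression) layer removed:
        # this output is the feature value used to describe each image
        return np.maximum(x @ self.W_backbone, 0.0)

    def classify(self, x):
        # full network used during training, softmax layer included
        logits = self.features(x) @ self.W_head
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
```

During fine-tuning on the expression data both stages would be updated; afterwards only `features` is used to embed the first training images.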
在一些可选的实现方式中,所述通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络的步骤具体包括:In some optional implementation manners, the step of training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network specifically includes:
获取所述多张第一训练图像数据以及所述第一训练图像数据所对应的标注标签;Acquiring the multiple pieces of first training image data and the annotation labels corresponding to the first training image data;
将所述第一训练图像数据以及所述对应的标注标签输入至所述初始残差神经网络;Inputting the first training image data and the corresponding label to the initial residual neural network;
通过
Figure PCTCN2020122822-appb-000001
训练所述初始残差神经网络,得到训练好的残差神经网络,其中i,j为所述第一训练图像数据的图像标号,x为所述残差神经网络输出特征,W为神经元的权重,m为超参数,L为损失函数的值,s为固定值,
Figure PCTCN2020122822-appb-000002
为向量i以及向量j之间的夹角,X*为所述残差神经网络输出特征归一化前的值,W*为所述神经元的权重归一化前的值;
by
Figure PCTCN2020122822-appb-000001
Train the initial residual neural network to obtain a trained residual neural network, where i, j are the image labels of the first training image data, x is the output feature of the residual neural network, W is the weight of the neuron, m is a hyperparameter, L is the value of the loss function, and s is a fixed value,
Figure PCTCN2020122822-appb-000002
Is the angle between vector i and vector j, X* is the value before normalization of the residual neural network output feature, and W* is the value before normalization of the weight of the neuron;
将所述训练好的残差神经网络部署至客户端。Deploy the trained residual neural network to the client.
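The loss function itself appears above only as equation images (Figure PCTCN2020122822-appb-000001 and -000002), which this extraction cannot reproduce. The listed variables — normalized output feature x, normalized neuron weight W, the angle θ between them, the additive angle m, and the fixed scale s — match an additive angular margin softmax loss; under that assumption, a consistent reconstruction is:

```latex
L = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{e^{\,s\cos\left(\theta_{y_i,i}+m\right)}}
           {e^{\,s\cos\left(\theta_{y_i,i}+m\right)}
            + \sum_{j\neq y_i} e^{\,s\cos\theta_{j,i}}},
\qquad
\cos\theta_{j,i} = \frac{{W_j^{*}}^{\top} x_i^{*}}
                        {\lVert W_j^{*}\rVert\,\lVert x_i^{*}\rVert}
```

where y_i denotes the labeled class of the i-th first training image. This is a hedged reconstruction from the surrounding variable definitions, not the exact formula of the application.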
上述实施方式中,通过将公式中的m作为角度加上去了,这样就强行拉大了同类之间的角度,使得神经网络更努力地将同类收得更紧。对x和W进行归一化,计算得到预测向量
Figure PCTCN2020122822-appb-000003
从cos(θ j+i)中挑出对应正确的值,计算其反余弦得到角度,角度加上m,得到挑出从
Figure PCTCN2020122822-appb-000004
中挑出正确的值以及所在位置的独热码,将
Figure PCTCN2020122822-appb-000005
通过独热码放回原来的位置,对所有值乘上固定值s,通过上述方式可以训练EmotionNet神经网络,能得到一个较好的训练模型。
In the above implementation, m in the formula is added as an angle, which forcibly enlarges the apparent angle for the correct class and makes the neural network work harder to pull samples of the same class closer together. x and W are normalized, and the prediction vector
Figure PCTCN2020122822-appb-000003
is computed. The value corresponding to the correct class is picked out of cos(θ j,i), its arc cosine is computed to obtain the angle, and m is added to that angle; the correct value and the one-hot code of its position are then picked out from
Figure PCTCN2020122822-appb-000004
and
Figure PCTCN2020122822-appb-000005
is put back into its original position via the one-hot code. All values are then multiplied by the fixed value s. The EmotionNet neural network can be trained in this way, and a good training model can be obtained.
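The normalize–arccos–add-m–one-hot–scale procedure described above can be sketched in NumPy as follows. The margin m=0.5 and scale s=64.0 are illustrative defaults; the application only says m is a hyperparameter and s a fixed value:

```python
import numpy as np

def arcface_logits(x, W, y, m=0.5, s=64.0):
    """Additive angular margin logits, following the described steps:
    normalize x and W, take cosines, add the angle m only at the correct
    class (selected via a one-hot code), then scale by the fixed value s."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # normalize features
    W = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize weights
    cos = np.clip(x @ W, -1.0, 1.0)                   # cos(theta_{j,i})
    onehot = np.eye(W.shape[1])[y]                    # one-hot code of correct class
    theta = np.arccos(cos)                            # arc cosine gives the angle
    cos_m = np.cos(theta + m)                         # add m to the angle
    # put the margin-adjusted value back at the correct position only
    return s * (onehot * cos_m + (1.0 - onehot) * cos)

def arcface_loss(x, W, y, m=0.5, s=64.0):
    # softmax cross-entropy over the scaled logits
    logits = arcface_logits(x, W, y, m, s)
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()
```

With m=0 the margin step is a no-op and the logits reduce to s times the plain cosine similarities, which is a convenient sanity check.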
在一些可选的实现方式中,所述根据所述特征值,获取多张目标图像数据、所述多张目标图像数据对应的聚类中心、以及所述多张目标图像数据对应的基准图的步骤之前还包括:In some optional implementations, before the step of obtaining, according to the feature values, multiple pieces of target image data, the cluster centers corresponding to the multiple pieces of target image data, and the reference images corresponding to the multiple pieces of target image data, the method further includes:
通过k均值聚类算法聚类所述多张第一训练图像对应输出的特征值,得到7个聚类中心;Clustering, by a k-means clustering algorithm, the feature values output for the plurality of first training images to obtain 7 cluster centers;
预设第一预设值m;Preset a first preset value m;
通过k均值聚类算法为每个所述聚类中心聚类所述第一预设值个聚类中心,得到每个所述聚类中心对应的m个基准图。Clustering the first preset number of cluster centers for each cluster center through a k-means clustering algorithm, to obtain m reference maps corresponding to each cluster center.
上述实施方式中,聚类的目的也是把数据分类,但是事先是不知道如何去区分的,通过判断各条数据之间的相似性,相似的则放在一起,聚类属于无监督问题,给出的数据没有标签值,需要机器算法自行去探索其中的规律,根据该规律将相近的数据划分为一类。K均值聚类(K-Means)算法是最为经典的基于划分的聚簇方法,是十大经典数据挖掘算法之一。简单的说K-Means就是在没有任何监督信号的情况下将数据分为K份的一种方法。聚类算法就是无监督学习中最常见的一种,给定一组数据,需要聚类算法去挖掘数据中的隐含信息,通过聚类可以将特征值相似的图像放在一起,达到初步区分的目的。In the above implementation, the purpose of clustering is likewise to classify the data, but it is not known in advance how to distinguish the data. By judging the similarity between pieces of data, similar ones are grouped together. Clustering is an unsupervised problem: the given data has no label values, and the machine algorithm must explore the underlying regularity by itself and group similar data into one class accordingly. The K-means (K-Means) algorithm is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Simply put, K-Means is a method of dividing data into K parts without any supervision signal. Clustering is the most common form of unsupervised learning: given a set of data, a clustering algorithm is needed to mine the hidden information in it. Through clustering, images with similar feature values can be grouped together, achieving the goal of a preliminary distinction.
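A self-contained sketch of the two-stage clustering described here: 7 expression clusters over the feature values, then up to 8 sub-centers inside each cluster, yielding up to 56 reference images. The tiny k-means loop stands in for any standard implementation (e.g. scikit-learn's KMeans), and picking the sample nearest each sub-center as the "reference image" is an assumption for illustration:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # minimal Lloyd's algorithm: assign to nearest center, recompute means
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def reference_image_indices(features, n_expressions=7, per_cluster=8):
    # stage 1: 7 expression clusters over the feature values
    _, labels = kmeans(features, n_expressions)
    refs = []
    for j in range(n_expressions):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        # stage 2: up to 8 sub-centers inside each expression cluster
        sub_centers, _ = kmeans(features[idx], min(per_cluster, len(idx)), seed=j)
        for c in sub_centers:
            # keep the actual image nearest each sub-center as a reference image
            refs.append(int(idx[np.linalg.norm(features[idx] - c, axis=1).argmin()]))
    return refs
```

With enough samples per expression this returns 7 × 8 = 56 reference indices, matching the 56 reference images described above.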
在一些可选的实现方式中,所述通过三元损失函数训练所述EmtionNet的步骤具体包括:In some optional implementations, the step of training the EmtionNet through a triplet loss function specifically includes:
通过L=max(d(a,p)-d(a,n)+margin,0)训练所述EmtionNet,得到EmtionNet,其中d(a,p)为同一个聚类中心的输入图像,d(a,n)为不同聚类中心的输入图像,margin为超参数;Train the EmtionNet through L=max(d(a,p)-d(a,n)+margin,0) to obtain the trained EmtionNet, where d(a,p) is computed for an input image of the same cluster center, d(a,n) for an input image of a different cluster center, and margin is a hyperparameter;
将所述训练好的EmtionNet部署至客户端。Deploy the trained EmtionNet to the client.
上述实施方式中,通过上述方式,输入图像中包含三张图像,一张为基础聚类中心的图,另外一张为同一个聚类中心的图像,最后一张则为不同聚类中心的图像。a为基础聚类中心的图,p为同一个聚类中心的图像,n为不同聚类中心的图像。可以优化目标,使得a与p的距离拉近,a与n的距离拉远。In the above implementation, the input contains three images: one image of the base cluster center, one image of the same cluster center, and one image of a different cluster center. a is the image of the base cluster center, p is an image of the same cluster center, and n is an image of a different cluster center. The objective can be optimized so that the distance between a and p is shortened and the distance between a and n is lengthened.
在一些可选的实现方式中,所述通过三元损失函数训练所述EmtionNet的步骤之后还包括:In some optional implementations, after the step of training the EmtionNet through a triplet loss function, the method further includes:
获取多张测试集图像以及所述多张测试集图像对应的表情标签;Acquiring a plurality of test set images and expression tags corresponding to the plurality of test set images;
将所述多张测试集图像输入至所述训练好的EmtionNet,得到多个表情识别结果;Input the multiple test set images to the trained EmtionNet to obtain multiple expression recognition results;
若所述表情标签与对应所述表情识别结果相同,则将所述测试集图像对应的识别结果设为正确;If the expression tag is the same as the corresponding expression recognition result, the recognition result corresponding to the test set image is set to be correct;
统计正确识别结果的数量,并计算所述正确识别结果的数量与所述表情标签数量的百分比,作为所述EmtionNet的准确度。The number of correct recognition results is counted, and the percentage of the number of correct recognition results to the number of emoticon tags is calculated as the accuracy of the EmtionNet.
上述实施方式中,若所述表情标签与对应所述表情识别结果不同,则将所述测试集图像对应的识别结果设为错误;为每张测试集图像标注对应的表情标签,以及对应的基准图,以开心作为输入图像为例,则选取一张开心作为输入图像,再选取一张不同基准图,并且同为开心的图作为第一输入图像,然后选取一个非开心的图作为输入图像,输入到模型进行测试,若得到结果为开心,则是识别正确,若不是,则识别错误,通过识别所有测试集图像,初步估计模型的准确率。In the above implementation, if the expression label differs from the corresponding expression recognition result, the recognition result corresponding to that test set image is marked as wrong. Each test set image is annotated with its expression label and corresponding reference image. Taking "happy" as an example, a happy image is selected as the input image, another happy image under a different reference image is selected as the first input image, and a non-happy image is selected as a further input image; these are fed into the model for testing. If the result is "happy", the recognition is correct; otherwise it is wrong. By recognizing all test set images, the accuracy of the model is preliminarily estimated.
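The evaluation described above reduces to a label comparison; a minimal sketch (the label names are illustrative):

```python
def expression_accuracy(predicted, labeled):
    # a recognition result is correct when it equals the expression label;
    # accuracy is the percentage of correct results among all labels
    correct = sum(p == t for p, t in zip(predicted, labeled))
    return 100.0 * correct / len(labeled)
```

For example, `expression_accuracy(["happy", "sad", "happy"], ["happy", "sad", "angry"])` yields about 66.7; if this falls below the preset accuracy, the EmtionNet parameters are adjusted and the model retrained, as described below.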
所述统计正确识别结果的数量,并计算所述正确识别结果的数量与所述表情标签数量的百分比,作为所述EmtionNet的准确度之后还包括:The counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results to the number of emoticon tags, as the accuracy of the EmtionNet, further includes:
若所述EmtionNet的准确度低于预设精确度,则调整所述EmtionNet模型中的参数,重新训练。If the accuracy of the EmtionNet is lower than the preset accuracy, adjust the parameters in the EmtionNet model and retrain.
上述实施方式中,如果准确率过低,则调整神经网络参数,重新训练,得到新的神经元权值,提高识别的准确率。In the foregoing embodiment, if the accuracy rate is too low, the neural network parameters are adjusted and retrained to obtain new neuron weights to improve the accuracy of recognition.
需要强调的是,为进一步保证所述多张第一训练图像数据以及所述多张第二训练图像数据的私密和安全性,所述多张第一训练图像数据以及所述多张第二训练图像数据还可以存储于一区块链的节点中。It should be emphasized that, in order to further ensure the privacy and security of the multiple pieces of first training image data and the multiple pieces of second training image data, the multiple pieces of first training image data and the multiple pieces of second training image data may also be stored in nodes of a blockchain.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一计算机可读取存储介质中,该流程在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium. When the process is executed, it may include the processes of the above-mentioned method embodiments. Among them, the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the steps in the flowchart of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种建立表情识别模型装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。With further reference to FIG. 3, as an implementation of the method shown in FIG. 2, this application provides an embodiment of an apparatus for establishing an expression recognition model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can specifically be applied to various electronic devices.
如图3所示,本实施例所述的建立表情识别模型装置300包括:训练数据获取模块301、残差神经网络训练模块302、基准图获取模块303、聚类模块304、抽取模块305、输入模块306以及EmtionNet训练模块307。其中:As shown in FIG. 3, the apparatus 300 for establishing an expression recognition model in this embodiment includes: a training data acquisition module 301, a residual neural network training module 302, a reference image acquisition module 303, a clustering module 304, an extraction module 305, an input module 306, and an EmtionNet training module 307. Specifically:
训练数据获取模块301用于获取多张第一训练图像数据以及多张第二训练图像数据;The training data acquisition module 301 is used to acquire multiple pieces of first training image data and multiple pieces of second training image data;
残差神经网络训练模块302用于通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值;The residual neural network training module 302 is configured to train a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values output for the plurality of first training images;
基准图获取模块303用于根据所述特征值,获取多张目标图像数据、所述多张目标图像数据对应的聚类中心、以及所述多张目标图像数据对应的基准图;The reference image acquisition module 303 is configured to acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data according to the feature value;
聚类模块304用于为每一张所述目标图像数据,随机抽取同一所述聚类中心,并且不同基准图的至少两张所述目标图像数据作为第一输入图像数据,得到与所述聚类中心相对应的一组配对的第一输入图像数据;The clustering module 304 is configured to, for each piece of the target image data, randomly extract at least two pieces of the target image data belonging to the same cluster center but to different reference images as first input image data, obtaining a group of paired first input image data corresponding to the cluster center;
抽取模块305用于为每一张目标图像数据的配对的所述第一输入图像数据,随机抽取相对应不同所述聚类中心的至少一张基准图,得到与所述第一输入图像数据对应的第二输入图像数据;The extraction module 305 is configured to, for the paired first input image data of each piece of target image data, randomly extract at least one reference image corresponding to a different cluster center, obtaining second input image data corresponding to the first input image data;
输入模块306用于将所述第一输入图像数据、所述第二输入图像数据以及所述第一输入图像数据对应的聚类中心输入至EmtionNet;The input module 306 is configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data to EmtionNet;
EmtionNet训练模块307用于通过三元损失函数训练所述EmtionNet,得到训练好的EmtionNet。The EmtionNet training module 307 is configured to train the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
在本实施例的一些可选的实现方式中,上述残差神经网络训练模块进一步用于:In some optional implementation manners of this embodiment, the above-mentioned residual neural network training module is further used for:
通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络;Training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network;
获取第二训练图像数据,通过所述第二训练图像数据微调所述训练好的残差神经网络,得到目标残差神经网络;Acquiring second training image data, fine-tuning the trained residual neural network through the second training image data, to obtain a target residual neural network;
去除所述目标残差神经网络的逻辑回归层,将所述多张第一训练图像数据输入至所述目标残差神经网络,得到所述多张第一训练图像对应输出的特征值。The logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
在本实施例的一些可选的实现方式中,上述残差神经网络训练模块进一步用于:In some optional implementation manners of this embodiment, the above-mentioned residual neural network training module is further used for:
获取所述多张第一训练图像数据以及所述第一训练图像数据所对应的标注标签;Acquiring the multiple pieces of first training image data and the annotation labels corresponding to the first training image data;
将所述第一训练图像数据以及所述对应的标注标签输入至所述初始残差神经网络;Inputting the first training image data and the corresponding label to the initial residual neural network;
通过
Figure PCTCN2020122822-appb-000006
训练所述初始残差神经网络,得到训练好的残差神经网络,其中i,j为所述第一训练图像数据的图像标号,x为所述残差神经网络输出特征,W为神经元的权重,m为超参数,L为损失函数的值,s为固定值,
Figure PCTCN2020122822-appb-000007
为向量i以及向量j之间的夹角,X*为所述残差神经网络输出特征归一化前的值,W*为所述神经元的权重归一化前的值;
by
Figure PCTCN2020122822-appb-000006
Train the initial residual neural network to obtain a trained residual neural network, where i, j are the image labels of the first training image data, x is the output feature of the residual neural network, and W is the neuron’s Weight, m is a hyperparameter, L is the value of the loss function, s is a fixed value,
Figure PCTCN2020122822-appb-000007
Is the angle between vector i and vector j, X* is the value before normalization of the residual neural network output feature, and W* is the value before normalization of the weight of the neuron;
将所述训练好的残差神经网络部署至客户端。Deploy the trained residual neural network to the client.
在本实施例的一些可选的实现方式中,上述装置300还包括:聚类模块用于:In some optional implementation manners of this embodiment, the above-mentioned apparatus 300 further includes: a clustering module is configured to:
通过k均值聚类算法聚类所述多张第一训练图像对应输出的特征值,得到7个聚类中心;Clustering, by a k-means clustering algorithm, the feature values output for the plurality of first training images to obtain 7 cluster centers;
预设第一预设值m;Preset a first preset value m;
通过k均值聚类算法为每个所述聚类中心聚类所述第一预设值个聚类中心,得到每个所述聚类中心对应的m个基准图。Clustering the first preset number of cluster centers for each cluster center through a k-means clustering algorithm, to obtain m reference maps corresponding to each cluster center.
在本实施例的一些可选的实现方式中,上述EmtionNet训练模块进一步用于:In some optional implementations of this embodiment, the above-mentioned EmtionNet training module is further used for:
通过L=max(d(a,p)-d(a,n)+margin,0)训练所述EmtionNet,得到EmtionNet,其中d(a,p)为同一个聚类中心的输入图像,d(a,n)为不同聚类中心的输入图像,margin为超参数;Train the EmtionNet through L=max(d(a,p)-d(a,n)+margin,0) to obtain the trained EmtionNet, where d(a,p) is computed for an input image of the same cluster center, d(a,n) for an input image of a different cluster center, and margin is a hyperparameter;
将所述训练好的EmtionNet部署至客户端。Deploy the trained EmtionNet to the client.
在本实施例的一些可选的实现方式中,上述装置300还包括:测试模块用于:In some optional implementation manners of this embodiment, the above-mentioned apparatus 300 further includes: a test module for:
获取多张测试集图像以及所述多张测试集图像对应的表情标签;Acquiring a plurality of test set images and expression tags corresponding to the plurality of test set images;
将所述多张测试集图像输入至所述训练好的EmtionNet,得到多个表情识别结果;Input the multiple test set images to the trained EmtionNet to obtain multiple expression recognition results;
若所述表情标签与对应所述表情识别结果相同,则将所述测试集图像对应的识别结果设为正确;If the expression tag is the same as the corresponding expression recognition result, the recognition result corresponding to the test set image is set to be correct;
统计正确识别结果的数量,并计算所述正确识别结果的数量与所述表情标签数量的百分比,作为所述EmtionNet的准确度。The number of correct recognition results is counted, and the percentage of the number of correct recognition results to the number of emoticon tags is calculated as the accuracy of the EmtionNet.
在本实施例的一些可选的实现方式中,上述装置300还包括:调试模块用于:In some optional implementation manners of this embodiment, the above-mentioned apparatus 300 further includes: a debugging module for:
若所述EmtionNet的准确度低于预设精确度,则调整所述EmtionNet模型中的参数,重新训练。If the accuracy of the EmtionNet is lower than the preset accuracy, adjust the parameters in the EmtionNet model and retrain.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图4,图4为本实施例计算机设备基本结构框图。In order to solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 4 for details. FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
所述计算机设备4包括通过系统总线相互通信连接存储器41、处理器42、网络接口43。需要指出的是,图中仅示出了具有组件41-43的计算机设备4,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable GateArray,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are connected to each other in communication via a system bus. It should be pointed out that the figure only shows the computer device 4 with components 41-43, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. Its hardware includes, but is not limited to, a microprocessor, a dedicated Integrated Circuit (Application Specific Integrated Circuit, ASIC), Programmable Gate Array (Field-Programmable GateArray, FPGA), Digital Processor (Digital Signal Processor, DSP), embedded equipment, etc.
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
所述存储器41至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等,所述计算机可读存储介质可以是非易失性,也可以是易失性。在一些实施例中,所述存储器41可以是所述计算机设备4的内部存储单元,例如该计算机设备4的硬盘或内存。在另一些实施例中,所述存储器41也可以是所述计算机设备4的外部存储设备,例如该计算机设备4上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器41还可以既包括所述计算机设备4的内部存储单元也包括其外部存储设备。本实施例中,所述存储器41通常用于存储安装于所述计算机设备4的操作系统和各类应用软件,例如建立表情识别模型方法的计算机可读指令等。此外,所述存储器41还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 41 includes at least one type of readable storage medium, including a flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like; the computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions of the method for establishing an expression recognition model. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
所述处理器42在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器42通常用于控制所述计算机设备4的总体操作。本实施例中,所述处理器42用于运行所述存储器41中存储的计算机可读指令或者处理数据,例如运行所述建立表情识别模型方法的计算机可读指令。The processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run computer-readable instructions or processed data stored in the memory 41, for example, run the computer-readable instructions of the method for establishing an expression recognition model.
所述网络接口43可包括无线网络接口或有线网络接口,该网络接口43通常用于在所述计算机设备4与其他电子设备之间建立通信连接。The network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如上述的建立表情识别模型方法的步骤。The present application further provides another implementation, namely a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the method for establishing an expression recognition model described above.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of this application.
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员来而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。Obviously, the embodiments described above are only some rather than all of the embodiments of this application; the drawings show preferred embodiments of this application but do not limit its patent scope. This application may be implemented in many different forms; on the contrary, these embodiments are provided so that the understanding of the disclosure of this application will be thorough and comprehensive. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing specific implementations or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. 一种建立表情识别模型方法,包括下述步骤:A method for establishing an expression recognition model includes the following steps:
    获取多张第一训练图像数据以及多张第二训练图像数据;Acquiring multiple pieces of first training image data and multiple pieces of second training image data;
    通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值;Training a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values corresponding to the output of the plurality of first training images;
    根据所述特征值,获取多张目标图像数据、所述多张目标图像数据对应的聚类中心、以及所述多张目标图像数据对应的基准图;Acquiring, according to the feature value, multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data;
    为每一张所述目标图像数据,随机抽取同一所述聚类中心,并且不同基准图的至少两张所述目标图像数据作为第一输入图像数据,得到与所述聚类中心相对应的一组配对的第一输入图像数据;For each piece of the target image data, randomly extracting, as first input image data, at least two pieces of the target image data belonging to the same cluster center but to different reference images, to obtain a group of paired first input image data corresponding to the cluster center;
    为每一张目标图像数据的配对的所述第一输入图像数据,随机抽取相对应不同所述聚类中心的至少一张基准图,得到与所述第一输入图像数据对应的第二输入图像数据;For the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    将所述第一输入图像数据、所述第二输入图像数据以及所述第一输入图像数据对应的聚类中心输入至EmtionNet;Input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into EmtionNet;
    通过三元损失函数训练所述EmtionNet,得到训练好的EmtionNet。The EmtionNet is trained through a triplet loss function to obtain a trained EmtionNet.
  2. 根据权利要求1所述的建立表情识别模型方法,其中,所述通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值的步骤具体包括:The method for establishing an expression recognition model according to claim 1, wherein the step of training a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values output for the plurality of first training images specifically includes:
    通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络;获取第二训练图像数据,通过所述第二训练图像数据微调所述训练好的残差神经网络,得到目标残差神经网络;The initial residual neural network is trained through the plurality of first training image data to obtain a trained residual neural network; second training image data is acquired, and the trained residual neural network is fine-tuned through the second training image data to obtain the target residual neural network;
    去除所述目标残差神经网络的逻辑回归层,将所述多张第一训练图像数据输入至所述目标残差神经网络,得到所述多张第一训练图像对应输出的特征值。The logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
  3. 根据权利要求2所述的建立表情识别模型方法,其中,所述通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络的步骤具体包括:The method for establishing an expression recognition model according to claim 2, wherein the step of training an initial residual neural network through the plurality of first training image data to obtain a trained residual neural network specifically comprises:
    获取所述多张第一训练图像数据以及所述第一训练图像数据所对应的标注标签;Acquiring the multiple pieces of first training image data and the annotation labels corresponding to the first training image data;
    将所述第一训练图像数据以及所述对应的标注标签输入至所述初始残差神经网络;Inputting the first training image data and the corresponding label to the initial residual neural network;
    Training the initial residual neural network with the additive angular margin loss

    L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi,i}+m)} / ( e^{s·cos(θ_{yi,i}+m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

    to obtain a trained residual neural network, where N is the number of first training images, i and j are the image indices of the first training image data, yi is the annotation label of image i, x is the output feature of the residual neural network, W is the neuron weight, m is a hyperparameter, L is the value of the loss function, s is a fixed scale value, θ_{j,i} = arccos( (W_j*)ᵀ x_i* / (‖W_j*‖ ‖x_i*‖) ) is the angle between vector j and vector i, x* is the output feature of the residual neural network before normalization, and W* is the neuron weight before normalization;
    Deploying the trained residual neural network to the client.
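As an illustrative (non-claimed) sketch, the additive angular margin loss described in claim 3 can be written in NumPy as follows; the function name, array shapes, and the default values of s and m are assumptions, not taken from the claims:

```python
import numpy as np

def arcface_loss(x_raw, W_raw, labels, s=64.0, m=0.5):
    # Normalize features and class weights: x = x*/||x*||, W = W*/||W*||.
    x = x_raw / np.linalg.norm(x_raw, axis=1, keepdims=True)
    W = W_raw / np.linalg.norm(W_raw, axis=0, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)                # cos(theta_{j,i})
    idx = np.arange(len(labels))
    theta_y = np.arccos(cos[idx, labels])          # angle to the labeled class
    logits = s * cos
    logits[idx, labels] = s * np.cos(theta_y + m)  # additive angular margin m
    logits -= logits.max(axis=1, keepdims=True)    # softmax numerical stability
    e = np.exp(logits)
    p = e[idx, labels] / e.sum(axis=1)
    return float(-np.mean(np.log(p)))
```

With m = 0 this reduces to a plain scaled softmax cross-entropy; the margin makes the target-class logit harder to satisfy, which tightens intra-class angles.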
  4. The method for establishing an expression recognition model according to any one of claims 1-3, wherein before the step of acquiring, according to the feature values, the plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data, the method further comprises:
    Clustering the feature values output for the plurality of first training images with a k-means clustering algorithm to obtain 7 cluster centers;
    Presetting a first preset value m;
    Clustering, with the k-means clustering algorithm, the first preset number of sub-cluster centers for each of the cluster centers, to obtain the m reference images corresponding to each cluster center.
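The two-stage clustering of claim 4 (7 expression-level cluster centers, then m reference images per center) might look like the following NumPy sketch; the helper names and the minimal fixed-iteration k-means routine are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Plain k-means: pick k initial centers from the data, then
    # alternate assignment and centroid update for a fixed budget.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_reference_maps(features, m=5):
    # Stage 1: 7 expression-level cluster centers over all feature values.
    centers, labels = kmeans(features, 7)
    # Stage 2: m sub-centers per cluster; their centroids serve as
    # the reference images for that cluster center.
    refs = {}
    for c in range(7):
        members = features[labels == c]
        if len(members) == 0:          # guard against an empty cluster
            refs[c] = members
            continue
        refs[c] = kmeans(members, min(m, len(members)))[0]
    return centers, refs
```

In practice a library implementation (e.g. a k-means routine with proper initialization and convergence checks) would replace the toy `kmeans` above.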
  5. The method for establishing an expression recognition model according to claim 4, wherein the step of training the EmtionNet with a triplet loss function specifically comprises:
    Training the EmtionNet with L = max(d(a,p) - d(a,n) + margin, 0) to obtain the trained EmtionNet, where d(a,p) is the distance between input images from the same cluster center, d(a,n) is the distance between input images from different cluster centers, and margin is a hyperparameter;
    Deploying the trained EmtionNet to the client.
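A minimal sketch of the triplet loss L = max(d(a,p) - d(a,n) + margin, 0) used in claim 5, assuming Euclidean distance for d (the function name and default margin are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # d(a,p): anchor-positive distance (same cluster center);
    # d(a,n): anchor-negative distance (different cluster center).
    d_ap = np.linalg.norm(anchor - positive, axis=-1)
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())
```

The loss is zero once every negative is farther from its anchor than the positive by at least the margin, so training focuses on triplets that still violate that separation.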
  6. The method for establishing an expression recognition model according to claim 5, wherein after the step of training the EmtionNet with a triplet loss function, the method further comprises:
    Acquiring a plurality of test set images and the expression labels corresponding to the plurality of test set images;
    Inputting the plurality of test set images into the trained EmtionNet to obtain a plurality of expression recognition results;
    If an expression label is identical to the corresponding expression recognition result, setting the recognition result corresponding to that test set image as correct;
    Counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet.
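The accuracy measure of claim 6 reduces to the percentage of predictions that match their labels; a minimal sketch (function name illustrative):

```python
def recognition_accuracy(predictions, labels):
    # Percentage of test images whose predicted expression
    # matches the annotated expression label.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```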
  7. The method for establishing an expression recognition model according to claim 6, wherein after the step of counting the number of correct recognition results and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet, the method further comprises:
    If the accuracy of the EmtionNet is lower than a preset accuracy, adjusting the parameters of the EmtionNet model and retraining.
  8. An apparatus for establishing an expression recognition model, comprising:
    a training data acquisition module, configured to acquire a plurality of pieces of first training image data and a plurality of pieces of second training image data;
    a residual neural network training module, configured to train a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
    a reference image acquisition module, configured to acquire, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data;
    a clustering module, configured to, for each piece of the target image data, randomly select at least two pieces of target image data that correspond to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to that cluster center;
    an extraction module, configured to, for the paired first input image data of each piece of target image data, randomly select at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    an input module, configured to input the first input image data, the second input image data, and the cluster center corresponding to the first input image data into the EmtionNet;
    an EmtionNet training module, configured to train the EmtionNet with a triplet loss function to obtain a trained EmtionNet.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein, when the processor executes the computer-readable instructions, the following steps of the method for establishing an expression recognition model are implemented:
    Acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
    Training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
    Acquiring, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data;
    For each piece of the target image data, randomly selecting at least two pieces of target image data that correspond to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to that cluster center;
    For the paired first input image data of each piece of target image data, randomly selecting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    Inputting the first input image data, the second input image data, and the cluster center corresponding to the first input image data into the EmtionNet;
    Training the EmtionNet with a triplet loss function to obtain a trained EmtionNet.
  10. The computer device according to claim 9, wherein the step of training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images, specifically comprises:
    Training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network; acquiring the second training image data, and fine-tuning the trained residual neural network with the second training image data to obtain the target residual neural network;
    Removing the logistic regression layer of the target residual neural network, and inputting the plurality of pieces of first training image data into the target residual neural network to obtain the feature values output for the plurality of first training images.
  11. The computer device according to claim 10, wherein the step of training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network specifically comprises:
    Acquiring the plurality of pieces of first training image data and the annotation labels corresponding to the first training image data;
    Inputting the first training image data and the corresponding annotation labels into the initial residual neural network;
    Training the initial residual neural network with the additive angular margin loss

    L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi,i}+m)} / ( e^{s·cos(θ_{yi,i}+m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

    to obtain a trained residual neural network, where N is the number of first training images, i and j are the image indices of the first training image data, yi is the annotation label of image i, x is the output feature of the residual neural network, W is the neuron weight, m is a hyperparameter, L is the value of the loss function, s is a fixed scale value, θ_{j,i} = arccos( (W_j*)ᵀ x_i* / (‖W_j*‖ ‖x_i*‖) ) is the angle between vector j and vector i, x* is the output feature of the residual neural network before normalization, and W* is the neuron weight before normalization;
    Deploying the trained residual neural network to the client.
  12. The computer device according to any one of claims 9-11, wherein before the step of acquiring, according to the feature values, the plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data, the method further comprises:
    Clustering the feature values output for the plurality of first training images with a k-means clustering algorithm to obtain 7 cluster centers;
    Presetting a first preset value m;
    Clustering, with the k-means clustering algorithm, the first preset number of sub-cluster centers for each of the cluster centers, to obtain the m reference images corresponding to each cluster center.
  13. The computer device according to claim 12, wherein the step of training the EmtionNet with a triplet loss function specifically comprises:
    Training the EmtionNet with L = max(d(a,p) - d(a,n) + margin, 0) to obtain the trained EmtionNet, where d(a,p) is the distance between input images from the same cluster center, d(a,n) is the distance between input images from different cluster centers, and margin is a hyperparameter;
    Deploying the trained EmtionNet to the client.
  14. The computer device according to claim 13, wherein after the step of training the EmtionNet with a triplet loss function, the method further comprises:
    Acquiring a plurality of test set images and the expression labels corresponding to the plurality of test set images;
    Inputting the plurality of test set images into the trained EmtionNet to obtain a plurality of expression recognition results;
    If an expression label is identical to the corresponding expression recognition result, setting the recognition result corresponding to that test set image as correct;
    Counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet.
  15. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following steps of the method for establishing an expression recognition model are implemented:
    Acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
    Training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
    Acquiring, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data;
    For each piece of the target image data, randomly selecting at least two pieces of target image data that correspond to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to that cluster center;
    For the paired first input image data of each piece of target image data, randomly selecting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    Inputting the first input image data, the second input image data, and the cluster center corresponding to the first input image data into the EmtionNet;
    Training the EmtionNet with a triplet loss function to obtain a trained EmtionNet.
  16. The computer-readable storage medium according to claim 15, wherein the step of training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images, specifically comprises:
    Training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network;
    Acquiring the second training image data, and fine-tuning the trained residual neural network with the second training image data to obtain the target residual neural network;
    Removing the logistic regression layer of the target residual neural network, and inputting the plurality of pieces of first training image data into the target residual neural network to obtain the feature values output for the plurality of first training images.
  17. The computer-readable storage medium according to claim 16, wherein the step of training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network specifically comprises:
    Acquiring the plurality of pieces of first training image data and the annotation labels corresponding to the first training image data;
    Inputting the first training image data and the corresponding annotation labels into the initial residual neural network;
    Training the initial residual neural network with the additive angular margin loss

    L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi,i}+m)} / ( e^{s·cos(θ_{yi,i}+m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

    to obtain a trained residual neural network, where N is the number of first training images, i and j are the image indices of the first training image data, yi is the annotation label of image i, x is the output feature of the residual neural network, W is the neuron weight, m is a hyperparameter, L is the value of the loss function, s is a fixed scale value, θ_{j,i} = arccos( (W_j*)ᵀ x_i* / (‖W_j*‖ ‖x_i*‖) ) is the angle between vector j and vector i, x* is the output feature of the residual neural network before normalization, and W* is the neuron weight before normalization;
    Deploying the trained residual neural network to the client.
  18. The computer-readable storage medium according to any one of claims 15-17, wherein before the step of acquiring, according to the feature values, the plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data, the method further comprises:
    Clustering the feature values output for the plurality of first training images with a k-means clustering algorithm to obtain 7 cluster centers;
    Presetting a first preset value m;
    Clustering, with the k-means clustering algorithm, the first preset number of sub-cluster centers for each of the cluster centers, to obtain the m reference images corresponding to each cluster center.
  19. The computer-readable storage medium according to claim 18, wherein the step of training the EmtionNet with a triplet loss function specifically comprises:
    Training the EmtionNet with L = max(d(a,p) - d(a,n) + margin, 0) to obtain the trained EmtionNet, where d(a,p) is the distance between input images from the same cluster center, d(a,n) is the distance between input images from different cluster centers, and margin is a hyperparameter;
    Deploying the trained EmtionNet to the client.
  20. The computer-readable storage medium according to claim 19, wherein after the step of training the EmtionNet with a triplet loss function, the method further comprises:
    Acquiring a plurality of test set images and the expression labels corresponding to the plurality of test set images;
    Inputting the plurality of test set images into the trained EmtionNet to obtain a plurality of expression recognition results;
    If an expression label is identical to the corresponding expression recognition result, setting the recognition result corresponding to that test set image as correct;
    Counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet.
PCT/CN2020/122822 2020-07-31 2020-10-22 Method and apparatus for establishing expression recognition model, and computer device and storage medium WO2021139316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010761705.0A CN111898550B (en) 2020-07-31 2020-07-31 Expression recognition model building method and device, computer equipment and storage medium
CN202010761705.0 2020-07-31

Publications (1)

Publication Number Publication Date
WO2021139316A1 true WO2021139316A1 (en) 2021-07-15

Family

ID=73182963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122822 WO2021139316A1 (en) 2020-07-31 2020-10-22 Method and apparatus for establishing expression recognition model, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111898550B (en)
WO (1) WO2021139316A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240699A (en) * 2021-12-22 2022-03-25 长春嘉诚信息技术股份有限公司 Criminal reconstruction means recommendation method based on cycle sign correction
WO2023221713A1 (en) * 2022-05-16 2023-11-23 腾讯科技(深圳)有限公司 Image encoder training method and apparatus, device, and medium

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112631888A (en) * 2020-12-30 2021-04-09 航天信息股份有限公司 Fault prediction method and device of distributed system, storage medium and electronic equipment
CN113807265B (en) * 2021-09-18 2022-05-06 山东财经大学 Diversified human face image synthesis method and system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109002790A (en) * 2018-07-11 2018-12-14 广州视源电子科技股份有限公司 A kind of method, apparatus of recognition of face, equipment and storage medium
CN109658445A (en) * 2018-12-14 2019-04-19 北京旷视科技有限公司 Network training method, increment build drawing method, localization method, device and equipment
CN110555390A (en) * 2019-08-09 2019-12-10 厦门市美亚柏科信息股份有限公司 pedestrian re-identification method, device and medium based on semi-supervised training mode
CN111191587A (en) * 2019-12-30 2020-05-22 兰州交通大学 Pedestrian re-identification method and system

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN103559504B (en) * 2013-11-04 2016-08-31 北京京东尚科信息技术有限公司 Image target category identification method and device
CN108629414B (en) * 2018-05-09 2020-04-14 清华大学 Deep hash learning method and device
CN111310808B (en) * 2020-02-03 2024-03-22 平安科技(深圳)有限公司 Training method and device for picture recognition model, computer system and storage medium
CN111460923A (en) * 2020-03-16 2020-07-28 平安科技(深圳)有限公司 Micro-expression recognition method, device, equipment and storage medium

Non-Patent Citations (2)

Title
NGO QUAN T., YOON SEOKHOON: "Facial Expression Recognition Based on Weighted-Cluster Loss and Deep Transfer Learning Using a Highly Imbalanced Dataset", SENSORS, vol. 20, no. 9, 2639, 5 May 2020 (2020-05-05), pages 1-21, XP055826680 *
ZHANG JIANMING, LU CHAOQUAN, WANG JIN, YUE XIAO-GUANG, LIM SE-JUNG, AL-MAKHADMEH ZAFER, TOLBA AMR: "Training Convolutional Neural Networks with Multi-Size Images and Triplet Loss for Remote Sensing Scene Classification", SENSORS, vol. 20, no. 4, 1188, 21 February 2020 (2020-02-21), pages 1-21, XP055826677, DOI: 10.3390/s20041188 *

Also Published As

Publication number Publication date
CN111898550A (en) 2020-11-06
CN111898550B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
WO2021139316A1 (en) Method and apparatus for establishing expression recognition model, and computer device and storage medium
CN108416370B (en) Image classification method and device based on semi-supervised deep learning and storage medium
CN107680019B (en) Examination scheme implementation method, device, equipment and storage medium
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
WO2021218029A1 (en) Artificial intelligence-based interview method and apparatus, computer device, and storage medium
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN112668482B (en) Face recognition training method, device, computer equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN113127633A (en) Intelligent conference management method and device, computer equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN112417121A (en) Client intention recognition method and device, computer equipment and storage medium
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN111339290A (en) Text classification method and system
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
CN114637831A (en) Data query method based on semantic analysis and related equipment thereof
CN112446360A (en) Target behavior detection method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911325

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911325

Country of ref document: EP

Kind code of ref document: A1