WO2021139316A1 - Method and apparatus for establishing expression recognition model, and computer device and storage medium - Google Patents

Method and apparatus for establishing expression recognition model, and computer device and storage medium Download PDF

Info

Publication number
WO2021139316A1
WO2021139316A1 (PCT/CN2020/122822; CN2020122822W)
Authority
WO
WIPO (PCT)
Prior art keywords
image data
training
neural network
residual neural
emtionnet
Prior art date
Application number
PCT/CN2020/122822
Other languages
French (fr)
Chinese (zh)
Inventor
张展望 (Zhang Zhanwang)
田笑 (Tian Xiao)
周超勇 (Zhou Chaoyong)
刘玉宇 (Liu Yuyu)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021139316A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, device, computer equipment and storage medium for establishing an expression recognition model.
  • Facial expression recognition is an important field of artificial intelligence, with extremely broad application prospects in visual tasks.
  • in intelligent education, the emotions of students in the classroom are analyzed by applying expression recognition. On this basis, educators can gauge student enthusiasm and classroom effectiveness, respond in a timely manner to both the overall situation and the status of individual students, and flexibly adjust educational interaction and other methods to increase the conversion rate of educational outcomes. Expression recognition is also used in security, smart cities, online education, human-computer interaction, and crime analysis.
  • through cross-cultural research, experts have put forward seven types of basic expressions, namely anger, fear, disgust, happiness, sadness, surprise, and neutral, and have analyzed current deep learning-based expression recognition methods.
  • facial expression recognition requires face detection, face alignment, face normalization, deep feature learning, and facial expression classification.
  • the probabilities of the seven facial expressions are obtained through logistic regression (softmax), and the expression with the highest probability is taken as the current expression.
  • network ensembling such as AdaBoost can be used so that diverse network models complement one another, which brings an obvious improvement; different training functions can also be tried.
  • however, facial expression data is too difficult to obtain, and data annotators are highly subjective; for example, fear and surprise are easily confused, which harms the model's classification ability. Moreover, the more advanced the network structure used, the easier it is to overfit, and the required training skills are demanding.
  • the purpose of the embodiments of the present application is to propose a method, apparatus, computer device, and storage medium for establishing an expression recognition model, so as to solve the problems of overfitting and low accuracy in expression recognition.
  • an embodiment of the present application provides a method for establishing an expression recognition model, which adopts the following technical solutions:
  • for each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
  • the EmtionNet is trained through the ternary loss function to obtain a trained EmtionNet.
  • an embodiment of the present application also provides a device for establishing an expression recognition model, which adopts the following technical solutions:
  • the training data acquisition module is used to acquire multiple first training image data and multiple second training image data
  • the residual neural network training module is used to train the residual neural network through the multiple pieces of first training image data and the multiple pieces of second training image data, to obtain the target residual neural network and the feature values corresponding to the outputs of the multiple first training images;
  • a reference image acquisition module configured to acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data according to the feature value;
  • the clustering module is used to randomly extract the same cluster center for each piece of the target image data, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
  • the extraction module is configured to randomly extract, for each pair of the first input image data of the target image data, at least one reference image corresponding to a different cluster center, to obtain the second input image data corresponding to the first input image data;
  • An input module configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data to EmtionNet;
  • the EmtionNet training module is used to train the EmtionNet through a ternary loss function to obtain a trained EmtionNet.
  • the embodiments of the present application also provide a computer device, which adopts the following technical solutions:
  • a computer device, comprising at least one processor, a memory, and an input/output unit connected to one another, wherein the memory is used to store computer-readable instructions, and the processor is used to call the computer-readable instructions in the memory to execute
  • the steps of establishing an expression recognition model method are as follows:
  • for each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
  • the EmtionNet is trained through the ternary loss function to obtain a trained EmtionNet.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • a computer-readable storage medium having computer-readable instructions stored thereon, and when the computer-readable instructions are executed by a processor, the steps of the method for establishing an expression recognition model as described below are realized:
  • for each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
  • the EmtionNet is trained through the ternary loss function to obtain a trained EmtionNet.
  • This application proposes a new reference-based expression recognition method that differs from previous classification training methods: it first trains a classification model on face recognition training data, and then fine-tunes that model on expression data; in this way, a classification model with good accuracy is trained.
  • This application uses the reference image as the base image and the expression as the comparison input, so that same-expression features and different-expression features can be compared. This overcomes the classification drift and errors caused by the subjectivity of the annotated data, and also avoids the training difficulty and accuracy loss caused by the random-base-image approach.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Figure 2 is a flowchart of an embodiment of the method for establishing an expression recognition model according to the present application;
  • Figure 3 is a schematic structural diagram of an embodiment of an apparatus for establishing an expression recognition model according to the present application;
  • Figure 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software, may be installed on the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, and 103 may be various electronic devices with display screens that support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for establishing an expression recognition model provided by the embodiments of the present application is generally executed by a server/terminal device. Accordingly, the apparatus for establishing an expression recognition model is generally set in the server/terminal device.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • the method for establishing an expression recognition model includes the following steps:
  • Step 201 Acquire multiple pieces of first training image data and multiple pieces of second training image data.
  • the electronic device (such as the server/terminal device shown in FIG. 1) on which the method for establishing an expression recognition model runs can receive user requests through a wired connection or a wireless connection.
  • the above-mentioned wireless connection methods can include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra-wideband) connections, and other wireless connection methods currently known or developed in the future.
  • the first training image data can use MS+VGGface data
  • the second training image can use seven types of expression data on the emotion network (EmotionNet).
  • VGGFace was published by the Visual Geometry Group of Oxford University in 2015.
  • VGGNet was also proposed by the same group.
  • face recognition based on VGGNet is used here.
  • a data set containing millions of images has appeared: EmotioNet.
  • methods such as deep learning can be used to estimate the intensity of expressions and the intensity of action units.
  • although the scale of this expression data set is very large, it is not entirely manually annotated but labeled in a semi-automatic way, so it may contain a lot of noise; how to make good use of such data is also worthy of attention.
  • Step 202 Train a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values corresponding to the outputs of the plurality of first training images.
  • an initial residual neural network (Residual Network, ResNet50) is trained on the first training image data and fine-tuned on the second training image data to obtain the target ResNet50; the logistic regression (SoftMax) layer of the target ResNet50 is then removed, and the plurality of first training image data are input into the target ResNet50 to obtain the corresponding output feature values of the plurality of first training images.
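  • The step above (train with a classification head, then strip the softmax head and read the penultimate features) can be sketched as follows. This is a schematic with a tiny stand-in network rather than the actual ResNet50; the layer sizes and input shape are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Schematic stand-in for the target ResNet50: a feature backbone followed by
# a 7-way classification head (softmax is applied inside the loss at train time).
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128),  # toy embedding size; ResNet50 outputs 2048
    nn.ReLU(),
)
classifier = nn.Linear(128, 7)  # 7 expression classes
model = nn.Sequential(backbone, classifier)

# After training and fine-tuning, drop the classification (softmax) head and
# keep only the backbone, so each image maps to a feature vector.
feature_extractor = model[0]

images = torch.randn(4, 3, 32, 32)  # a batch of 4 toy "face" images
with torch.no_grad():
    features = feature_extractor(images)
print(tuple(features.shape))  # (4, 128): one feature vector per image
```

The same idea applies to the real network: keep every layer except the final classification layer, and treat the remaining output as the image's feature value.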
  • Step 203 Acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data, according to the feature value.
  • the multiple pieces of target image data are feature values output by the target residual neural network, and the feature values are converted into image features used to describe the target image data.
  • the expression image is used as the reference expression image. In the end, 56 reference images are found, 8 reference expression images for each type of expression, denoted as A(i, j).
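  • One plausible way to obtain the 56 reference images is to keep, for each of the 7 clusters, the 8 images whose feature vectors lie closest to the cluster center. The selection rule, feature dimension, and toy data below are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 210 images with 16-dim feature vectors, assigned to
# 7 expression clusters (30 images per cluster, for determinism).
features = rng.normal(size=(210, 16))
labels = np.arange(210) % 7
centers = np.stack([features[labels == k].mean(axis=0) for k in range(7)])

# For each of the 7 clusters, keep the 8 images whose features lie closest
# to the cluster center; these act as the reference images A(i, j).
references = {}
for k in range(7):
    idx = np.where(labels == k)[0]
    dists = np.linalg.norm(features[idx] - centers[k], axis=1)
    references[k] = idx[np.argsort(dists)[:8]].tolist()

total = sum(len(v) for v in references.values())
print(total)  # 56 reference images in all, 8 per expression
```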
  • Step 204 For each piece of the target image data, randomly extract the same cluster center, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a set of paired first input image data corresponding to the cluster center.
  • during training, an expression image A(i, j) is randomly drawn from the reference set as the reference image; for example, if A(i, j) is a happy face, then A(i, j) corresponds to a positive expression in EmtionNet. Another image that belongs to the happy cluster center but is not that reference image is then found, and the two are input together as the first input image.
  • one cluster center corresponds to one expression
  • one expression has a set of paired first input images.
  • the paired first input image refers to two reference images of the same cluster center.
  • the same cluster center can also be randomly selected, and three or more pieces of the target image data of different reference images are used as the first input image data.
  • the paired first input images are multiple reference images of the same cluster center.
  • Step 205 For each paired first input image data of the target image data, randomly extract at least one reference image corresponding to a different cluster center, to obtain the second input image data corresponding to the first input image data.
  • an expression from another cluster center is randomly selected.
  • for example, anger is used as the negative-feedback expression
  • the corresponding unhappy expression in EmtionNet serves as the negative-feedback input.
  • the number of reference graphs corresponding to different cluster centers may be randomly selected as one, or two or more.
  • At least three reference images are randomly selected as input data and input to EmtionNet for training.
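  • The sampling procedure of steps 204-205 (a positive pair drawn from one cluster, plus a negative drawn from a different cluster) could be sketched as follows; the reference-set layout and image names are hypothetical:

```python
import random

random.seed(0)

# Hypothetical reference set: 7 expression clusters, 8 reference image
# identifiers per cluster (the names are made up for illustration).
refs = {c: [f"{c}_{j}" for j in range(8)] for c in range(7)}

def sample_triplet(refs):
    # First input: two distinct reference images from one randomly chosen
    # cluster (the anchor and the positive example).
    pos_cluster = random.choice(sorted(refs))
    anchor, positive = random.sample(refs[pos_cluster], 2)
    # Second input: one reference image from a different, randomly chosen
    # cluster (the negative example).
    neg_cluster = random.choice([c for c in sorted(refs) if c != pos_cluster])
    negative = random.choice(refs[neg_cluster])
    return anchor, positive, negative

anchor, positive, negative = sample_triplet(refs)
print(anchor, positive, negative)
```

Each sampled triplet is then fed to EmtionNet for ternary-loss training.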
  • Step 206 Input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into EmtionNet.
  • this information is input to the neural network for training.
  • Step 207 Train the EmtionNet through a ternary loss function to obtain a trained EmtionNet.
  • the reference images are a fixed set of 56 reference images, which solves the problems of training instability and sample contamination.
  • This application proposes a new reference-based facial expression recognition method that differs from previous classification training methods: a loss function is used to train a model on face recognition training data, and a linear regression function is then used to fine-tune on the facial expression data; in this way, a classification model with good accuracy is trained.
  • This model is used to perform 7-way clustering on the expression data, and the class radius is calculated from the clustering results to obtain 56 reference expression images, 8 for each expression; the reference images are used as the base images of the ternary loss function. Unlike previous ternary-loss training, which sets the base image randomly, this application uses the reference images as base images, overcoming the classification drift and errors caused by the subjectivity of the labeled data and avoiding the training difficulty and accuracy loss caused by the random-base-image approach.
  • the step of training the residual neural network through the plurality of first training image data and the plurality of second training image data, to obtain the target residual neural network and the feature values corresponding to the outputs of the plurality of first training images, specifically includes:
  • the logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
  • the network used in training includes a softmax layer; after the first input image is input, removing the softmax layer yields the feature value of each first input image. The feature value of every image can be obtained in this manner, so that feature values can be used to describe each image.
  • the step of training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network specifically includes:
  • the purpose of clustering is also to categorize data, but how to distinguish the categories is not known in advance; by judging the similarity between pieces of data, similar ones are grouped together.
  • Clustering is an unsupervised problem: the data carries no label values, so the algorithm must discover the regularities by itself and divide similar data into one category accordingly.
  • the K-Means algorithm is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Simply put, K-Means divides data into K parts without any supervision signal.
  • Clustering algorithms are the most common in unsupervised learning. Given a set of data, a clustering algorithm is needed to mine the hidden information in it. Through clustering, images with similar feature values can be grouped together, achieving the goal of a preliminary distinction.
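  • A minimal K-means implementation along these lines (assign each point to its nearest center, then recompute centers as cluster means) might look like this; it is a bare-bones sketch with toy 2-D data, not a production clustering routine:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Minimal K-means: repeatedly assign each point to its nearest center,
    # then move each center to the mean of the points assigned to it.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # avoid emptying a center
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy blobs; K-means with k=2 partitions them.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, centers = kmeans(x, k=2)
print(labels.shape, centers.shape)
```

In the method described here, the same procedure would run with k = 7 on the feature values output by the target residual neural network.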
  • the step of training the EmtionNet through a ternary loss function specifically includes:
  • the input image contains three images, one is the image of the basic cluster center, the other is the image of the same cluster center, and the last is the image of different cluster centers.
  • a is the image of the basic cluster center
  • p is the image of the same cluster center
  • n is the image of a different cluster center. The training objective is optimized so that the distance between a and p is shortened and the distance between a and n is lengthened.
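  • The ternary (triplet) loss described above can be written as L = max(d(a, p) - d(a, n) + margin, 0), where d is a squared distance. A minimal numpy sketch, with an illustrative margin value:

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    # Ternary (triplet) loss: pull anchor a toward positive p (same cluster),
    # push it away from negative n (different cluster), up to a margin.
    d_ap = float(np.sum((a - p) ** 2))
    d_an = float(np.sum((a - n) ** 2))
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])  # anchor: basic cluster-center image feature
p = np.array([0.1, 0.0])  # positive: same cluster, already close
n = np.array([1.0, 1.0])  # negative: different cluster, already far
print(triplet_loss(a, p, n))  # 0.0: the margin constraint is satisfied
```

Minimizing this loss over many triplets draws same-cluster features together while pushing different-cluster features apart.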
  • the method further includes:
  • if the recognition result matches the expression label, the recognition result corresponding to the test set image is recorded as correct;
  • the number of correct recognition results is counted, and the percentage of correct recognition results relative to the number of expression labels is calculated as the accuracy of the EmtionNet.
  • otherwise, the recognition result corresponding to the test set image is recorded as an error. Each test set image is marked with a corresponding expression label and a corresponding reference image. Taking happy as an example: a happy image is selected as the input image, then a different reference image plus the same happy image are selected as the first input image, and an unhappy image is selected as the second input image; these are input to the model for testing. If the result is happy, the recognition is correct; if not, the recognition is wrong. By recognizing all images of the test set, the accuracy of the model is preliminarily estimated.
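  • The accuracy computation described here (correct recognitions as a percentage of all labeled test images) reduces to a simple ratio; the predictions and labels below are made-up examples:

```python
def accuracy(predictions, labels):
    # Accuracy = correct recognitions as a percentage of all labeled images.
    correct = sum(1 for pred, lab in zip(predictions, labels) if pred == lab)
    return 100.0 * correct / len(labels)

# Made-up predictions vs. ground-truth expression labels for 5 test images.
preds = ["happy", "sad", "happy", "angry", "happy"]
truth = ["happy", "sad", "happy", "happy", "happy"]
print(accuracy(preds, truth))  # 80.0: four of the five images are correct
```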
  • after the number of correct recognition results is counted and the percentage of correct results relative to the number of expression labels is calculated as the accuracy of the EmtionNet, the method further includes:
  • the neural network parameters are adjusted and retrained to obtain new neuron weights to improve the accuracy of recognition.
  • the multiple pieces of first training image data and the multiple pieces of second training image data can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • this application provides an embodiment of a device for establishing an expression recognition model.
  • the device embodiment corresponds to the foregoing method embodiment, and specifically, the device can be applied to various electronic devices.
  • the apparatus 300 for establishing an expression recognition model in this embodiment includes: a training data acquisition module 301, a residual neural network training module 302, a reference image acquisition module 303, a clustering module 304, an extraction module 305, an input module 306, and an EmtionNet training module 307. Among them:
  • the training data acquisition module 301 is used to acquire multiple pieces of first training image data and multiple pieces of second training image data;
  • the residual neural network training module 302 is configured to train a residual neural network through the plurality of first training image data and the plurality of second training image data, to obtain a target residual neural network and the feature values corresponding to the outputs of the plurality of first training images;
  • the reference image acquisition module 303 is configured to acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data according to the feature value;
  • the clustering module 304 is configured to randomly extract the same cluster center for each piece of the target image data, and use at least two pieces of the target image data with different reference images as the first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
  • the extraction module 305 is configured to randomly extract, for each pair of the first input image data of the target image data, at least one reference image corresponding to a different cluster center, to obtain the second input image data corresponding to the first input image data;
  • the input module 306 is configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data to EmtionNet;
  • the EmtionNet training module 307 is used to train the EmtionNet through a ternary loss function to obtain a trained EmtionNet.
  • the above-mentioned residual neural network training module is further used for:
  • the logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
  • the above-mentioned residual neural network training module is further used for:
  • the above-mentioned apparatus 300 further includes a clustering module configured to:
  • EmtionNet training module is further used for:
  • the above-mentioned apparatus 300 further includes: a test module for:
  • if the recognition result matches the expression label, the recognition result corresponding to the test set image is recorded as correct;
  • the number of correct recognition results is counted, and the percentage of correct recognition results relative to the number of expression labels is calculated as the accuracy of the EmtionNet.
  • the above-mentioned apparatus 300 further includes: a debugging module for:
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are connected to each other in communication via a system bus. It should be pointed out that the figure only shows the computer device 4 with components 41-43, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and so on.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 41 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and so on; the computer-readable storage medium can be non-volatile or volatile.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 4.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store an operating system and various application software installed in the computer device 4, such as computer-readable instructions for establishing an expression recognition model method.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • the processor 42 is configured to run computer-readable instructions or processed data stored in the memory 41, for example, run the computer-readable instructions of the method for establishing an expression recognition model.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the present application also provides another implementation, that is, a computer-readable storage medium having computer-readable instructions stored thereon, where the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the method for establishing an expression recognition model as described above.
  • the technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present application.


Abstract

Disclosed are a method and apparatus for establishing an expression recognition model, and a computer device and a storage medium, belonging to the field of artificial intelligence. The method comprises: acquiring data of a plurality of first training images and data of a plurality of second training images (201); according to a feature value, acquiring clustering centers corresponding to data of a plurality of target images, and reference maps corresponding to the data of the plurality of target images (203); randomly extracting data of two target images of different reference maps as first input image data to obtain data of a plurality of first input images corresponding to the clustering centers; randomly extracting second input images corresponding to different clustering centers to obtain data of a plurality of second input images; and inputting the first input images, the second input images and the clustering centers corresponding to the first input images into an EmtionNet. In addition, the method also relates to blockchain technology. The data of the first training images and the data of the second training images can be stored in a blockchain, thereby improving the expression recognition accuracy.

Description

Method, device, computer equipment and storage medium for establishing an expression recognition model
This application is based on, and claims priority to, the Chinese invention patent application No. 202010761705.0, filed on July 31, 2020 and entitled "Method, device, computer equipment and storage medium for establishing facial expression recognition model".
Technical Field
This application relates to the field of artificial intelligence, and in particular to a method, device, computer equipment, and storage medium for establishing an expression recognition model.
Background
Facial expression recognition is an important area of artificial intelligence with extremely broad application prospects in visual tasks. For example, in intelligent education, expression recognition can be used to analyze students' emotions in the classroom; on that basis, educators can assess classroom engagement and effectiveness, grasp both the overall situation and the state of individual students, and respond in time, which guides them to flexibly adjust teaching interactions and improve educational outcomes. Expression recognition is likewise applied in security, smart cities, online education, human-computer interaction, crime analysis, and other fields. In the 20th century, experts proposed seven basic expression categories through cross-cultural research: anger, fear, disgust, happiness, sadness, surprise, and neutral; current expression recognition methods are mostly based on deep learning. A typical pipeline comprises face detection, face alignment, face normalization, deep feature learning, and expression classification; the probabilities of the seven facial expressions are finally obtained through logistic regression (softmax), and the expression with the highest probability is taken as the current expression. However, the inventors realized that the accuracy is unsatisfactory. Network ensembles such as AdaBoost, which exploit the complementary diversity of multiple network models, bring obvious improvement, as does trying different training functions. On the data side, however, uncommon expression data is very difficult to obtain, and annotation is highly subjective (for example, fear and surprise are easily confused), which impairs the model's classification ability; moreover, the more advanced the network structure, the more easily it overfits, and the more demanding the training techniques become.
Summary
The purpose of the embodiments of the present application is to propose a method, device, computer equipment, and storage medium for establishing an expression recognition model, so as to solve the problems of overfitting and low accuracy in expression recognition.
In order to solve the above technical problems, an embodiment of the present application provides a method for establishing an expression recognition model, which adopts the following technical solution:
acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
acquiring, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
for each piece of target image data, randomly extracting at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
for the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
inputting the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
training the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In order to solve the above technical problems, an embodiment of the present application also provides an apparatus for establishing an expression recognition model, which adopts the following technical solution:
a training data acquisition module, configured to acquire a plurality of pieces of first training image data and a plurality of pieces of second training image data;
a residual neural network training module, configured to train a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
a reference image acquisition module, configured to acquire, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
a clustering module, configured to, for each piece of target image data, randomly extract at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
an extraction module, configured to, for the paired first input image data of each piece of target image data, randomly extract at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
an input module, configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
an EmtionNet training module, configured to train the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In order to solve the above technical problems, an embodiment of the present application also provides a computer device, which adopts the following technical solution:
a computer device, comprising at least one processor, a memory, and an input/output unit connected to one another, wherein the memory is configured to store computer-readable instructions, and the processor is configured to invoke the computer-readable instructions in the memory to execute the following steps of the method for establishing an expression recognition model:
acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
acquiring, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
for each piece of target image data, randomly extracting at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
for the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
inputting the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
training the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In order to solve the above technical problems, an embodiment of the present application also provides a computer-readable storage medium, which adopts the following technical solution:
a computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps of the method for establishing an expression recognition model:
acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
acquiring, according to the feature values, a plurality of pieces of target image data, cluster centers corresponding to the plurality of pieces of target image data, and reference images corresponding to the plurality of pieces of target image data;
for each piece of target image data, randomly extracting at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center;
for the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
inputting the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into the EmtionNet; and
training the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
This application proposes a new benchmark-based expression recognition method. Unlike previous classification training methods, a classification model is first trained on face recognition training data and then fine-tuned with the expression data, which yields a classification model with good accuracy. By using the reference images as base images and pairs of expressions as comparison inputs, the same and different expression features can be compared, which overcomes the classification drift and errors caused by the subjectivity of annotated data, and also avoids the training difficulty and accuracy degradation caused by random base-image methods.
Brief Description of the Drawings
In order to explain the solutions in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;
Fig. 2 is a flowchart of an embodiment of the method for establishing an expression recognition model according to the present application;
Fig. 3 is a schematic structural diagram of an embodiment of an apparatus for establishing an expression recognition model according to the present application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit the application. The terms "including" and "having" in the specification, claims, and drawing descriptions, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or drawings are used to distinguish different objects, not to describe a specific order.
Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages. Various communication client applications, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software, may be installed on the terminal devices 101, 102, and 103.
The terminal devices 101, 102, and 103 may be various electronic devices with a display screen that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example, a background server that supports the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the method for establishing an expression recognition model provided by the embodiments of the present application is generally executed by the server/terminal device; accordingly, the apparatus for establishing an expression recognition model is generally set in the server/terminal device.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to Fig. 2, a flowchart of an embodiment of the method for establishing an expression recognition model according to the present application is shown. The method includes the following steps.
Step 201: Acquire a plurality of pieces of first training image data and a plurality of pieces of second training image data.
In this embodiment, the electronic device on which the method for establishing an expression recognition model runs (for example, the server/terminal device shown in Fig. 1) can receive user requests through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, Wi-Fi, Bluetooth, WiMAX, ZigBee, UWB (ultra-wideband) connections, and other wireless connection methods currently known or developed in the future.
In this embodiment, the first training image data may use the MS+VGGface data, and the second training image data may use the seven categories of expression data from EmotioNet. VGGFace was published by the Visual Geometry Group of Oxford University in 2015; VGGNet was also proposed by that group, and face recognition based on VGGNet is commonly applied. In 2016, a data set containing millions of images, EmotioNet, appeared; on it, deep learning methods can be used to estimate expression intensity and action-unit intensity. It should be noted, however, that although this expression data set is very large, it was not entirely manually annotated but labeled in a semi-automatic way, so it may contain a lot of noise; how to make good use of such data is also worthy of attention.
Step 202: Train a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain the target residual neural network and the feature values output for the plurality of first training images.
In this embodiment, an initial residual neural network (Residual Network, ResNet50) is trained with the first training image data and then fine-tuned with the second training image data to obtain the target ResNet50; the logistic regression (softmax) layer of the target ResNet50 is removed, and the plurality of pieces of first training image data are input into the target ResNet50 to obtain the feature values output for the plurality of first training images.
Since EmotionNet and MS+VGGface are both million-scale image data sets, an accurate target residual neural network and accurate feature values for the first training images can be obtained.
Step 203: Acquire, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data.
In this embodiment, the plurality of pieces of target image data correspond to the feature values output by the target residual neural network, and the feature values are converted into image features that describe the target image data; the target image data may come from MS+VGGface or EmotionNet. K-means clustering with k=7 is performed to obtain 7 cluster centers. For each cluster center P_i, a non-overlapping radius is computed, denoted R_i (i=1,…,7); each R_i is divided into 8 parts, denoted R_i,j (j=1,…,8). For each cluster center P_i and radius R_i,j, one facial expression image is searched for in the EmotionNet data set as a reference expression image. In the end, 56 reference images are found, 8 reference expression images per expression category, denoted A_i,j.
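The clustering and reference-image selection described above can be sketched roughly as follows. This is a minimal illustration under assumed details: synthetic 16-d vectors stand in for the ResNet50 feature values, the cluster radius R_i is taken as the maximum member distance to the center, and the image nearest each radius fraction R_i,j is picked as that slot's reference, yielding 7 × 8 = 56 references.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(700, 16))     # stand-in for face-expression embeddings

# k=7 clustering, one cluster per expression category.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(features)
centers = kmeans.cluster_centers_

reference_ids = []                        # indices of the 56 reference images
for i in range(7):
    members = np.where(kmeans.labels_ == i)[0]
    dists = np.linalg.norm(features[members] - centers[i], axis=1)
    R_i = dists.max()                     # cluster radius R_i
    for j in range(1, 9):                 # 8 radii R_{i,j} = (j/8) * R_i
        target = R_i * j / 8.0
        # pick the member image whose distance is closest to this radius
        ref = members[np.argmin(np.abs(dists - target))]
        reference_ids.append(int(ref))

print(len(reference_ids))                 # 56
```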
Step 204: For each piece of target image data, randomly extract at least two pieces of target image data that belong to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to the cluster center.
In this embodiment, during training, one image A_i,j is randomly drawn from the reference expression set as the base image. For example, when the expression of A_i,j is "happy", an image in EmtionNet corresponding to A_i,j serves as the positive expression; then another image that belongs to the "happy" cluster center but not to the same reference image is found, and the two are input together as the first input images. For one piece of target image data, one cluster center corresponds to one expression, and one expression has a group of paired first input images; the paired first input images refer to two reference images of the same cluster center.
In other implementations of this application, for one piece of target image data, three or more pieces of target image data that belong to the same cluster center but to different reference images may also be randomly extracted as the first input image data; in that case the paired first input images are multiple reference images of the same cluster center.
Step 205: For the paired first input image data of each piece of target image data, randomly extract at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data.
In this embodiment, an expression of another cluster center is chosen at random. Continuing the example above, "angry" may serve as the contrasting expression, and a corresponding unhappy image in EmtionNet is used as the negative-feedback input.
In other implementations of this application, the number of randomly extracted reference images corresponding to different cluster centers may be one, or two or more.
Therefore, for each piece of target image data, at least three reference images are randomly extracted as input data and input into EmtionNet for training.
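Steps 204-205 amount to assembling triplets: two images from the same cluster center but different reference images (the paired first input), plus one image from a different cluster center (the second input). A minimal sketch with made-up bookkeeping follows; the `references` table and `A_i_j` ids are hypothetical placeholders, not identifiers from the patent.

```python
import random

random.seed(0)
# references[i][j] = image id of reference j (0..7) for expression cluster i (0..6)
references = {i: {j: f"A_{i}_{j}" for j in range(8)} for i in range(7)}

def sample_triplet(cluster_id):
    # paired first input: two different reference slots of the SAME cluster
    j1, j2 = random.sample(range(8), 2)
    anchor = references[cluster_id][j1]
    positive = references[cluster_id][j2]
    # second input: one reference image from a DIFFERENT cluster center
    other = random.choice([c for c in range(7) if c != cluster_id])
    negative = references[other][random.randrange(8)]
    return anchor, positive, negative

a, p, n = sample_triplet(cluster_id=3)
print(a, p, n)
```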
Step 206: Input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into EmtionNet.
In this embodiment, this information is input into the neural network for training.
Step 207: Train the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
In this embodiment, unlike the usual triplet-loss training method, the base images are the 56 fixed reference images, which solves the problems of training instability and sample contamination. The triplet loss function is L = max(d(a,p) - d(a,n) + margin, 0), where d(a,p) is the distance between input images of the same cluster center, d(a,n) is the distance between input images of different cluster centers, and margin is a hyperparameter.
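The triplet loss can be checked numerically. This is a toy sketch: d is taken as Euclidean distance and the two-dimensional feature vectors are made up, but the formula L = max(d(a,p) - d(a,n) + margin, 0) is the one stated above.

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    d_ap = np.linalg.norm(a - p)   # distance to the positive (same cluster center)
    d_an = np.linalg.norm(a - n)   # distance to the negative (other cluster center)
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])           # anchor (base/reference image feature)
p = np.array([0.1, 0.0])           # close to the anchor: same expression
n = np.array([1.0, 0.0])           # far from the anchor: different expression
print(triplet_loss(a, p, n))       # 0.1 - 1.0 + 0.2 < 0  ->  0.0
```

When the positive is already closer than the negative by more than the margin, the loss is zero and the triplet contributes no gradient, which is why a well-chosen fixed set of base images matters for stable training.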
This application proposes a new benchmark-based expression recognition method. Unlike previous classification training methods, a model is first trained with a loss function on the face recognition training data and then fine-tuned on the expression data with a linear regression function; a classification model with good accuracy is trained in this way. This model is used to cluster the expression data into 7 classes, and the class radii are computed from the clustering results to obtain 56 reference expression images, 8 per expression; the reference images serve as the base images of the triplet loss function. Unlike previous triplet-loss training that sets the base images at random, this application uses the reference images as base images, overcoming the classification drift and errors caused by the subjectivity of annotated data and avoiding the training difficulty and accuracy degradation caused by random base-image methods.
在一些可选的实现方式中,所述通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值的步骤具体包括:In some optional implementations, the step of training a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values output for the plurality of first training images specifically includes:
通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络;Training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network;
获取第二训练图像数据,通过所述第二训练图像数据微调所述训练好的残差神经网络,得到目标残差神经网络;Acquiring second training image data, fine-tuning the trained residual neural network through the second training image data, to obtain a target residual neural network;
去除所述目标残差神经网络的逻辑回归层,将所述多张第一训练图像数据输入至所述目标残差神经网络,得到所述多张第一训练图像对应输出的特征值。The logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
上述实施方式中,使用人脸识别MS+VGGface数据,通过损失函数训练ResNet50,然后表情数据上EmotionNet表情数据进行迁移学习训练,训练包含了softmax层,当第一输入图像输入进去以后,通过去除softmax层可以得到每个第一输入图像的特征值,通过上述方式可以获得每个图像的特征值,从而可以用特征值去描述每张图像。In the above implementation, ResNet50 is trained with the loss function on the MS+VGGface face recognition data, and transfer learning is then performed on the EmotionNet expression data. The training includes a softmax layer; once a first input image has been fed in, the feature value of that image can be obtained by removing the softmax layer. In this way a feature value is obtained for every image, so that each image can be described by its feature value.
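A toy sketch of the "train with a classification head, then remove the softmax/logistic-regression layer to read out feature values" procedure. The two-layer network below merely stands in for ResNet50; all layer sizes and the random weights are illustrative assumptions:

```python
import numpy as np

class ToyBackboneWithHead:
    """Stand-in for the trained residual network: a feature backbone
    followed by a softmax classification head (7 expression classes)."""
    def __init__(self, in_dim=32, feat_dim=8, n_classes=7, seed=0):
        rng = np.random.default_rng(seed)
        self.W_backbone = rng.normal(size=(in_dim, feat_dim))
        self.W_head = rng.normal(size=(feat_dim, n_classes))

    def features(self, x):
        # the network with its softmax (logistic regression) layer removed:
        # this output is the feature value used to describe each image
        return np.maximum(x @ self.W_backbone, 0.0)

    def classify(self, x):
        # full network used during training, softmax layer included
        logits = self.features(x) @ self.W_head
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
```

During fine-tuning on the expression data both stages would be updated; afterwards only `features` is used to embed the first training images.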
在一些可选的实现方式中,所述通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络的步骤具体包括:In some optional implementation manners, the step of training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network specifically includes:
获取所述多张第一训练图像数据以及所述第一训练图像数据所对应的标注标签;Acquiring the multiple pieces of first training image data and the annotation labels corresponding to the first training image data;
将所述第一训练图像数据以及所述对应的标注标签输入至所述初始残差神经网络;Inputting the first training image data and the corresponding label to the initial residual neural network;
通过
Figure PCTCN2020122822-appb-000001
训练所述初始残差神经网络,得到训练好的残差神经网络,其中i,j为所述第一训练图像数据的图像标号,x为所述残差神经网络输出特征,W为神经元的权重,m为超参数,L为损失函数的值,s为固定值,
Figure PCTCN2020122822-appb-000002
为向量i以及向量j之间的夹角,X*为所述残差神经网络输出特征归一化前的值,W*为所述神经元的权重归一化前的值;
by
Figure PCTCN2020122822-appb-000001
Train the initial residual neural network to obtain a trained residual neural network, where i, j are the image labels of the first training image data, x is the output feature of the residual neural network, W is the weight of the neuron, m is a hyperparameter, L is the value of the loss function, and s is a fixed value,
Figure PCTCN2020122822-appb-000002
Is the angle between vector i and vector j, X* is the value before normalization of the residual neural network output feature, and W* is the value before normalization of the weight of the neuron;
将所述训练好的残差神经网络部署至客户端。Deploy the trained residual neural network to the client.
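The loss function itself appears above only as equation images (Figure PCTCN2020122822-appb-000001 and -000002), which this extraction cannot reproduce. The listed variables — normalized output feature x, normalized neuron weight W, the angle θ between them, the additive angle m, and the fixed scale s — match an additive angular margin softmax loss; under that assumption, a consistent reconstruction is:

```latex
L = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{e^{\,s\cos\left(\theta_{y_i,i}+m\right)}}
           {e^{\,s\cos\left(\theta_{y_i,i}+m\right)}
            + \sum_{j\neq y_i} e^{\,s\cos\theta_{j,i}}},
\qquad
\cos\theta_{j,i} = \frac{{W_j^{*}}^{\top} x_i^{*}}
                        {\lVert W_j^{*}\rVert\,\lVert x_i^{*}\rVert}
```

where y_i denotes the labeled class of the i-th first training image. This is a hedged reconstruction from the surrounding variable definitions, not the exact formula of the application.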
上述实施方式中,通过将公式中的m作为角度加上去了,这样就强行拉大了同类之间的角度,使得神经网络更努力地将同类收得更紧。对x和W进行归一化,计算得到预测向量
Figure PCTCN2020122822-appb-000003
从cos(θ j+i)中挑出对应正确的值,计算其反余弦得到角度,角度加上m,得到挑出从
Figure PCTCN2020122822-appb-000004
中挑出正确的值以及所在位置的独热码,将
Figure PCTCN2020122822-appb-000005
通过独热码放回原来的位置,对所有值乘上固定值s,通过上述方式可以训练EmotionNet神经网络,能得到一个较好的训练模型。
In the above implementation, m in the formula is added as an angle, which forcibly enlarges the apparent angle for the correct class and makes the neural network work harder to pull samples of the same class closer together. x and W are normalized, and the prediction vector
Figure PCTCN2020122822-appb-000003
is computed. The value corresponding to the correct class is picked out of cos(θ j,i), its arc cosine is computed to obtain the angle, and m is added to that angle; the correct value and the one-hot code of its position are then picked out from
Figure PCTCN2020122822-appb-000004
and
Figure PCTCN2020122822-appb-000005
is put back into its original position via the one-hot code. All values are then multiplied by the fixed value s. The EmotionNet neural network can be trained in this way, and a good training model can be obtained.
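The normalize–arccos–add-m–one-hot–scale procedure described above can be sketched in NumPy as follows. The margin m=0.5 and scale s=64.0 are illustrative defaults; the application only says m is a hyperparameter and s a fixed value:

```python
import numpy as np

def arcface_logits(x, W, y, m=0.5, s=64.0):
    """Additive angular margin logits, following the described steps:
    normalize x and W, take cosines, add the angle m only at the correct
    class (selected via a one-hot code), then scale by the fixed value s."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # normalize features
    W = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize weights
    cos = np.clip(x @ W, -1.0, 1.0)                   # cos(theta_{j,i})
    onehot = np.eye(W.shape[1])[y]                    # one-hot code of correct class
    theta = np.arccos(cos)                            # arc cosine gives the angle
    cos_m = np.cos(theta + m)                         # add m to the angle
    # put the margin-adjusted value back at the correct position only
    return s * (onehot * cos_m + (1.0 - onehot) * cos)

def arcface_loss(x, W, y, m=0.5, s=64.0):
    # softmax cross-entropy over the scaled logits
    logits = arcface_logits(x, W, y, m, s)
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()
```

With m=0 the margin step is a no-op and the logits reduce to s times the plain cosine similarities, which is a convenient sanity check.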
在一些可选的实现方式中,所述根据所述特征值,获取多张目标图像数据、所述多张目标图像数据对应的聚类中心、以及所述多张目标图像数据对应的基准图的步骤之前还包括:In some optional implementations, before the step of obtaining, according to the feature values, multiple pieces of target image data, the cluster centers corresponding to the multiple pieces of target image data, and the reference images corresponding to the multiple pieces of target image data, the method further includes:
通过k均值聚类算法聚类所述多张第一训练图像对应输出的特征值,得到7个聚类中心;Clustering, by a k-means clustering algorithm, the feature values output for the plurality of first training images to obtain 7 cluster centers;
预设第一预设值m;Preset a first preset value m;
通过k均值聚类算法为每个所述聚类中心聚类所述第一预设值个聚类中心,得到每个所述聚类中心对应的m个基准图。Clustering the first preset number of cluster centers for each cluster center through a k-means clustering algorithm, to obtain m reference maps corresponding to each cluster center.
上述实施方式中,聚类的目的也是把数据分类,但是事先是不知道如何去区分的,通过判断各条数据之间的相似性,相似的则放在一起,聚类属于无监督问题,给出的数据没有标签值,需要机器算法自行去探索其中的规律,根据该规律将相近的数据划分为一类。K均值聚类(K-Means)算法是最为经典的基于划分的聚簇方法,是十大经典数据挖掘算法之一。简单的说K-Means就是在没有任何监督信号的情况下将数据分为K份的一种方法。聚类算法就是无监督学习中最常见的一种,给定一组数据,需要聚类算法去挖掘数据中的隐含信息,通过聚类可以将特征值相似的图像放在一起,达到初步区分的目的。In the above implementation, the purpose of clustering is likewise to classify the data, but it is not known in advance how to distinguish the data. By judging the similarity between pieces of data, similar ones are grouped together. Clustering is an unsupervised problem: the given data has no label values, and the machine algorithm must explore the underlying regularity by itself and group similar data into one class accordingly. The K-means (K-Means) algorithm is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Simply put, K-Means is a method of dividing data into K parts without any supervision signal. Clustering is the most common form of unsupervised learning: given a set of data, a clustering algorithm is needed to mine the hidden information in it. Through clustering, images with similar feature values can be grouped together, achieving the goal of a preliminary distinction.
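A self-contained sketch of the two-stage clustering described here: 7 expression clusters over the feature values, then up to 8 sub-centers inside each cluster, yielding up to 56 reference images. The tiny k-means loop stands in for any standard implementation (e.g. scikit-learn's KMeans), and picking the sample nearest each sub-center as the "reference image" is an assumption for illustration:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # minimal Lloyd's algorithm: assign to nearest center, recompute means
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def reference_image_indices(features, n_expressions=7, per_cluster=8):
    # stage 1: 7 expression clusters over the feature values
    _, labels = kmeans(features, n_expressions)
    refs = []
    for j in range(n_expressions):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        # stage 2: up to 8 sub-centers inside each expression cluster
        sub_centers, _ = kmeans(features[idx], min(per_cluster, len(idx)), seed=j)
        for c in sub_centers:
            # keep the actual image nearest each sub-center as a reference image
            refs.append(int(idx[np.linalg.norm(features[idx] - c, axis=1).argmin()]))
    return refs
```

With enough samples per expression this returns 7 × 8 = 56 reference indices, matching the 56 reference images described above.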
在一些可选的实现方式中,所述通过三元损失函数训练所述EmtionNet的步骤具体包括:In some optional implementations, the step of training the EmtionNet through a triplet loss function specifically includes:
通过L=max(d(a,p)-d(a,n)+margin,0)训练所述EmtionNet,得到EmtionNet,其中d(a,p)为同一个聚类中心的输入图像,d(a,n)为不同聚类中心的输入图像,margin为超参数;Train the EmtionNet through L=max(d(a,p)-d(a,n)+margin,0) to obtain the trained EmtionNet, where d(a,p) is computed for an input image of the same cluster center, d(a,n) for an input image of a different cluster center, and margin is a hyperparameter;
将所述训练好的EmtionNet部署至客户端。Deploy the trained EmtionNet to the client.
上述实施方式中,通过上述方式,输入图像中包含三张图像,一张为基础聚类中心的图,另外一张为同一个聚类中心的图像,最后一张则为不同聚类中心的图像。a为基础聚类中心的图,p为同一个聚类中心的图像,n为不同聚类中心的图像。可以优化目标,使得a与p的距离拉近,a与n的距离拉远。In the above implementation, the input contains three images: one image of the base cluster center, one image of the same cluster center, and one image of a different cluster center. a is the image of the base cluster center, p is an image of the same cluster center, and n is an image of a different cluster center. The objective can be optimized so that the distance between a and p is shortened and the distance between a and n is lengthened.
在一些可选的实现方式中,所述通过三元损失函数训练所述EmtionNet的步骤之后还包括:In some optional implementations, after the step of training the EmtionNet through a triplet loss function, the method further includes:
获取多张测试集图像以及所述多张测试集图像对应的表情标签;Acquiring a plurality of test set images and expression tags corresponding to the plurality of test set images;
将所述多张测试集图像输入至所述训练好的EmtionNet,得到多个表情识别结果;Input the multiple test set images to the trained EmtionNet to obtain multiple expression recognition results;
若所述表情标签与对应所述表情识别结果相同,则将所述测试集图像对应的识别结果设为正确;If the expression tag is the same as the corresponding expression recognition result, the recognition result corresponding to the test set image is set to be correct;
统计正确识别结果的数量,并计算所述正确识别结果的数量与所述表情标签数量的百分比,作为所述EmtionNet的准确度。The number of correct recognition results is counted, and the percentage of the number of correct recognition results to the number of emoticon tags is calculated as the accuracy of the EmtionNet.
上述实施方式中,若所述表情标签与对应所述表情识别结果不同,则将所述测试集图像对应的识别结果设为错误;为每张测试集图像标注对应的表情标签,以及对应的基准图,以开心作为输入图像为例,则选取一张开心作为输入图像,再选取一张不同基准图,并且同为开心的图作为第一输入图像,然后选取一个非开心的图作为输入图像,输入到模型进行测试,若得到结果为开心,则是识别正确,若不是,则识别错误,通过识别所有测试集图像,初步估计模型的准确率。In the above implementation, if the expression label differs from the corresponding expression recognition result, the recognition result corresponding to that test set image is marked as wrong. Each test set image is annotated with its expression label and corresponding reference image. Taking "happy" as an example, a happy image is selected as the input image, another happy image under a different reference image is selected as the first input image, and a non-happy image is selected as a further input image; these are fed into the model for testing. If the result is "happy", the recognition is correct; otherwise it is wrong. By recognizing all test set images, the accuracy of the model is preliminarily estimated.
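The evaluation described above reduces to a label comparison; a minimal sketch (the label names are illustrative):

```python
def expression_accuracy(predicted, labeled):
    # a recognition result is correct when it equals the expression label;
    # accuracy is the percentage of correct results among all labels
    correct = sum(p == t for p, t in zip(predicted, labeled))
    return 100.0 * correct / len(labeled)
```

For example, `expression_accuracy(["happy", "sad", "happy"], ["happy", "sad", "angry"])` yields about 66.7; if this falls below the preset accuracy, the EmtionNet parameters are adjusted and the model retrained, as described below.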
所述统计正确识别结果的数量,并计算所述正确识别结果的数量与所述表情标签数量的百分比,作为所述EmtionNet的准确度之后还包括:The counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results to the number of emoticon tags, as the accuracy of the EmtionNet, further includes:
若所述EmtionNet的准确度低于预设精确度,则调整所述EmtionNet模型中的参数,重新训练。If the accuracy of the EmtionNet is lower than the preset accuracy, adjust the parameters in the EmtionNet model and retrain.
上述实施方式中,如果准确率过低,则调整神经网络参数,重新训练,得到新的神经元权值,提高识别的准确率。In the foregoing embodiment, if the accuracy rate is too low, the neural network parameters are adjusted and retrained to obtain new neuron weights to improve the accuracy of recognition.
需要强调的是,为进一步保证所述多张第一训练图像数据以及所述多张第二训练图像数据的私密和安全性,所述多张第一训练图像数据以及所述多张第二训练图像数据还可以存储于一区块链的节点中。It should be emphasized that, in order to further ensure the privacy and security of the multiple pieces of first training image data and the multiple pieces of second training image data, the multiple pieces of first training image data and the multiple pieces of second training image data may also be stored in nodes of a blockchain.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一计算机可读取存储介质中,该流程在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium. When the process is executed, it may include the processes of the above-mentioned method embodiments. Among them, the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the steps in the flowchart of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种建立表情识别模型装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。With further reference to FIG. 3, as an implementation of the method shown in FIG. 2, this application provides an embodiment of an apparatus for establishing an expression recognition model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can specifically be applied to various electronic devices.
如图3所示,本实施例所述的建立表情识别模型装置300包括:训练数据获取模块301、残差神经网络训练模块302、基准图获取模块303、聚类模块304、抽取模块305、输入模块306以及EmtionNet训练模块307。其中:As shown in FIG. 3, the apparatus 300 for establishing an expression recognition model in this embodiment includes: a training data acquisition module 301, a residual neural network training module 302, a reference image acquisition module 303, a clustering module 304, an extraction module 305, an input module 306, and an EmtionNet training module 307. Specifically:
训练数据获取模块301用于获取多张第一训练图像数据以及多张第二训练图像数据;The training data acquisition module 301 is used to acquire multiple pieces of first training image data and multiple pieces of second training image data;
残差神经网络训练模块302用于通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值;The residual neural network training module 302 is configured to train a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values output for the plurality of first training images;
基准图获取模块303用于根据所述特征值,获取多张目标图像数据、所述多张目标图像数据对应的聚类中心、以及所述多张目标图像数据对应的基准图;The reference image acquisition module 303 is configured to acquire multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data according to the feature value;
聚类模块304用于为每一张所述目标图像数据,随机抽取同一所述聚类中心,并且不同基准图的至少两张所述目标图像数据作为第一输入图像数据,得到与所述聚类中心相对应的一组配对的第一输入图像数据;The clustering module 304 is configured to, for each piece of the target image data, randomly extract at least two pieces of the target image data belonging to the same cluster center but to different reference images as first input image data, obtaining a group of paired first input image data corresponding to the cluster center;
抽取模块305用于为每一张目标图像数据的配对的所述第一输入图像数据,随机抽取相对应不同所述聚类中心的至少一张基准图,得到与所述第一输入图像数据对应的第二输入图像数据;The extraction module 305 is configured to, for the paired first input image data of each piece of target image data, randomly extract at least one reference image corresponding to a different cluster center, obtaining second input image data corresponding to the first input image data;
输入模块306用于将所述第一输入图像数据、所述第二输入图像数据以及所述第一输入图像数据对应的聚类中心输入至EmtionNet;The input module 306 is configured to input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data to EmtionNet;
EmtionNet训练模块307用于通过三元损失函数训练所述EmtionNet,得到训练好的EmtionNet。The EmtionNet training module 307 is configured to train the EmtionNet through a triplet loss function to obtain a trained EmtionNet.
在本实施例的一些可选的实现方式中,上述残差神经网络训练模块进一步用于:In some optional implementation manners of this embodiment, the above-mentioned residual neural network training module is further used for:
通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络;Training an initial residual neural network by using the plurality of first training image data to obtain a trained residual neural network;
获取第二训练图像数据,通过所述第二训练图像数据微调所述训练好的残差神经网络,得到目标残差神经网络;Acquiring second training image data, fine-tuning the trained residual neural network through the second training image data, to obtain a target residual neural network;
去除所述目标残差神经网络的逻辑回归层,将所述多张第一训练图像数据输入至所述目标残差神经网络,得到所述多张第一训练图像对应输出的特征值。The logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
在本实施例的一些可选的实现方式中,上述残差神经网络训练模块进一步用于:In some optional implementation manners of this embodiment, the above-mentioned residual neural network training module is further used for:
获取所述多张第一训练图像数据以及所述第一训练图像数据所对应的标注标签;Acquiring the multiple pieces of first training image data and the annotation labels corresponding to the first training image data;
将所述第一训练图像数据以及所述对应的标注标签输入至所述初始残差神经网络;Inputting the first training image data and the corresponding label to the initial residual neural network;
通过
Figure PCTCN2020122822-appb-000006
训练所述初始残差神经网络,得到训练好的残差神经网络,其中i,j为所述第一训练图像数据的图像标号,x为所述残差神经网络输出特征,W为神经元的权重,m为超参数,L为损失函数的值,s为固定值,
Figure PCTCN2020122822-appb-000007
为向量i以及向量j之间的夹角,X*为所述残差神经网络输出特征归一化前的值,W*为所述神经元的权重归一化前的值;
by
Figure PCTCN2020122822-appb-000006
Train the initial residual neural network to obtain a trained residual neural network, where i, j are the image labels of the first training image data, x is the output feature of the residual neural network, and W is the neuron’s Weight, m is a hyperparameter, L is the value of the loss function, s is a fixed value,
Figure PCTCN2020122822-appb-000007
Is the angle between vector i and vector j, X* is the value before normalization of the residual neural network output feature, and W* is the value before normalization of the weight of the neuron;
将所述训练好的残差神经网络部署至客户端。Deploy the trained residual neural network to the client.
在本实施例的一些可选的实现方式中,上述装置300还包括:聚类模块用于:In some optional implementation manners of this embodiment, the above-mentioned apparatus 300 further includes: a clustering module is configured to:
通过k均值聚类算法聚类所述多张第一训练图像对应输出的特征值,得到7个聚类中心;Clustering, by a k-means clustering algorithm, the feature values output for the plurality of first training images to obtain 7 cluster centers;
预设第一预设值m;Preset a first preset value m;
通过k均值聚类算法为每个所述聚类中心聚类所述第一预设值个聚类中心,得到每个所述聚类中心对应的m个基准图。Clustering the first preset number of cluster centers for each cluster center through a k-means clustering algorithm, to obtain m reference maps corresponding to each cluster center.
在本实施例的一些可选的实现方式中,上述EmtionNet训练模块进一步用于:In some optional implementations of this embodiment, the above-mentioned EmtionNet training module is further used for:
通过L=max(d(a,p)-d(a,n)+margin,0)训练所述EmtionNet,得到EmtionNet,其中d(a,p)为同一个聚类中心的输入图像,d(a,n)为不同聚类中心的输入图像,margin为超参数;Train the EmtionNet through L=max(d(a,p)-d(a,n)+margin,0) to obtain the trained EmtionNet, where d(a,p) is computed for an input image of the same cluster center, d(a,n) for an input image of a different cluster center, and margin is a hyperparameter;
将所述训练好的EmtionNet部署至客户端。Deploy the trained EmtionNet to the client.
在本实施例的一些可选的实现方式中,上述装置300还包括:测试模块用于:In some optional implementation manners of this embodiment, the above-mentioned apparatus 300 further includes: a test module for:
获取多张测试集图像以及所述多张测试集图像对应的表情标签;Acquiring a plurality of test set images and expression tags corresponding to the plurality of test set images;
将所述多张测试集图像输入至所述训练好的EmtionNet,得到多个表情识别结果;Input the multiple test set images to the trained EmtionNet to obtain multiple expression recognition results;
若所述表情标签与对应所述表情识别结果相同,则将所述测试集图像对应的识别结果设为正确;If the expression tag is the same as the corresponding expression recognition result, the recognition result corresponding to the test set image is set to be correct;
统计正确识别结果的数量,并计算所述正确识别结果的数量与所述表情标签数量的百分比,作为所述EmtionNet的准确度。The number of correct recognition results is counted, and the percentage of the number of correct recognition results to the number of emoticon tags is calculated as the accuracy of the EmtionNet.
在本实施例的一些可选的实现方式中,上述装置300还包括:调试模块用于:In some optional implementation manners of this embodiment, the above-mentioned apparatus 300 further includes: a debugging module for:
若所述EmtionNet的准确度低于预设精确度,则调整所述EmtionNet模型中的参数,重新训练。If the accuracy of the EmtionNet is lower than the preset accuracy, adjust the parameters in the EmtionNet model and retrain.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图4,图4为本实施例计算机设备基本结构框图。In order to solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 4 for details. FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
所述计算机设备4包括通过系统总线相互通信连接存储器41、处理器42、网络接口43。需要指出的是,图中仅示出了具有组件41-43的计算机设备4,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable GateArray,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are connected to each other in communication via a system bus. It should be pointed out that the figure only shows the computer device 4 with components 41-43, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. Its hardware includes, but is not limited to, a microprocessor, a dedicated Integrated Circuit (Application Specific Integrated Circuit, ASIC), Programmable Gate Array (Field-Programmable GateArray, FPGA), Digital Processor (Digital Signal Processor, DSP), embedded equipment, etc.
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
所述存储器41至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等,所述计算机可读存储介质可以是非易失性,也可以是易失性。在一些实施例中,所述存储器41可以是所述计算机设备4的内部存储单元,例如该计算机设备4的硬盘或内存。在另一些实施例中,所述存储器41也可以是所述计算机设备4的外部存储设备,例如该计算机设备4上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器41还可以既包括所述计算机设备4的内部存储单元也包括其外部存储设备。本实施例中,所述存储器41通常用于存储安装于所述计算机设备4的操作系统和各类应用软件,例如建立表情识别模型方法的计算机可读指令等。此外,所述存储器41还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 41 includes at least one type of readable storage medium, including a flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like; the computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions of the method for establishing an expression recognition model. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
所述处理器42在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器42通常用于控制所述计算机设备4的总体操作。本实施例中,所述处理器42用于运行所述存储器41中存储的计算机可读指令或者处理数据,例如运行所述建立表情识别模型方法的计算机可读指令。The processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run computer-readable instructions or processed data stored in the memory 41, for example, run the computer-readable instructions of the method for establishing an expression recognition model.
所述网络接口43可包括无线网络接口或有线网络接口,该网络接口43通常用于在所述计算机设备4与其他电子设备之间建立通信连接。The network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如上述的建立表情识别模型方法的步骤。The present application further provides another implementation, namely a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the method for establishing an expression recognition model described above.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of this application.
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员来而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。Obviously, the embodiments described above are only some rather than all of the embodiments of this application; the drawings show preferred embodiments of this application but do not limit its patent scope. This application may be implemented in many different forms; on the contrary, these embodiments are provided so that the understanding of the disclosure of this application will be thorough and comprehensive. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing specific implementations or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. 一种建立表情识别模型方法,包括下述步骤:A method for establishing an expression recognition model includes the following steps:
    获取多张第一训练图像数据以及多张第二训练图像数据;Acquiring multiple pieces of first training image data and multiple pieces of second training image data;
    通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值;Training a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values corresponding to the output of the plurality of first training images;
    根据所述特征值,获取多张目标图像数据、所述多张目标图像数据对应的聚类中心、以及所述多张目标图像数据对应的基准图;Acquiring, according to the feature value, multiple pieces of target image data, cluster centers corresponding to the multiple pieces of target image data, and reference images corresponding to the multiple pieces of target image data;
    为每一张所述目标图像数据,随机抽取同一所述聚类中心,并且不同基准图的至少两张所述目标图像数据作为第一输入图像数据,得到与所述聚类中心相对应的一组配对的第一输入图像数据;For each piece of the target image data, randomly extracting, as first input image data, at least two pieces of the target image data belonging to the same cluster center but to different reference images, to obtain a group of paired first input image data corresponding to the cluster center;
    为每一张目标图像数据的配对的所述第一输入图像数据,随机抽取相对应不同所述聚类中心的至少一张基准图,得到与所述第一输入图像数据对应的第二输入图像数据;For the paired first input image data of each piece of target image data, randomly extracting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    将所述第一输入图像数据、所述第二输入图像数据以及所述第一输入图像数据对应的聚类中心输入至EmtionNet;Input the first input image data, the second input image data, and the cluster centers corresponding to the first input image data into EmtionNet;
    通过三元损失函数训练所述EmtionNet,得到训练好的EmtionNet。The EmtionNet is trained through a triplet loss function to obtain a trained EmtionNet.
  2. 根据权利要求1所述的建立表情识别模型方法,其中,所述通过所述多张第一训练图像数据以及所述多张第二训练图像数据,训练残差神经网络,得到目标残差神经网络以及所述多张第一训练图像对应输出的特征值的步骤具体包括:The method for establishing an expression recognition model according to claim 1, wherein the step of training a residual neural network through the plurality of first training image data and the plurality of second training image data to obtain the target residual neural network and the feature values output for the plurality of first training images specifically includes:
    通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络;获取第二训练图像数据,通过所述第二训练图像数据微调所述训练好的残差神经网络,得到目标残差神经网络;The initial residual neural network is trained through the plurality of first training image data to obtain a trained residual neural network; second training image data is acquired, and the trained residual neural network is fine-tuned through the second training image data to obtain the target residual neural network;
    去除所述目标残差神经网络的逻辑回归层,将所述多张第一训练图像数据输入至所述目标残差神经网络,得到所述多张第一训练图像对应输出的特征值。The logistic regression layer of the target residual neural network is removed, and the multiple first training image data are input to the target residual neural network to obtain feature values corresponding to the output of the multiple first training images.
  3. 根据权利要求2所述的建立表情识别模型方法,其中,所述通过所述多张第一训练图像数据训练初始残差神经网络,得到训练好的残差神经网络的步骤具体包括:The method for establishing an expression recognition model according to claim 2, wherein the step of training an initial residual neural network through the plurality of first training image data to obtain a trained residual neural network specifically comprises:
    获取所述多张第一训练图像数据以及所述第一训练图像数据所对应的标注标签;Acquiring the multiple pieces of first training image data and the annotation labels corresponding to the first training image data;
    将所述第一训练图像数据以及所述对应的标注标签输入至所述初始残差神经网络;Inputting the first training image data and the corresponding label to the initial residual neural network;
    Training the initial residual neural network with the additive angular margin loss

    L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi,i}+m)} / ( e^{s·cos(θ_{yi,i}+m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

    to obtain a trained residual neural network, where N is the number of first training images, i and j are the image indices of the first training image data, yi is the annotation label of image i, x is the output feature of the residual neural network, W is the neuron weight, m is a hyperparameter, L is the value of the loss function, s is a fixed scale value, θ_{j,i} = arccos( (W_j*)ᵀ x_i* / (‖W_j*‖ ‖x_i*‖) ) is the angle between vector j and vector i, x* is the output feature of the residual neural network before normalization, and W* is the neuron weight before normalization;
    Deploying the trained residual neural network to the client.
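As an illustrative (non-claimed) sketch, the additive angular margin loss described in claim 3 can be written in NumPy as follows; the function name, array shapes, and the default values of s and m are assumptions, not taken from the claims:

```python
import numpy as np

def arcface_loss(x_raw, W_raw, labels, s=64.0, m=0.5):
    # Normalize features and class weights: x = x*/||x*||, W = W*/||W*||.
    x = x_raw / np.linalg.norm(x_raw, axis=1, keepdims=True)
    W = W_raw / np.linalg.norm(W_raw, axis=0, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)                # cos(theta_{j,i})
    idx = np.arange(len(labels))
    theta_y = np.arccos(cos[idx, labels])          # angle to the labeled class
    logits = s * cos
    logits[idx, labels] = s * np.cos(theta_y + m)  # additive angular margin m
    logits -= logits.max(axis=1, keepdims=True)    # softmax numerical stability
    e = np.exp(logits)
    p = e[idx, labels] / e.sum(axis=1)
    return float(-np.mean(np.log(p)))
```

With m = 0 this reduces to a plain scaled softmax cross-entropy; the margin makes the target-class logit harder to satisfy, which tightens intra-class angles.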
  4. The method for establishing an expression recognition model according to any one of claims 1-3, wherein before the step of acquiring, according to the feature values, the plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data, the method further comprises:
    Clustering the feature values output for the plurality of first training images with a k-means clustering algorithm to obtain 7 cluster centers;
    Presetting a first preset value m;
    Clustering, with the k-means clustering algorithm, the first preset number of sub-cluster centers for each of the cluster centers, to obtain the m reference images corresponding to each cluster center.
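The two-stage clustering of claim 4 (7 expression-level cluster centers, then m reference images per center) might look like the following NumPy sketch; the helper names and the minimal fixed-iteration k-means routine are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Plain k-means: pick k initial centers from the data, then
    # alternate assignment and centroid update for a fixed budget.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_reference_maps(features, m=5):
    # Stage 1: 7 expression-level cluster centers over all feature values.
    centers, labels = kmeans(features, 7)
    # Stage 2: m sub-centers per cluster; their centroids serve as
    # the reference images for that cluster center.
    refs = {}
    for c in range(7):
        members = features[labels == c]
        if len(members) == 0:          # guard against an empty cluster
            refs[c] = members
            continue
        refs[c] = kmeans(members, min(m, len(members)))[0]
    return centers, refs
```

In practice a library implementation (e.g. a k-means routine with proper initialization and convergence checks) would replace the toy `kmeans` above.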
  5. The method for establishing an expression recognition model according to claim 4, wherein the step of training the EmtionNet with a triplet loss function specifically comprises:
    Training the EmtionNet with L = max(d(a,p) - d(a,n) + margin, 0) to obtain the trained EmtionNet, where d(a,p) is the distance between input images from the same cluster center, d(a,n) is the distance between input images from different cluster centers, and margin is a hyperparameter;
    Deploying the trained EmtionNet to the client.
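A minimal sketch of the triplet loss L = max(d(a,p) - d(a,n) + margin, 0) used in claim 5, assuming Euclidean distance for d (the function name and default margin are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # d(a,p): anchor-positive distance (same cluster center);
    # d(a,n): anchor-negative distance (different cluster center).
    d_ap = np.linalg.norm(anchor - positive, axis=-1)
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())
```

The loss is zero once every negative is farther from its anchor than the positive by at least the margin, so training focuses on triplets that still violate that separation.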
  6. The method for establishing an expression recognition model according to claim 5, wherein after the step of training the EmtionNet with a triplet loss function, the method further comprises:
    Acquiring a plurality of test set images and the expression labels corresponding to the plurality of test set images;
    Inputting the plurality of test set images into the trained EmtionNet to obtain a plurality of expression recognition results;
    If an expression label is identical to the corresponding expression recognition result, setting the recognition result corresponding to that test set image as correct;
    Counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet.
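The accuracy measure of claim 6 reduces to the percentage of predictions that match their labels; a minimal sketch (function name illustrative):

```python
def recognition_accuracy(predictions, labels):
    # Percentage of test images whose predicted expression
    # matches the annotated expression label.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```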
  7. The method for establishing an expression recognition model according to claim 6, wherein after the step of counting the number of correct recognition results and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet, the method further comprises:
    If the accuracy of the EmtionNet is lower than a preset accuracy, adjusting the parameters of the EmtionNet model and retraining.
  8. An apparatus for establishing an expression recognition model, comprising:
    a training data acquisition module, configured to acquire a plurality of pieces of first training image data and a plurality of pieces of second training image data;
    a residual neural network training module, configured to train a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
    a reference image acquisition module, configured to acquire, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data;
    a clustering module, configured to, for each piece of the target image data, randomly select at least two pieces of target image data that correspond to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to that cluster center;
    an extraction module, configured to, for the paired first input image data of each piece of target image data, randomly select at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    an input module, configured to input the first input image data, the second input image data, and the cluster center corresponding to the first input image data into the EmtionNet;
    an EmtionNet training module, configured to train the EmtionNet with a triplet loss function to obtain a trained EmtionNet.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein, when the processor executes the computer-readable instructions, the following steps of the method for establishing an expression recognition model are implemented:
    Acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
    Training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
    Acquiring, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data;
    For each piece of the target image data, randomly selecting at least two pieces of target image data that correspond to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to that cluster center;
    For the paired first input image data of each piece of target image data, randomly selecting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    Inputting the first input image data, the second input image data, and the cluster center corresponding to the first input image data into the EmtionNet;
    Training the EmtionNet with a triplet loss function to obtain a trained EmtionNet.
  10. The computer device according to claim 9, wherein the step of training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images, specifically comprises:
    Training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network; acquiring the second training image data, and fine-tuning the trained residual neural network with the second training image data to obtain the target residual neural network;
    Removing the logistic regression layer of the target residual neural network, and inputting the plurality of pieces of first training image data into the target residual neural network to obtain the feature values output for the plurality of first training images.
  11. The computer device according to claim 10, wherein the step of training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network specifically comprises:
    Acquiring the plurality of pieces of first training image data and the annotation labels corresponding to the first training image data;
    Inputting the first training image data and the corresponding annotation labels into the initial residual neural network;
    Training the initial residual neural network with the additive angular margin loss

    L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi,i}+m)} / ( e^{s·cos(θ_{yi,i}+m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

    to obtain a trained residual neural network, where N is the number of first training images, i and j are the image indices of the first training image data, yi is the annotation label of image i, x is the output feature of the residual neural network, W is the neuron weight, m is a hyperparameter, L is the value of the loss function, s is a fixed scale value, θ_{j,i} = arccos( (W_j*)ᵀ x_i* / (‖W_j*‖ ‖x_i*‖) ) is the angle between vector j and vector i, x* is the output feature of the residual neural network before normalization, and W* is the neuron weight before normalization;
    Deploying the trained residual neural network to the client.
  12. The computer device according to any one of claims 9-11, wherein before the step of acquiring, according to the feature values, the plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data, the method further comprises:
    Clustering the feature values output for the plurality of first training images with a k-means clustering algorithm to obtain 7 cluster centers;
    Presetting a first preset value m;
    Clustering, with the k-means clustering algorithm, the first preset number of sub-cluster centers for each of the cluster centers, to obtain the m reference images corresponding to each cluster center.
  13. The computer device according to claim 12, wherein the step of training the EmtionNet with a triplet loss function specifically comprises:
    Training the EmtionNet with L = max(d(a,p) - d(a,n) + margin, 0) to obtain the trained EmtionNet, where d(a,p) is the distance between input images from the same cluster center, d(a,n) is the distance between input images from different cluster centers, and margin is a hyperparameter;
    Deploying the trained EmtionNet to the client.
  14. The computer device according to claim 13, wherein after the step of training the EmtionNet with a triplet loss function, the method further comprises:
    Acquiring a plurality of test set images and the expression labels corresponding to the plurality of test set images;
    Inputting the plurality of test set images into the trained EmtionNet to obtain a plurality of expression recognition results;
    If an expression label is identical to the corresponding expression recognition result, setting the recognition result corresponding to that test set image as correct;
    Counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet.
  15. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following steps of the method for establishing an expression recognition model are implemented:
    Acquiring a plurality of pieces of first training image data and a plurality of pieces of second training image data;
    Training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images;
    Acquiring, according to the feature values, a plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data;
    For each piece of the target image data, randomly selecting at least two pieces of target image data that correspond to the same cluster center but to different reference images as first input image data, to obtain a group of paired first input image data corresponding to that cluster center;
    For the paired first input image data of each piece of target image data, randomly selecting at least one reference image corresponding to a different cluster center, to obtain second input image data corresponding to the first input image data;
    Inputting the first input image data, the second input image data, and the cluster center corresponding to the first input image data into the EmtionNet;
    Training the EmtionNet with a triplet loss function to obtain a trained EmtionNet.
  16. The computer-readable storage medium according to claim 15, wherein the step of training a residual neural network with the plurality of pieces of first training image data and the plurality of pieces of second training image data, to obtain a target residual neural network and the feature values output for the plurality of first training images, specifically comprises:
    Training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network;
    Acquiring the second training image data, and fine-tuning the trained residual neural network with the second training image data to obtain the target residual neural network;
    Removing the logistic regression layer of the target residual neural network, and inputting the plurality of pieces of first training image data into the target residual neural network to obtain the feature values output for the plurality of first training images.
  17. The computer-readable storage medium according to claim 16, wherein the step of training an initial residual neural network with the plurality of pieces of first training image data to obtain a trained residual neural network specifically comprises:
    Acquiring the plurality of pieces of first training image data and the annotation labels corresponding to the first training image data;
    Inputting the first training image data and the corresponding annotation labels into the initial residual neural network;
    Training the initial residual neural network with the additive angular margin loss

    L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{yi,i}+m)} / ( e^{s·cos(θ_{yi,i}+m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

    to obtain a trained residual neural network, where N is the number of first training images, i and j are the image indices of the first training image data, yi is the annotation label of image i, x is the output feature of the residual neural network, W is the neuron weight, m is a hyperparameter, L is the value of the loss function, s is a fixed scale value, θ_{j,i} = arccos( (W_j*)ᵀ x_i* / (‖W_j*‖ ‖x_i*‖) ) is the angle between vector j and vector i, x* is the output feature of the residual neural network before normalization, and W* is the neuron weight before normalization;
    Deploying the trained residual neural network to the client.
  18. The computer-readable storage medium according to any one of claims 15-17, wherein before the step of acquiring, according to the feature values, the plurality of pieces of target image data, the cluster centers corresponding to the plurality of pieces of target image data, and the reference images corresponding to the plurality of pieces of target image data, the method further comprises:
    Clustering the feature values output for the plurality of first training images with a k-means clustering algorithm to obtain 7 cluster centers;
    Presetting a first preset value m;
    Clustering, with the k-means clustering algorithm, the first preset number of sub-cluster centers for each of the cluster centers, to obtain the m reference images corresponding to each cluster center.
  19. The computer-readable storage medium according to claim 18, wherein the step of training the EmtionNet with a triplet loss function specifically comprises:
    Training the EmtionNet with L = max(d(a,p) - d(a,n) + margin, 0) to obtain the trained EmtionNet, where d(a,p) is the distance between input images from the same cluster center, d(a,n) is the distance between input images from different cluster centers, and margin is a hyperparameter;
    Deploying the trained EmtionNet to the client.
  20. The computer-readable storage medium according to claim 19, wherein after the step of training the EmtionNet with a triplet loss function, the method further comprises:
    Acquiring a plurality of test set images and the expression labels corresponding to the plurality of test set images;
    Inputting the plurality of test set images into the trained EmtionNet to obtain a plurality of expression recognition results;
    If an expression label is identical to the corresponding expression recognition result, setting the recognition result corresponding to that test set image as correct;
    Counting the number of correct recognition results, and calculating the percentage of the number of correct recognition results relative to the number of expression labels as the accuracy of the EmtionNet.
PCT/CN2020/122822 2020-07-31 2020-10-22 Method and apparatus for establishing expression recognition model, and computer device and storage medium WO2021139316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010761705.0A CN111898550B (en) 2020-07-31 2020-07-31 Expression recognition model building method and device, computer equipment and storage medium
CN202010761705.0 2020-07-31

Publications (1)

Publication Number Publication Date
WO2021139316A1 true WO2021139316A1 (en) 2021-07-15

Family

ID=73182963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122822 WO2021139316A1 (en) 2020-07-31 2020-10-22 Method and apparatus for establishing expression recognition model, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111898550B (en)
WO (1) WO2021139316A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240699A (en) * 2021-12-22 2022-03-25 长春嘉诚信息技术股份有限公司 Criminal reconstruction means recommendation method based on cycle sign correction
WO2023221713A1 (en) * 2022-05-16 2023-11-23 腾讯科技(深圳)有限公司 Image encoder training method and apparatus, device, and medium

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112631888A (en) * 2020-12-30 2021-04-09 航天信息股份有限公司 Fault prediction method and device of distributed system, storage medium and electronic equipment
CN113807265B (en) * 2021-09-18 2022-05-06 山东财经大学 Diversified human face image synthesis method and system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109002790A (en) * 2018-07-11 2018-12-14 广州视源电子科技股份有限公司 A kind of method, apparatus of recognition of face, equipment and storage medium
CN109658445A (en) * 2018-12-14 2019-04-19 北京旷视科技有限公司 Network training method, increment build drawing method, localization method, device and equipment
CN110555390A (en) * 2019-08-09 2019-12-10 厦门市美亚柏科信息股份有限公司 pedestrian re-identification method, device and medium based on semi-supervised training mode
CN111191587A (en) * 2019-12-30 2020-05-22 兰州交通大学 Pedestrian re-identification method and system

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN103559504B (en) * 2013-11-04 2016-08-31 北京京东尚科信息技术有限公司 Image target category identification method and device
CN108629414B (en) * 2018-05-09 2020-04-14 清华大学 Deep hash learning method and device
CN111310808B (en) * 2020-02-03 2024-03-22 平安科技(深圳)有限公司 Training method and device for picture recognition model, computer system and storage medium
CN111460923A (en) * 2020-03-16 2020-07-28 平安科技(深圳)有限公司 Micro-expression recognition method, device, equipment and storage medium

Non-Patent Citations (2)

Title
NGO QUAN T., YOON SEOKHOON: "Facial Expression Recognition Based on Weighted-Cluster Loss and Deep Transfer Learning Using a Highly Imbalanced Dataset", SENSORS, vol. 20, no. 9, 2639, 5 May 2020 (2020-05-05), pages 1-21, XP055826680 *
ZHANG JIANMING, LU CHAOQUAN, WANG JIN, YUE XIAO-GUANG, LIM SE-JUNG, AL-MAKHADMEH ZAFER, TOLBA AMR: "Training Convolutional Neural Networks with Multi-Size Images and Triplet Loss for Remote Sensing Scene Classification", SENSORS, vol. 20, no. 4, 1188, 21 February 2020 (2020-02-21), pages 1-21, XP055826677, DOI: 10.3390/s20041188 *

Also Published As

Publication number Publication date
CN111898550A (en) 2020-11-06
CN111898550B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
WO2021139316A1 (en) Method and apparatus for establishing expression recognition model, and computer device and storage medium
CN108416370B (en) Image classification method and device based on semi-supervised deep learning and storage medium
CN107680019B (en) Examination scheme implementation method, device, equipment and storage medium
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
WO2021218029A1 (en) Artificial intelligence-based interview method and apparatus, computer device, and storage medium
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN112668482B (en) Face recognition training method, device, computer equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN113127633A (en) Intelligent conference management method and device, computer equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN112417121A (en) Client intention recognition method and device, computer equipment and storage medium
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN111339290A (en) Text classification method and system
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
CN114637831A (en) Data query method based on semantic analysis and related equipment thereof
CN112446360A (en) Target behavior detection method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911325

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911325

Country of ref document: EP

Kind code of ref document: A1