Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of it. It should be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary architecture 100 to which embodiments of the facial expression recognition method or apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
Various client applications may be installed on the terminal devices 101, 102, 103. Such as image processing applications, search applications, content sharing applications, art-beautifying applications, instant messaging applications, and the like. The terminal devices 101, 102, 103 may interact with the server 105 via the network 104 to receive or send messages or the like.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices that can receive user operations, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules for providing distributed services) or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a background server supporting client applications installed on the terminal devices 101, 102, 103. The server 105 may recognize a facial expression presented by the face image of the target object, resulting in facial expression information corresponding to the face image.
The server 105 may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.
It should be noted that the facial expression recognition method provided by the embodiment of the present disclosure may be executed by the server 105, and may also be executed by the terminal devices 101, 102, and 103. Accordingly, the facial expression recognition apparatus may be provided in the server 105, or may be provided in the terminal devices 101, 102, 103.
It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the data used in obtaining the facial expression information does not need to be acquired remotely, the above system architecture may not include a network, and may include only a terminal device or only a server.
With continued reference to fig. 2, a flow 200 of one embodiment of a facial expression recognition method according to the present disclosure is shown. The facial expression recognition method comprises the following steps:
Step 201, inputting the acquired facial image including the target object into a preset number of facial expression recognition models trained in advance, to obtain a facial expression recognition result of the target object output by each facial expression recognition model.
In the present embodiment, the execution subject of the above-described facial expression recognition method (e.g., the terminal device 101, 102, 103 or the server 105 shown in fig. 1) may be equipped with or connected to a photographing apparatus. The face image may be captured by the photographing apparatus and transmitted to the execution subject. Alternatively, the face image may be stored locally in advance, and the execution subject may acquire the face image through path information indicating the location where the face image is stored. Here, the face image may be a face image of a target object, and the target object may be, for example, a user, or may be an animal (e.g., a kitten or a puppy).
In this embodiment, after acquiring the face image, the execution subject may input the acquired face image to a preset number of facial expression recognition models. Here, the preset number may be set in advance, and may be, for example, 2, 3, or the like. Each facial expression recognition model is used for recognizing the facial expression of the object presented by the face image, so that a facial expression recognition result output by each facial expression recognition model is obtained. Here, the facial expression may include, for example, but is not limited to, crying, laughing, sadness, surprise, fear, frowning, glaring, and the like, which is not limited herein. In general, different facial expression recognition models are trained based on different training samples. Therefore, when the facial expression presented by a given face image is ambiguous or exaggerated, the recognition results output by different facial expression recognition models for that face image are not necessarily the same, and when only a single facial expression recognition model is used for recognition, the accuracy of the recognized facial expression information is low. Accordingly, recognizing the facial expression presented by the face image with the preset number of facial expression recognition models can improve the accuracy of the obtained facial expression information and enhance user interest.
As an example, each of the preset number of facial expression recognition models may be a correspondence table in which a plurality of facial images and corresponding facial expression information are stored, which is previously prepared by a technician based on statistics of a large number of facial images and facial expression recognition results for characterizing the facial images; the model may be a model obtained by training an initial model (e.g., a neural network) by a machine learning method based on a preset training sample. It should be noted that the facial expression recognition result may include, but is not limited to, at least one of the following: characters, numbers, symbols, images.
By way of example, each of the facial expression recognition models described above may include, but is not limited to, a convolutional layer, a pooling layer, and a fully-connected layer. The convolutional layer is used for extracting features of the parts of the face image used for recognizing facial expressions. For example, features of parts such as the eyebrows, eyes, mouth shape, and facial muscles may be extracted to obtain feature maps or feature vectors for the respective parts. Then, the obtained feature maps or feature vectors are input to the fully-connected layer, which outputs a probability value for each piece of preset facial expression information. Finally, the facial expression information with the largest probability value is selected as the facial expression information corresponding to the face image.
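The final selection step described above can be sketched in plain Python. The expression labels and logit values below are hypothetical stand-ins; in an actual model the logits would be produced by the fully-connected layer from the extracted image features.

```python
import math

# Hypothetical expression categories; the actual label set is defined by the model.
EXPRESSIONS = ["cry", "laugh", "sad", "surprise", "frown"]

def softmax(logits):
    """Convert raw scores from the fully-connected layer into probability values."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels=EXPRESSIONS):
    """Select the facial expression information with the largest probability value."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Example: hypothetical logits in which the second category dominates.
label, prob = classify([0.1, 3.2, 0.4, 1.0, -0.5])
# label == "laugh"
```

The softmax normalization guarantees the outputs form a probability distribution, so the argmax step picks the single most likely expression category.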
Step 202, counting the obtained facial expression recognition result.
In this embodiment, the execution subject may perform statistics on the facial expression recognition results based on the facial expression recognition result recognized by each facial expression recognition model obtained in step 201. As an example, suppose that after a face image is input to 3 facial expression recognition models, the facial expression information obtained is smile, cry, and smile, respectively. Then, the number of occurrences of the facial expression information "smile" is counted as 2, and the number of occurrences of the facial expression information "cry" is counted as 1.
In step 203, facial expression information corresponding to the target object is determined based on the statistical result.
In this embodiment, based on the statistics of the facial expression recognition results, the execution subject may take the facial expression corresponding to the recognition result with the largest count as the facial expression of the target object. Specifically, in the example shown in step 202, since "smile" occurs most frequently, "smile" may be determined as the facial expression information corresponding to the target object.
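Steps 202 and 203 together amount to a majority vote over the per-model results. A minimal sketch in Python, using the hypothetical model outputs from the example above:

```python
from collections import Counter

def recognize_expression(results):
    """Count the per-model recognition results (step 202) and return the
    most frequent facial expression with its vote count (step 203)."""
    counts = Counter(results)
    expression, votes = counts.most_common(1)[0]
    return expression, votes

# Hypothetical outputs of three facial expression recognition models.
expression, votes = recognize_expression(["smile", "cry", "smile"])
# expression == "smile", votes == 2
```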
It is to be noted here that the preset number of facial expression recognition models may be an odd number of facial expression recognition models. By setting the preset number to be odd, it is possible to avoid the situation that arises with an even number (for example, 4) in which two of the predicted pieces of facial expression information are the same as each other and the other two are the same as each other, so that the execution subject cannot determine the facial expression information corresponding to the target object. Thus, the predicted facial expression information can be made more accurate.
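A short sketch illustrating the tie problem the odd setting avoids. Note that an odd model count rules out a tie between two candidate expressions; with three or more distinct candidate expressions a tie is still possible in principle.

```python
from collections import Counter

def has_unique_majority(results):
    """Return True if exactly one expression receives the highest vote count."""
    counts = Counter(results).most_common()
    return len(counts) == 1 or counts[0][1] > counts[1][1]

# With 4 models, a 2-2 split leaves no unique winner ...
even_split = ["smile", "smile", "cry", "cry"]
# ... while 3 models voting over two candidate expressions cannot tie.
odd_votes = ["smile", "smile", "cry"]
```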
Further referring to fig. 3, an application scenario diagram of the facial expression recognition method of the present disclosure is shown.
In the application scenario shown in fig. 3, the photographing apparatus inputs an acquired face image 301 of the user to the server 302. After receiving the face image 301, the server 302 may input the face image 301 to the facial expression recognition model A, the facial expression recognition model B, and the facial expression recognition model C, respectively, to obtain facial expression information "smile" corresponding to the facial expression recognition model A, facial expression information "smile" corresponding to the facial expression recognition model B, and facial expression information "sad" corresponding to the facial expression recognition model C. Then, among the obtained output results, the server 302 counts 2 output results corresponding to the facial expression information "smile" and 1 output result corresponding to the facial expression information "sad". Thus, the server 302 may determine "smile" as the facial expression presented by the face image 301.
According to the facial expression recognition method provided by the embodiment of the disclosure, a plurality of facial expression recognition models are trained, a facial image including a target object is input into the facial expression recognition models to obtain a plurality of facial expression recognition results, the plurality of facial expression recognition results are then counted, and the facial expression information corresponding to the target object is determined based on the statistical result, so that the determined facial expression information is more accurate.
In some optional implementations of the present disclosure, the preset number of facial expression recognition models are obtained by training based on a training sample set. With further reference to fig. 4, a flow 400 of an alternative embodiment of a manner of training a facial expression recognition model according to the present disclosure is shown. The preset number of facial expression recognition models are obtained by training through the following steps:
step 401, obtaining a preset number of training sample sets.
Here, the number of training sample sets acquired is the same as the number of facial expression recognition models to be trained. That is, the training sample set and the facial expression recognition model are in a one-to-one correspondence relationship. The training samples of each of the preset number of training sample sets include sample face images and labeling information for labeling the sample face images. Here, the annotation information is used to indicate facial expression information corresponding to the sample face image.
It is to be noted that the labeling information of the sample face images included in each of the preset number of training sample sets is labeled based on a different labeling manner. Specifically, the different labeling manners may correspond to labeling by different annotators based on subjective judgment. That is, the labeling information of the face images in each training sample set is labeled by a different user based on subjective judgment. Thus, different annotators may produce different labeling results for similar facial expressions, which makes the labeling of facial expressions more comprehensive.
Here, the facial expression information includes, but is not limited to, crying, laughing, sadness, surprise, fear, frowning, glaring, and the like.
Step 402, training an initial facial expression recognition model to be trained by taking sample facial images in a preset number of training sample sets as input and taking labeling information corresponding to the sample facial images as expected output, so as to generate the trained initial facial expression recognition model.
Specifically, the execution subject or another electronic device may use a machine learning method, take the sample facial images in the preset number of training sample sets as the input of the initial model, take the labeling information corresponding to the input sample facial images as the expected output of the initial facial expression recognition model, and train the initial facial expression recognition model, finally obtaining the trained facial expression recognition model. Here, various existing convolutional neural network structures may be used as the initial model for training. It should be noted that the execution subject or other electronic devices may also use other models with image processing functions as the initial facial expression recognition model; the model is not limited to a convolutional neural network, and the specific model structure may be set according to actual requirements, which is not limited herein.
In some optional implementation manners, training an initial facial expression recognition model to be trained by taking sample facial images in a preset number of training sample sets as input and taking annotation information corresponding to the sample facial images as expected output to generate the trained initial facial expression recognition model may specifically include the following steps:
step 4021, inputting sample facial images in a preset number of training sample sets to a feature extraction layer of an initial facial expression recognition model to be trained to obtain image features.
Here, the initial facial expression recognition model may be a neural network. The feature extraction layer may include a convolutional layer, a pooling layer, and the like. The image features may include features indicating the location of facial eyebrows, eyes, mouth shape, facial muscles, etc. that are presented.
Step 4022, inputting the obtained image features into a sub-network of the initial facial expression recognition model to be trained, so as to generate a probability value indicating that the facial expression presented by the sample facial image is the labeled facial expression.
Specifically, facial expression information of a plurality of categories is preset in the initial facial expression recognition model to be trained. The sub-network of the initial facial expression recognition model to be trained may be a fully-connected layer, a classification network, or the like. After the obtained image features are input to the sub-network of the initial facial expression recognition model to be trained, probability values indicating the facial expression information of the respective pre-stored categories may be obtained based on the image features. A probability value indicating that the facial expression presented by the sample facial image is the annotated facial expression may thus be determined.
Step 4023, determining whether the preset loss function converges based on the obtained probability value corresponding to the sample face image.
Specifically, the preset loss function may be a softmax loss function. The obtained probability value can be substituted into the softmax loss function to determine whether the softmax loss function converges. Here, convergence means that the value of the preset loss function reaches a preset loss value.
Step 4024, determining that the training of the initial facial expression recognition model is completed in response to determining that the preset loss function is converged.
Step 4025, in response to determining that the preset loss function does not converge, adjusting parameters of the initial facial expression recognition model to be trained by using a back propagation algorithm, and continuing to execute the training steps shown in steps 4021 to 4024.
Here, adjusting parameters of the initial facial expression recognition model to be trained may specifically include adjusting the number of convolution layers, adjusting the size of a convolution kernel, adjusting the step size of the convolution kernel, and the like.
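A minimal sketch of the loop in steps 4021 to 4025, written in plain Python. A linear softmax classifier stands in for the feature-extraction layer plus sub-network; the learning rate, loss threshold, and toy "image features" below are all hypothetical, and a real implementation would use a deep-learning framework with actual extracted features.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(samples, n_features, n_classes, lr=0.5, loss_threshold=0.1, max_epochs=200):
    """Train a linear stand-in for the model until the mean softmax loss
    falls below `loss_threshold` (the 'preset loss value' of step 4023).

    `samples` is a list of (feature_vector, class_index) pairs."""
    weights = [[0.0] * n_features for _ in range(n_classes)]
    mean_loss = float("inf")
    for _ in range(max_epochs):
        total_loss = 0.0
        for x, y in samples:
            # Step 4022: probability values for each expression category.
            logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
            probs = softmax(logits)
            # Step 4023: softmax (cross-entropy) loss for this sample.
            total_loss += -math.log(probs[y] + 1e-12)
            # Step 4025: gradient step (back propagation for a linear model).
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(n_features):
                    weights[c][j] -= lr * grad * x[j]
        mean_loss = total_loss / len(samples)
        if mean_loss < loss_threshold:  # Step 4024: training is complete.
            break
    return weights, mean_loss

# Two hypothetical one-hot "image features", one per expression category.
weights, final_loss = train([([1.0, 0.0], 0), ([0.0, 1.0], 1)], 2, 2)
```

On this separable toy data the mean loss drops below the threshold within a few epochs, which is the convergence condition described in step 4024.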
Step 403, for a training sample set in a preset number of training sample sets, adjusting the initial facial expression recognition model after training by taking the sample facial image in the training sample set as input and taking the label information corresponding to the sample facial image as expected output, and generating a facial expression recognition model corresponding to the training sample set as one of the preset number of facial expression recognition models based on the adjustment result.
Here, after the trained initial facial expression recognition model is obtained based on step 402, the model is relatively generalized because it is trained on all training samples in the preset number of training sample sets. In order to make the facial expression recognition model recognize more accurately and learn more detailed features, for each training sample set, the trained initial facial expression recognition model can be adjusted by using that training sample set, so as to obtain a facial expression recognition model trained based on that training sample set. Here, adjusting the trained initial facial expression recognition model may specifically include adjusting the number of convolution layers, the convolution kernel step size, and the like. Therefore, based on each training sample set, a facial expression recognition model corresponding to that training sample set can be obtained, that is, the preset number of facial expression recognition models are obtained.
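The train-then-specialize procedure of steps 402 and 403 can be sketched as the following orchestration. `fine_tune` is a hypothetical callback standing in for the per-set adjustment, and the dict-based stub model in the example is purely illustrative.

```python
import copy

def build_ensemble(base_model, sample_sets, fine_tune):
    """Clone the jointly trained base model once per training sample set
    (the step 402 output) and specialize each clone on its own set (step 403)."""
    models = []
    for sample_set in sample_sets:
        model = copy.deepcopy(base_model)  # start from the generalized base model
        fine_tune(model, sample_set)       # adjust on one annotator's sample set
        models.append(model)
    return models

# Illustrative stub: a dict-based "model" and a fine_tune that just records
# how many samples it saw.
base = {"seen": 0}
def fake_fine_tune(model, sample_set):
    model["seen"] += len(sample_set)

ensemble = build_ensemble(base, [[1, 2], [3]], fake_fine_tune)
# ensemble[0]["seen"] == 2, ensemble[1]["seen"] == 1; base is left untouched.
```

Deep-copying before fine-tuning is what keeps the models independent: each member of the ensemble diverges from the shared base only through its own training sample set.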
By using the preset number of facial expression recognition models obtained by training in the training manner shown in fig. 4, more facial features presented by facial images can be recognized, so that the recognition result is more accurate.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a facial expression recognition apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the present embodiment provides a facial expression recognition apparatus 500 including an input unit 501, a first determination unit 502, and a second determination unit 503. The input unit 501 is configured to input the acquired facial image of the target object to a preset number of facial expression recognition models trained in advance, to obtain a facial expression recognition result of the target object output by each facial expression recognition model, wherein the preset number of facial expression recognition results are used for indicating facial expression information of the target object presented by the facial image; the first determination unit 502 is configured to count the obtained facial expression recognition results; and the second determination unit 503 is configured to determine the facial expression information corresponding to the target object based on the statistical result.
In the present embodiment, in the facial expression recognition apparatus 500: the specific processing of the input unit 501, the first determining unit 502, and the second determining unit 503 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementations of the present embodiment, the preset number of facial expression recognition models are obtained by training through the following steps: acquiring a preset number of training sample sets, wherein training samples in the preset number of training sample sets comprise sample facial images and marking information for marking the sample facial images, and the marking information is used for indicating facial expression information corresponding to the sample facial images; training an initial facial expression recognition model to be trained by taking sample facial images in a preset number of training sample sets as input and taking marking information corresponding to the sample facial images as expected output so as to form the trained initial facial expression recognition model; for a training sample set in a preset number of training sample sets, adjusting the trained initial facial expression recognition model by taking a sample facial image in the training sample set as input and taking marking information corresponding to the sample facial image as expected output; and generating a facial expression recognition model corresponding to the training sample set as one of the preset number of facial expression recognition models based on the adjustment result.
In some optional implementation manners of this embodiment, training the initial facial expression recognition model to be trained by taking sample facial images in a preset number of training sample sets as input and taking annotation information corresponding to the sample facial images as expected output to generate the trained initial facial expression recognition model includes: the following training steps are performed: inputting sample facial images in a preset number of training sample sets to a feature extraction layer of an initial facial expression recognition model to be trained to obtain image features; inputting the obtained image features into a sub-network of an initial facial expression recognition model to be trained to generate a probability value for indicating that the facial expression presented by the sample facial image is the labeled facial expression; determining whether a preset loss function converges based on the obtained probability value corresponding to the sample face image; in response to determining that the preset loss function converges, it is determined that the initial facial expression recognition model training is complete.
In some optional implementations of the present embodiment, the facial expression recognition apparatus 500 further includes: an adjusting unit (not shown in the figures) configured to adjust parameters of the initial facial expression recognition model to be trained in response to determining that the preset loss function is not converged, and to continue performing the training step using a back propagation algorithm.
In some optional implementations of the present embodiment, the labeling information of the sample face images included in each of the preset number of training sample sets is labeled based on a different labeling manner.
The facial expression recognition apparatus provided by the embodiment of the disclosure trains a plurality of facial expression recognition models, inputs a facial image including a target object into the facial expression recognition models to obtain a plurality of facial expression recognition results, and then counts the plurality of facial expression recognition results to determine the facial expression information corresponding to the target object.
Referring now to fig. 6, shown is a schematic diagram of an electronic device (e.g., terminal device in fig. 1) 600 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the use range of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be included in the terminal device; or may exist separately without being assembled into the terminal device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: the facial expression recognition method comprises the steps of obtaining facial expression recognition results of a target object output by each facial expression recognition model by inputting an acquired facial image of the target object to a preset number of facial expression recognition models trained in advance, wherein the preset number of facial expression recognition results are used for indicating facial expression information of the target object presented by the facial image; the obtained facial expression recognition results are counted, and facial expression information corresponding to the target object is determined based on the statistical results.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an input unit, a first determination unit, and a second determination unit. Here, the names of these units do not in some cases constitute a limitation on the units themselves; for example, the input unit may also be described as a "unit that inputs the acquired face image of the target object to a preset number of facial expression recognition models trained in advance".
The foregoing description is only an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned technical features, and should also encompass other technical solutions formed by any combination of the above-mentioned technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure.