WO2020244434A1 - Facial expression recognition method and apparatus, electronic device, and storage medium - Google Patents

Facial expression recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020244434A1
WO2020244434A1 (PCT/CN2020/092593, CN2020092593W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
neural network
network model
facial
Prior art date
Application number
PCT/CN2020/092593
Other languages
English (en)
French (fr)
Inventor
樊艳波
张勇
李乐
吴保元
李志锋
刘威
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2020244434A1 publication Critical patent/WO2020244434A1/zh
Priority to US17/473,887, published as US20210406525A1

Classifications

    • G06V 10/56: Extraction of image or video features relating to colour
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2193: Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/255: Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 40/168: Feature extraction; Face representation
    • G06V 40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V 40/174: Facial expression recognition

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a facial expression recognition method, device, electronic equipment, and computer-readable storage medium.
  • Artificial intelligence is a comprehensive technology of computer science; by studying the design principles and implementation methods of various intelligent machines, it gives machines the capabilities of perception, reasoning, and decision-making. Artificial intelligence covers a wide range of fields, such as natural language processing and machine learning/deep learning. As the technology develops, artificial intelligence will be applied in more fields and play an increasingly important role.
  • With facial expression recognition technology, facial expressions in images can be recognized. However, in the related art, the accuracy of facial expression recognition is relatively low.
  • the embodiments of the present application provide a facial expression recognition method, device, electronic device, and computer-readable storage medium, which can improve the accuracy of recognizing facial expression types.
  • An embodiment of the present application provides a facial expression recognition method. The method is executed by an electronic device and includes: extracting a first feature from color information of pixels in a first image; extracting a second feature of facial key points from the first image; performing fusion processing on the first feature and the second feature to obtain a fusion feature; and determining a first expression type of the subject's face in the first image through the fusion feature.
  • An embodiment of the present application provides a facial expression recognition device, including a recognition unit configured to: extract a first feature from color information of pixels in a first image; extract a second feature of facial key points from the first image; perform fusion processing on the first feature and the second feature to obtain a fusion feature; and determine the first expression type of the subject's face in the first image through the fusion feature.
  • An embodiment of the present application provides a computer-readable storage medium that stores a program; when the program runs, the above facial expression recognition method is executed.
  • An embodiment of the present application provides an electronic device, including a memory and a processor, wherein the memory is used to store a computer program, and the processor is used to run the computer program in the memory and execute the aforementioned facial expression recognition method through the computer program.
  • In the embodiments of the present application, the first feature is extracted according to the color information of the pixels in the first image, the second feature of the facial key points is extracted from the first image, and the first feature and the second feature are used to determine the first expression type of the subject's face in the first image.
  • FIG. 1 is a schematic diagram of a hardware environment of a facial expression recognition method according to an embodiment of the present application
  • FIGS. 2A-2C are flowcharts of a facial expression recognition method according to an embodiment of the present application.
  • Fig. 3 is a schematic diagram of an application scenario of a facial expression recognition method according to an embodiment of the present application
  • FIG. 4 is a schematic diagram of an application scenario of a facial expression recognition method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an application scenario of a facial expression recognition method according to an embodiment of the present application.
  • FIG. 6A is a schematic structural diagram of a neural network model according to an embodiment of the present application.
  • FIG. 6B is a schematic diagram of a facial expression recognition framework according to an embodiment of the present application.
  • Fig. 7 is a schematic diagram of facial key points in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a facial map network structure of an embodiment of the present application.
  • Fig. 9 is a schematic diagram of a facial expression recognition device according to an embodiment of the present application.
  • FIG. 10 is a structural block diagram of a terminal according to an embodiment of the present application.
  • RGB color mode: an industry color standard in which colors are obtained by varying the three color channels of red (R), green (G), and blue (B) and superimposing them on each other; RGB refers to the colors of the red, green, and blue channels.
  • YUV: a color coding method.
  • CMYK: a color mode used in color printing.
  • In the related art, the accuracy of facial expression recognition is affected by the following factors: first, different people express facial expressions differently, so expressions vary from person to person; second, a person's expression changes in real time, and the same person's expressions differ across time periods, scenes, and other conditions; third, external conditions such as background, illumination, angle, and distance have a great influence on expression recognition.
  • To address this, the embodiments of the present application provide a facial expression recognition method.
  • The electronic device for facial expression recognition provided in the embodiments of the present application may be any of various types of terminal devices or servers. Taking a server as an example, it may be a server cluster deployed in the cloud that provides cloud services to users and encapsulates a facial expression recognition program.
  • The server deployed in the cloud calls the packaged facial expression recognition program to accurately recognize the facial expression from the first image (the embodiments of this application are not limited to human faces; faces of animals, cartoon characters, and the like are also applicable), and applies the recognized facial expression in fields such as human-computer interaction, autonomous driving, public safety monitoring, and medical health to improve people's quality of life. For example, in the field of human-computer interaction, after a machine recognizes an expression in a face image, it can respond accordingly to achieve barrier-free communication between humans and machines.
  • the aforementioned facial expression recognition method can be applied to a hardware environment composed of a terminal 101 and a server 103 as shown in FIG. 1.
  • the server 103 is connected to the terminal 101 through the network, and can be used to provide services (such as game services, application services, etc.) for the terminal or the client installed on the terminal.
  • The database 105 can be set on the server or independently of the server, and is used to provide data storage services for the server 103.
  • the aforementioned networks include but are not limited to: wide area networks, metropolitan area networks, or local area networks.
  • the terminal 101 is not limited to PCs, mobile phones, tablet computers, etc.
  • The facial expression recognition method in the embodiment of the present application can be executed by the server 103, by the terminal 101, or by the server 103 and the terminal 101 together; that is, the electronic device used for facial expression recognition can be the terminal 101 or the server 103. When the terminal 101 executes the facial expression recognition method of the embodiment of the present application, the method may also be executed by a client installed on it.
  • The following takes an application scenario in which the terminal 101 (the electronic device for facial expression recognition) recognizes facial expressions as an example to describe how the electronic device executes the facial expression recognition method provided in the embodiments of the present application.
  • the terminal 101 locally executes the facial expression recognition method provided in the embodiment of this application to complete the facial expression recognition of the first image.
  • An expression recognition application (APP) is installed on the terminal 101. The terminal 101 uses the neural network model to extract the first feature from the color information of the pixels in the first image and extract the second feature of the facial key points from the first image, determines the expression type of the subject's face in the first image according to the fusion feature of the first feature and the second feature, and displays the expression type of the first image on the display interface of the terminal 101.
  • The terminal 101 may also send the first image input by the user on the terminal 101 to the server 103 in the cloud through the network, and call the facial expression recognition function (the packaged facial expression recognition program) provided by the server 103 to recognize the facial expression of the first image.
  • For example, an expression recognition application is installed on the terminal 101. After the user inputs the first image, the terminal 101 sends the first image to the server 103. After receiving the first image, the server 103 calls the packaged facial expression recognition program, extracts the first feature from the color information of the pixels in the first image and the second feature of the facial key points from the first image through the neural network model, determines the facial expression type of the subject in the first image according to the fusion feature of the first feature and the second feature, and feeds back the expression type of the first image to the facial expression recognition application of the terminal 101; alternatively, the server 103 directly outputs the expression type of the first image.
  • Fig. 2A is a flowchart of a facial expression recognition method according to an embodiment of the present application.
  • the description is given by taking the server as the execution subject as an example.
  • As shown in FIG. 2A, the method may include the following steps (step S202 and step S206 are optional):
  • Step S202 The server obtains a recognition request from the terminal, and the recognition request is used to request recognition of the facial expression type of the subject in the first image.
  • the objects here are objects with expressions, such as humans, orangutans, etc.
  • the following unified descriptions are based on humans.
  • Facial expression recognition has been developed and applied more and more in the fields of human-computer interaction, automatic driving and medical health.
  • A terminal used for human-computer interaction, autonomous driving, or medical and health detection can collect the first image of a target object (such as a user, driver, passerby, or patient) and initiate a recognition request to recognize the expression type.
  • The expression type here can be angry, sad, disgusted, afraid, surprised, happy, normal, and so on.
  • Step S204: The server extracts the first feature from the color information of the pixels in the first image, extracts the second feature of the facial key points from the first image, performs fusion processing on the first feature and the second feature to obtain the fusion feature, and determines the first expression type of the subject's face in the first image through the fusion feature.
  • the embodiments of this application are not limited to neural network models, and other machine learning models are also applicable to the embodiments of this application.
  • The color coding of the pixels in the above first image can be one of RGB, YUV, CMYK, or other color coding modes. RGB is used as an example for description; the other color coding modes are handled similarly and will not be repeated.
  • The above first feature is an extracted texture feature related to the expression; the second feature is a feature of the facial components (such as at least one of the five facial features) and/or the facial contour, and the facial key points are feature points that describe the facial components and/or the facial contour.
  • Through the learning of the neural network model, the commonalities in expressing facial expressions between different objects and within the same object can be learned. Based on the first feature (which accurately represents the facial texture of the object) and the second feature (which represents the relationships among the various parts of the face), together with the relationship between these features and facial expression categories learned in advance, the facial expression of the current object can be accurately identified. In addition, fusing the second feature with the first feature helps avoid incorrect recognition caused by using the first feature alone (the aforementioned unfavorable factors may make the extraction of the first feature inaccurate).
  • the first feature and the second feature can be fused to obtain the fusion feature, and the object in the first image can be determined by the fusion feature The first facial expression type.
  • The fusion processing can be: performing a weighted summation of the first feature and the second feature based on their respective weights and using the result of the weighted summation as the fusion feature; or performing a linear/non-linear mapping of the first feature and the second feature and then joining (concatenating) the mapped first feature and second feature to realize feature fusion.
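  • As an illustrative sketch (not the patent's reference implementation), the two fusion options described above can be written as follows; the feature dimensions and the learnable weights alpha/beta are assumptions made for this example.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the two fusion options: weighted summation or concatenation."""
    def __init__(self, d_rgb=512, d_lmk=128, d_out=256, mode="sum"):
        super().__init__()
        self.proj_rgb = nn.Linear(d_rgb, d_out)   # maps the first (texture) feature
        self.proj_lmk = nn.Linear(d_lmk, d_out)   # maps the second (key point) feature
        self.mode = mode
        # learnable fusion weights for the weighted-sum variant
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_rgb, f_lmk):
        p_rgb = torch.relu(self.proj_rgb(f_rgb))
        p_lmk = torch.relu(self.proj_lmk(f_lmk))
        if self.mode == "sum":
            return self.alpha * p_rgb + self.beta * p_lmk   # weighted summation
        return torch.cat([p_rgb, p_lmk], dim=-1)            # joining / concatenation
```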
  • In some embodiments, a neural network model is used to identify the first expression type from the first image: the neural network model extracts the first feature according to the color information of the pixels in the first image, extracts the second feature of the facial key points from the first image, and uses the first feature and the second feature to determine the first expression type of the subject's face in the first image.
  • Step S206 in response to the recognition request, the server returns the recognized first expression type to the terminal.
  • the server can accurately recognize the facial expression from the first image through the neural network model, and return the facial expression to the terminal.
  • the following describes the training of the neural network model, as follows:
  • The embodiment of the application provides a multi-modal facial expression recognition scheme based on encoded images (such as RGB images) and facial key points (landmarks). As shown in FIG. 6A, the neural network model in the scheme includes, connected in sequence, a convolutional neural network (CNN) for extracting the first feature (texture feature) of an image, a graph neural network (GNN) for extracting the second feature of the facial key points, a fusion layer, and a classification network (which can include a fully connected layer and a classification layer).
  • The scheme uses the convolutional neural network to model and learn RGB images and the graph neural network to model and learn facial key points, merges the features of the two modalities (RGB image and facial key points) through the fusion layer to obtain the fusion feature, and performs facial expression recognition based on the fusion feature through the classification network. By modeling the correlation and complementarity between the RGB image and the facial key points, more robust facial expression recognition can be achieved.
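  • A minimal PyTorch-style sketch of the four-part structure described above (CNN branch, graph branch, fusion layer, classification network). All layer sizes, the single graph-convolution-like step, and the concatenation fusion are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    """CNN branch + graph branch + fusion + classifier, roughly as in FIG. 6A."""
    def __init__(self, num_classes=7, d_feat=256):
        super().__init__()
        # f_cnn: texture features from the RGB image
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_feat),
        )
        # f_gcn: a single graph-convolution-like layer over key point coordinates
        self.node_proj = nn.Linear(2, d_feat)    # each node carries its (x, y) position
        self.graph_proj = nn.Linear(d_feat, d_feat)
        # fusion (concatenation here) + fully connected classification network
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_feat, d_feat), nn.ReLU(),
            nn.Linear(d_feat, num_classes),
        )

    def forward(self, rgb, landmarks, adj):
        y_rgb = self.cnn(rgb)                                # (B, d_feat)
        h = torch.relu(self.node_proj(landmarks))            # (B, N, d_feat)
        h = torch.relu(self.graph_proj(torch.bmm(adj, h)))   # neighbourhood aggregation
        y_lmk = h.mean(dim=1)                                # (B, d_feat)
        fused = torch.cat([y_rgb, y_lmk], dim=-1)            # fusion feature
        return self.classifier(fused)                        # expression logits
```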
  • Compared with hand-crafted features, the graph neural network can describe the associations between facial key points more flexibly and efficiently, and can extract facial key point features with stronger discriminative ability.
  • the embodiments of the present application are not limited to convolutional neural networks and graph neural networks, and other models may also be used to implement feature extraction of RGB images and key points of human faces. The technical solution of the present application will be described in detail below in conjunction with the steps shown in FIG. 2A.
  • Figure 2B shows that before step S202 in Figure 2A is performed, the neural network model can be pre-trained in the following manner: the training set is input to the neural network model, and the neural network model outputs a predicted result; if there is an error between the predicted result and the expected result, the error is calculated and back-propagated through the neural network model to adjust the parameter values of all layers, including the convolutional neural network, the graph neural network, the fusion layer, and the classification network; the above process is iterated until convergence to complete the training of the neural network model. The training includes the following steps:
  • Step S11 Obtain a training set, where the training images in the training set are identified with expression types and the color coding type of the training images is the same as the first image.
  • a data set (such as the AffectNet face expression data set) can be obtained in advance, and the images in the data set can be divided into a training set and a test set.
  • The division can be random, so that the features of the images in the training set and the test set keep the same or basically the same distribution. Regarding proportions, the number of images in the training set is generally greater than that in the test set; for example, the training set accounts for 80% of the data set and the test set accounts for the remaining 20%.
  • Step S12: The training images in the training set are used as the input of the neural network model, and the neural network model is trained to obtain the initial neural network model. The initial neural network model is obtained by initializing the weights in the network layers of the neural network model, taking the training images in the training set as input and the expression types identified by the training images as the expected output.
  • In the neural network model, each neuron has input connections and output connections. These connections simulate the behavior of synapses in the brain, transmitting signals from one neuron to another.
  • Each connection has a weight, that is, the value sent along each connection is multiplied by this weight. The weight is roughly analogous to the amount of neurotransmitter transferred between biological neurons: an important connection has a greater weight value than an unimportant one.
  • The training process is the process of assigning these weights. In this technical solution, supervised learning can be used.
  • In supervised learning, the training set includes inputs (such as the RGB encoding of the image and the facial key point graph) and the desired output (i.e., the facial expression type). In this way, the network can check the difference between its calculated result and the expected output and take appropriate processing accordingly.
  • Each training image in the training set includes the input value and the expected output.
  • When the network calculates the output for one of the inputs (the weight values can be randomly assigned at the beginning), the corresponding error can be calculated according to the error function. This error indicates how close the model's actual output is to the expected output.
  • the error function used here is the mean square error function, as shown in formula (1):
  • x represents the input in the training set
  • y(x) represents the output generated by the neural network model
  • a represents the expected output
  • the mean square error function is a function of w and b
  • w represents the weight
  • b represents the biases.
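  • The body of formula (1) did not survive extraction; with the symbol definitions above (x an input from the training set, y(x) the output generated by the neural network model, a the expected output, and w, b the weights and biases), the standard quadratic-cost form over n training samples would be the following (n is an assumed symbol for the training-set size):

```latex
C(w, b) = \frac{1}{2n} \sum_{x} \lVert a - y(x) \rVert^{2} \tag{1}
```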
  • Step S13 Obtain the second expression type output by the initial neural network model when the test image in the test set is the input of the initial neural network model.
  • the test image in the test set is identified with the expression type and the color coding type is the same as the first image.
  • Step S14 When the matching accuracy between the second expression type output by the initial neural network model and the expression type identified by the test image in the test set reaches the target threshold, the initial neural network model is used as the trained neural network model.
  • The above matching accuracy rate is obtained by evaluating the output of the initial neural network model on multiple test images. For example, if 95 of 100 test images are correctly identified, the matching accuracy rate is 95%; if the target threshold is 98%, the actual matching accuracy is less than the target threshold, indicating that the model is under-fitted and the initial neural network model needs further training. If 99 of the 100 test images are correctly identified, the model is relatively mature and can be put into practical application.
  • Step S15 When the matching accuracy between the second expression type output by the initial neural network model and the expression type identified by the test image in the test set is less than the target threshold, the training image in the training set is used as the input of the initial neural network model, Continue to train the initial neural network model until the matching accuracy between the second expression type output by the initial neural network model and the expression type identified by the test image in the test set reaches the target threshold.
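  • A hedged sketch of the train-until-threshold loop in steps S12-S15. It reuses the `ExpressionNet` interface from the earlier sketch; the data loaders, optimizer, and loss function are assumed to be supplied by the caller, and none of this is the patent's actual training code.

```python
import torch

def matching_accuracy(model, loader):
    """Fraction of test images whose predicted expression matches the identified one (steps S13-S14)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for rgb, landmarks, adj, label in loader:
            pred = model(rgb, landmarks, adj).argmax(dim=-1)
            correct += (pred == label).sum().item()
            total += label.numel()
    return correct / total

def fit_until_threshold(model, optimizer, loss_fn, train_loader, test_loader,
                        target=0.98, max_epochs=100):
    """Steps S12-S15: train, measure the matching accuracy, keep training until the threshold."""
    for _ in range(max_epochs):
        model.train()
        for rgb, landmarks, adj, label in train_loader:        # step S12
            optimizer.zero_grad()
            loss = loss_fn(model(rgb, landmarks, adj), label)
            loss.backward()                                     # back-propagate the error
            optimizer.step()
        if matching_accuracy(model, test_loader) >= target:     # steps S13-S14
            break                                               # threshold reached
    return model                                                # otherwise trained up to the epoch budget
```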
  • the model can be used to recognize facial expression types.
  • In step S202, the server obtains the recognition request from the terminal, and the recognition request is used to request recognition of the facial expression type of the subject in the first image.
  • The recognition request may be the first image itself, or a request message that carries indication information of the first image (such as the image identifier and storage address of the first image).
  • In step S204, the first expression type of the subject's face in the first image is determined from the first image. See FIG. 2C, which shows that step S204 in FIG. 2A includes steps S2042-S2044:
  • Step S2042: In the convolutional neural network, use the color information of the pixels in the first image to extract the first feature representing the texture in the first image; and in the graph neural network, extract the second feature representing the associations between facial key points, where the facial key points represent the component parts and/or contour of the subject's face.
  • Using the color information of the pixels in the first image to extract the first feature representing the texture in the first image includes: taking the color coding data of the pixels in the first image (such as the RGB coding data of the first image) as the input of the convolutional neural network, where the convolutional neural network performs convolution operations on the color coding of the pixels in the first image to obtain the first feature (for example, the feature of raised eye corners that describes a smile); and obtaining the first feature output by the convolutional neural network.
  • In order to improve recognition accuracy, the first image may be preprocessed so that its resolution, length and width, and reference points meet the requirements.
  • When the color coding of the pixels in the first image is used as the input of the convolutional neural network, if the position of the reference point in the first image differs from the position of the reference point in the picture template, a cropping operation and/or a zooming operation is performed on the first image. For example: move the first image so that its reference point coincides with the reference point of the template, zoom with the reference point as the origin so that its resolution is the same as that of the template, and then crop it so that its length and width are the same as those of the template, thereby obtaining the second image. In the second image, the position of the reference point is the same as the position of the reference point in the picture template, and the color coding of the pixels in the second image is used as the input of the convolutional neural network.
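  • A hedged sketch of that translate-scale-crop preprocessing. The template size, reference point convention, scale handling, and use of OpenCV are assumptions of this example, not requirements of the patent.

```python
import cv2
import numpy as np

def align_to_template(image, ref_point, template_ref=(48, 56),
                      template_size=(96, 112), scale=1.0):
    """Shift the image so its reference point lands on the template's reference point,
    scale it about that point, and crop to the template's width and height."""
    rx, ry = ref_point
    sx, sy = template_ref
    # affine matrix: scale about the reference point, then translate it onto the template's
    m = np.float32([[scale, 0.0, sx - scale * rx],
                    [0.0, scale, sy - scale * ry]])
    w, h = template_size
    second_image = cv2.warpAffine(image, m, (w, h))  # warping to (w, h) also performs the crop
    return second_image
```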
  • Multiple third images can be used to determine the facial key points, the associations between the key points, and the association weights between the key points, where a third image is an image identified with an expression type.
  • Specifically, the multiple third images can be analyzed to determine the key points around the facial features and facial contour that are directly related to expressions (that is, points that move when affected by an expression or that can reflect an expression). Key points with an association relationship are points that move together under the same expression, and the association weight is the degree of association between two key points (for example, it can be obtained from studies over different populations followed by processing such as normalization). Taking the facial key points as nodes, connecting edges between nodes whose key points have an association relationship, and using the association weight between the associated key points as the weight of the corresponding edge, the first face graph is obtained.
  • The second face graph can then be determined according to the first face graph, where the first face graph includes nodes representing the facial key points, edges between nodes representing associations between the facial key points, and the association weights of the edges. The second face graph is obtained by adding, to the first face graph, the positions in the first image of the facial key points corresponding to the nodes; feature extraction is then performed on the second face graph to obtain the second feature.
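  • A sketch of building the first and second face graphs as matrices; the edge list and weights are placeholders here, since the patent derives them from statistics over the third images.

```python
import numpy as np

def build_first_face_graph(num_keypoints, edges, edge_weights):
    """First face graph: key points as nodes, associated pairs as weighted edges.

    edges        -- list of (i, j) index pairs of key points that have an association
    edge_weights -- association weight per pair (e.g. normalised co-movement statistics)
    """
    A = np.zeros((num_keypoints, num_keypoints), dtype=np.float32)  # 0-1 adjacency
    W = np.zeros_like(A)                                            # edge weights
    for (i, j), w in zip(edges, edge_weights):
        A[i, j] = A[j, i] = 1.0
        W[i, j] = W[j, i] = w
    return A, W

def build_second_face_graph(A, W, landmark_xy):
    """Second face graph: attach each node's position in the first image as its feature."""
    node_features = np.asarray(landmark_xy, dtype=np.float32)       # shape (N, 2)
    return A, W, node_features
```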
  • Step S2044: In the classification network, based on the correspondences between different fusion features and different expression types learned through pre-training, the first expression type corresponding to the fusion feature of the first feature and the second feature is identified from among the multiple expression types.
  • In the related art, there are facial expression recognition schemes based on RGB images and facial expression recognition schemes based on facial key points.
  • The expression recognition scheme based on RGB images mainly extracts expression-related features (that is, the first feature) from the face image and classifies them; however, because RGB images are greatly affected by factors such as illumination changes and occlusion, facial expression recognition systems that rely only on RGB image data have poor robustness.
  • In schemes based on facial key points, the facial key points mainly refer to the points where the facial features and the facial contour are located. The position information of these points is closely related to facial expressions, and as the prediction of facial key points becomes more and more accurate, facial expression recognition based on key points also becomes more and more accurate.
  • Hand-crafted features can be used in facial expression recognition based on key points, with shallow models such as support vector machines (SVM) used for expression classification. Because the facial key points carry rich structural information and there are close associations between different key points, such a solution can recognize facial expressions; however, hand-designed features cannot flexibly and effectively model the rich and complex associations between different key points, resulting in poor performance of key-point-based facial expression recognition.
  • RGB images provide richer facial texture information but are not very robust to changes in lighting, whereas expression recognition based on facial key points is more robust to lighting changes but loses most of the texture information.
  • the fusion of RGB images and key points of the face is very helpful for facial expression recognition.
  • the embodiments of the present application provide a multi-modal facial expression recognition solution based on RGB images and face key points.
  • the solution uses the complementarity of RGB images and face key points to realize more robust facial expression recognition. Aiming at the fact that manual design features cannot efficiently describe the association of key points of the face, this solution uses graph neural network to flexibly and efficiently model the key points of the face.
  • The graph neural network can adaptively learn the associations between key points, which significantly improves the performance of facial expression recognition based on key points.
  • step S206 in response to the recognition request, the server returns the recognized first expression type to the terminal.
  • After step S206, the feedback information of the terminal can be obtained, where the feedback information is used to indicate whether the recognized first expression type is correct. When the feedback information indicates that the recognized first expression type is incorrect, a fourth image with the same image characteristics as the first image is used to continue training the neural network model. The fourth image can be an image with the same facial expression type as the first image, or an image with the same background type. Adopting this technical solution is equivalent to strengthening the recognition of the weak links of the neural network model.
  • Facial expression recognition has been developed and applied more and more in the fields of human-computer interaction, autonomous driving and medical health.
  • The embodiments of this application can be used to assist robots in recognizing human emotions and psychology, thereby improving the user experience in human-computer interaction products, as shown in Figure 3; for example, the robot 301 can relieve a person's emotion by telling jokes and the like, improving the user experience.
  • The embodiments of this application can also be used for customer satisfaction analysis in shopping malls, banks, and the like; for example, the monitor at the bank service window 401 captures the customer's facial expressions during a transaction, and the facial expressions in the surveillance video are analyzed to determine the customer's satisfaction with the transaction at the bank.
  • The embodiments of this application can also be used for the simulation and generation of animated expressions, such as recognizing the expression of a real human face and transferring it naturally to an animated character; as shown in Figure 5, when a person is recognized as making a sad expression, the animated character 501 will also present a corresponding sad expression.
  • the technical solution of the present application will be described in detail below in conjunction with the embodiments.
  • the embodiment of the application provides a multi-modal facial expression recognition system based on RGB images and face key points.
  • Figure 6B shows the multi-modal facial expression recognition framework. Given an image to be recognized, face detection and face alignment are first performed and the facial key point information is extracted; then the convolutional neural network is used to adaptively learn features of the RGB image, while the graph neural network adaptively models the associations between facial key points and performs key point feature learning; the obtained RGB features and key point features are fused together for the final classification. The entire recognition system supports end-to-end training and prediction.
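  • One possible front end for this framework, sketched with dlib as an assumed (not patent-specified) detector and 68-point landmark predictor; the predictor model path is a placeholder.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # placeholder path

def detect_and_landmark(image_rgb):
    """Face detection + key point extraction, feeding the CNN and GNN branches."""
    rects = detector(image_rgb, 1)                      # face detection
    if not rects:
        return None, None
    rect = rects[0]
    shape = predictor(image_rgb, rect)                  # 68 facial key points
    landmarks = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)],
                         dtype=np.float32)
    face_patch = image_rgb[rect.top():rect.bottom(), rect.left():rect.right()]
    return face_patch, landmarks                        # RGB patch + key point coordinates
```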
  • Given an aligned face image, the model first extracts the facial key point information from the image, as shown in Figure 7; for example, key points 701-702 are points representing the contour of the face.
  • Facial key points (such as the points numbered 1-68) locate the key areas of the face, such as the face contour, eyebrows, eyes, nose, and mouth. When the same person makes different expressions, the positions of the facial key points are usually different, so the facial key points can be used to assist facial expression recognition.
  • There are usually complex associations between facial key points; for example, when a "surprise" expression is made, the positions of the key points near the eyebrows and eyes usually change together.
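  • For reference, assuming the widely used 68-point landmark convention that the numbering 1-68 in FIG. 7 suggests (this grouping is an assumption of this sketch, not a definition given in the patent), the key areas map to index ranges roughly as follows:

```python
# Approximate grouping under the common 68-point convention (1-based indices);
# treat these ranges as illustrative assumptions rather than the patent's definition.
FACE_REGIONS = {
    "jaw/contour": range(1, 18),   # points 1-17
    "eyebrows":    range(18, 28),  # points 18-27
    "nose":        range(28, 37),  # points 28-36
    "eyes":        range(37, 49),  # points 37-48
    "mouth":       range(49, 69),  # points 49-68
}
```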
  • The embodiment of the present application uses a graph neural network to efficiently model the facial key points, treating the face as highly structured data. According to the distribution of the facial features, the facial key points are composed into a graph network structure, as shown in Figure 8: each vertex in the graph represents a facial key point, and each edge represents an association between facial key points; for example, edge 801 represents an association between key points on the face contour.
  • f_gcn represents the graph neural network. Since the graph adjacency matrix A is a 0-1 matrix, it can only indicate whether there is an association between key points, but cannot measure the weights of different edges, while the strength of association differs between different key point pairs.
  • Therefore, the embodiment of the application introduces a learnable parameter W, and the facial key point feature learning based on the graph neural network is expressed as shown in formula (3), where the weighted adjacency matrix is a modified adjacency matrix whose weights W are learned adaptively, and Y_landmark represents the feature obtained from the facial key points.
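  • The bodies of the graph-branch formulas were lost in extraction; a plausible reconstruction consistent with the surrounding text is given below, where X_landmark is an assumed symbol for the key point input, A is the 0-1 adjacency matrix, W the learnable edge weights, and ⊙ an element-wise product. The first line is the unweighted form and the second corresponds to formula (3); the exact form in the patent may differ.

```latex
Y_{\text{landmark}} = f_{gcn}\left(A,\; X_{\text{landmark}}\right)
```

```latex
Y_{\text{landmark}} = f_{gcn}\left(A \odot W,\; X_{\text{landmark}}\right) \tag{3}
```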
  • RGB image features are extracted from RGB images.
  • RGB images are directly obtained from face images after face detection and alignment processing.
  • convolutional neural networks are used in image feature learning and image recognition fields
  • X_rgb represents the original RGB input of the image.
  • The RGB image feature obtained in the embodiment of the application is expressed as shown in formula (4):
  • f_cnn is a convolutional neural network operating on RGB images
  • Y_rgb represents the learned RGB image features
  • RGB image information and face key point information complement each other.
  • This method combines the learned facial key point feature Y_landmark and the RGB image feature Y_rgb to obtain the overall feature Y, as shown in formula (5):
  • g represents feature fusion
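  • Formulas (4) and (5) also did not survive extraction; given the definitions above (f_cnn the convolutional network, X_rgb the raw RGB input, g the fusion operation), they would plausibly read:

```latex
Y_{rgb} = f_{cnn}\left(X_{rgb}\right) \tag{4}
```

```latex
Y = g\left(Y_{\text{landmark}},\; Y_{rgb}\right) \tag{5}
```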
  • the fully connected network is used to perform expression classification based on the fused feature Y.
  • The entire network structure, including the facial key point feature extraction branch f_gcn, the RGB image feature extraction branch f_cnn, and the fully connected classification network, can be trained end to end, and a weighted loss function is minimized during network training to alleviate the severe category imbalance in facial expression recognition.
  • For example, a weighted summation is performed on the facial key point feature Y_landmark and the RGB image feature Y_rgb, and the result of the weighted summation is used as the fused feature Y to achieve feature fusion; a prediction is then made on the fused feature Y through the fully connected network to obtain the expression of the face image. When the facial key point feature contributes relatively more to facial expression recognition, its weight is larger relative to the weight of the RGB image feature Y_rgb.
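  • A hedged sketch of one way to realize the weighted loss mentioned above, using inverse-frequency class weights in cross-entropy; the specific weighting scheme and the class counts are assumptions, since the patent only states that a weighted loss is minimized to ease category imbalance.

```python
import torch
import torch.nn as nn

def make_weighted_loss(class_counts):
    """Cross-entropy whose per-class weights are inversely proportional to class frequency."""
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)   # rare classes get larger weights
    return nn.CrossEntropyLoss(weight=weights)

# Example with seven expression classes (the counts below are made-up placeholders):
loss_fn = make_weighted_loss([25000, 4000, 6000, 130000, 75000, 25000, 14000])
logits = torch.randn(8, 7)            # classifier output on the fused feature Y for a batch of 8
labels = torch.randint(0, 7, (8,))
loss = loss_fn(logits, labels)
```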
  • In the experiments, this method uses the AffectNet facial expression data set, which contains seven basic facial expressions: anger, disgust, fear, happiness, neutral, sadness, and surprise.
  • the distribution of its data is shown in Table 1:
  • Per-class recognition accuracy (%) of the hand-crafted-feature baselines (columns follow the expression order above; the last column is the average):

    Model            Anger  Disgust  Fear  Happy  Neutral  Sad   Surprise  Average
    Landmark-Linear  11.4   28.4     9.6   67.6   10.2     35.6  39.3      28.9
    Landmark-SVM     20.7   0.0      0.0   100.0  3.3      2.9   9.8       19.5
  • Table 2 shows the recognition accuracy of the expression recognition model based on graph-neural-network key point features (Landmark-GCN) under the seven expressions of AffectNet; the last column is the average recognition accuracy.
  • Table 2 also gives the classification accuracy of key-point facial expression recognition models based on hand-designed features: a linear classification model (Landmark-Linear) and an SVM classification model (Landmark-SVM). It can be seen that the facial key point features extracted by the graph neural network proposed by this method have good discriminability, and the recognition effect is significantly better than that of the models based on hand-designed features.
  • Table 3 shows the facial expression recognition based on RGB image features, the facial expression recognition based on key points of the graph neural network and the facial expression recognition based on multimodal fusion.
  • This application proposes a facial expression recognition method based on multi-modal information fusion. The method jointly considers the complementary information of the RGB image and the facial key points, which can significantly improve the accuracy of facial expression recognition.
  • This application is suitable for improving the user experience in human-computer interaction products, assisting shopping malls, banks, etc. to analyze customer satisfaction, and assisting in the simulation and generation of animated expressions.
  • The embodiment of the application constructs a facial key point graph network structure based on the structural information of the face.
  • The number and positions of the facial key points are not limited to those shown in FIG. 7, and the graph network structure of the facial key points is not limited to that shown in FIG. 8; any number of key points and any graph network structure may be used.
  • The embodiment of the present application adopts a convolutional neural network and a graph neural network to model the RGB image and the facial key points respectively, and is not limited to any particular type of convolutional neural network or graph neural network.
  • The method according to the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of this application, or the part that contributes to the existing technology, can essentially be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) execute the method described in each embodiment of the present application.
  • FIG. 9 is a schematic diagram of a facial expression recognition apparatus according to an embodiment of the present application.
  • The apparatus may include a first obtaining unit 901, a recognition unit 903, and a response unit 905 (in the embodiment of the present application, the first obtaining unit 901 and the response unit 905 are optional).
  • the first obtaining unit 901 is configured to obtain a recognition request of the terminal, where the recognition request is used to request recognition of the facial expression type of the subject in the first image.
  • The recognition unit 903 is configured to extract the first feature from the color information of the pixels in the first image, extract the second feature of the facial key points from the first image, perform fusion processing on the first feature and the second feature to obtain the fusion feature, and determine the first expression type of the subject's face in the first image through the fusion feature.
  • the response unit 905 is configured to return the recognized first expression type to the terminal in response to the recognition request.
  • The first obtaining unit 901 in this embodiment may be configured to perform step S202 in the embodiment of the present application, the recognition unit 903 may be configured to perform step S204, and the response unit 905 may be configured to perform step S206.
  • In some embodiments, the neural network model is used to identify the first expression type from the first image: the neural network model extracts the first feature according to the color information of the pixels in the first image, extracts the second feature of the facial key points from the first image, and uses the first feature and the second feature to determine the first expression type of the subject's face in the first image. By fusing image features and facial key point features, the accuracy of recognizing facial expression types can be improved, thereby achieving the technical effect of accurately recognizing facial expressions.
  • The recognition unit may include: a processing module configured to extract, through the convolutional neural network, the first feature representing the texture in the first image from the color information of the pixels in the first image; extract, through the graph neural network, the second feature representing the associations between the facial key points, where the facial key points represent the components and/or contour of the subject's face; and perform fusion processing on the first feature and the second feature through the fusion layer to obtain the fusion feature; and a recognition module configured to recognize, through the classification network, the first expression type corresponding to the fusion feature from among multiple expression types.
  • the processing module may be further configured to: perform a weighted summation on the first feature and the second feature based on the weights of the first feature and the second feature, and use the result of the weighted summation as the fusion feature; or, The first feature and the second feature are spliced together to obtain a fusion feature.
  • the processing module may also be configured to: use the color coding of the pixels in the first image as the input of the convolutional neural network, where the convolutional neural network is used to perform the color coding of the pixels in the first image Convolution operation to obtain the first feature; obtain the first feature output by the convolutional neural network.
  • When the processing module uses the color coding of the pixels in the first image as the input of the convolutional neural network, it can also be configured to: when the position of the reference point in the first image differs from the position of the reference point in the picture template, crop and/or zoom the first image to obtain a second image in which the position of the reference point is the same as the position of the reference point in the picture template; and use the color coding of the pixels in the second image as the input of the convolutional neural network.
  • The processing module may also be configured to: add, to the first face graph, the positions in the first image of the facial key points corresponding to the nodes, to obtain the second face graph, wherein the first face graph includes nodes representing the facial key points, edges between nodes representing associations between the facial key points, and the association weights of the edges; and perform feature extraction on the second face graph to obtain the second feature.
  • The processing module may also be configured to: determine the facial key points, the associations between the key points, and the association weights between the key points according to multiple third images, where a third image is an image identified with an expression type; and take the facial key points as nodes, connect edges between nodes whose key points have an association relationship, and use the association weights between the associated key points as the weights of the edges to obtain the first face graph.
  • The above apparatus may further include: a second obtaining unit configured to obtain a training set, wherein the training images in the training set are identified with expression types and their color coding type is the same as that of the first image; and a training unit configured to take the training images in the training set as the input of the neural network model and train the neural network model to obtain the initial neural network model, where the initial neural network model is obtained by initializing the weights in the network layers of the neural network model, taking the training images in the training set as input and the expression types identified by the training images as the expected output.
  • The apparatus may further include a third obtaining unit configured to obtain the second expression type output by the initial neural network model when the test images in the test set are used as the input of the initial neural network model, wherein the test images in the test set are identified with expression types and their color coding type is the same as that of the first image.
  • The apparatus may further include a determining unit configured to, when the matching accuracy between the second expression type output by the initial neural network model and the expression types identified by the test images in the test set reaches the target threshold, use the initial neural network model as the trained neural network model. The training unit is further configured to, when the matching accuracy between the second expression type output by the initial neural network model and the expression types identified by the test images in the test set is less than the target threshold, take the training images in the training set as the input of the initial neural network model and continue training the initial neural network model until the matching accuracy reaches the target threshold.
  • The foregoing apparatus may further include a feedback unit configured to obtain feedback information, where the feedback information is used to indicate whether the recognized first expression type is correct; when the feedback information indicates that the recognized first expression type is incorrect, a fourth image with the same image characteristics as the first image is used to train the neural network model.
  • the above-mentioned modules can run in the hardware environment as shown in FIG. 1, and can be implemented by software or hardware, where the hardware environment includes a network environment.
  • the embodiment of the present application provides a server or terminal for implementing the above facial expression recognition method.
  • FIG. 10 is a structural block diagram of a terminal according to an embodiment of the present application.
  • the terminal may include: one or more (only one is shown in FIG. 10) processor 1001, memory 1003, and transmission device 1005
  • the terminal may also include an input and output device 1007.
  • The memory 1003 can be used to store software programs and modules, such as the program instructions/modules corresponding to the facial expression recognition method and apparatus in the embodiments of the present application. The processor 1001 runs the software programs and modules stored in the memory 1003 to perform various functional applications and data processing, thereby realizing the above facial expression recognition method.
  • the memory 1003 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 1003 may include a memory remotely provided with respect to the processor 1001, and these remote memories may be connected to the terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the aforementioned transmission device 1005 is used to receive or send data via a network, and can also be used to transmit data between the processor and the memory.
  • the foregoing network examples may include wired networks and wireless networks.
  • the transmission device 1005 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and routers through a network cable so as to communicate with the Internet or a local area network.
  • the transmission device 1005 is a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
  • the memory 1003 is used to store application programs.
  • the processor 1001 may call the application program stored in the memory 1003 through the transmission device 1005 to perform the following steps:
  • obtain the recognition request of the terminal, where the recognition request is used to request recognition of the expression type of the subject's face in the first image; extract the first feature from the color information of the pixels in the first image; extract the second feature of the facial key points from the first image; fuse the first feature and the second feature to obtain a fused feature; determine the first expression type of the subject's face in the first image through the fused feature; and, in response to the recognition request, return the recognized first expression type to the terminal.
  • the processor 1001 is further configured to perform the following steps:
  • obtain a training set, where the training images in the training set are labeled with expression types and have the same color coding type as the first image;
  • take the training images in the training set as the input of the neural network model and train the neural network model to obtain an initial neural network model, where the initial neural network model is obtained by initializing the weights in the network layers of the neural network model, with the training images in the training set as the input and the expression types labeled on the training images as the expected output;
  • obtain the second expression type output by the initial neural network model when the test images in the test set are used as its input, where the test images in the test set are labeled with expression types and have the same color coding type as the first image;
  • when the matching accuracy between the second expression type output by the initial neural network model and the expression type labeled on the test images in the test set reaches the target threshold, use the initial neural network model as the trained neural network model;
  • when the matching accuracy is less than the target threshold, take the training images in the training set as the input of the initial neural network model and continue training the initial neural network model until the matching accuracy between the second expression type output by the initial neural network model and the expression type labeled on the test images in the test set reaches the target threshold.
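As an illustration of this training procedure, a minimal PyTorch-style sketch follows; the model interface, data loaders, optimizer, and the target threshold value are assumptions made for the example rather than the embodiment's actual settings.

```python
import torch

def train_until_threshold(model, train_loader, test_loader,
                          target_threshold=0.98, lr=1e-3, max_epochs=100):
    """Train the initial model until its matching accuracy on the labeled
    test set reaches the (assumed) target threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(max_epochs):
        model.train()
        for images, landmarks, labels in train_loader:
            optimizer.zero_grad()
            logits = model(images, landmarks)      # predicted expression scores
            loss = criterion(logits, labels)       # compare with labeled expression types
            loss.backward()                        # propagate the error back through the network
            optimizer.step()                       # adjust the weights accordingly

        # Evaluate matching accuracy on the test (validation) set
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, landmarks, labels in test_loader:
                preds = model(images, landmarks).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        if correct / total >= target_threshold:    # stop once the threshold is reached
            break
    return model
```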
  • the recognition request of the terminal is obtained, where the recognition request is used to request recognition of the expression type of the subject's face in the first image;
  • the neural network model is used to recognize the first expression type from the first image; the neural network model extracts the first feature from the color information of the pixels in the first image, extracts the second feature of the facial key points from the first image, and uses the first feature and the second feature to determine the first expression type of the subject's face in the first image;
  • the recognized first expression type is returned to the terminal; by fusing image features and facial key point features, the accuracy of recognizing facial expression types can be improved, thereby achieving the technical effect of accurately recognizing facial expressions.
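To make the two-branch recognition model concrete, here is a hedged PyTorch sketch with a CNN branch over the RGB pixels, a simple graph branch over the landmark coordinates, and a classifier over the fused feature; the layer sizes, the 68-landmark count, the identity placeholder adjacency, and the seven expression classes are illustrative assumptions, not the patented network.

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    """Illustrative two-branch model: CNN over RGB pixels + graph branch over landmarks."""
    def __init__(self, num_landmarks=68, num_classes=7):
        super().__init__()
        # CNN branch: first feature from pixel color information (texture)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Graph branch: second feature from landmark coordinates (x, y per key point);
        # identity is only a placeholder, a real face graph would connect associated key points
        self.register_buffer("adj", torch.eye(num_landmarks))
        self.gcn_weight = nn.Linear(2, 64)
        # Classifier over the fused feature
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, image, landmarks):
        rgb_feat = self.cnn(image)                          # (B, 64) texture feature
        agg = torch.matmul(self.adj, landmarks)             # aggregate neighboring landmarks
        lmk_feat = self.gcn_weight(agg).relu().mean(dim=1)  # (B, 64) landmark feature
        fused = torch.cat([rgb_feat, lmk_feat], dim=1)      # feature fusion by concatenation
        return self.classifier(fused)                       # expression-type scores
```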
  • the structure shown in FIG. 10 is only illustrative, and the terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or other terminal device.
  • FIG. 10 does not limit the structure of the above electronic device.
  • the terminal may also include more or fewer components (such as a network interface, a display device, etc.) than shown in FIG. 10, or have a configuration different from that shown in FIG. 10.
  • those of ordinary skill in the art can understand that all or part of the steps in the methods of the foregoing embodiments can be completed by a program instructing hardware related to the terminal device, and the program can be stored in a computer-readable storage medium.
  • the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the above-mentioned computer-readable storage medium may be used to execute the program code of the facial expression recognition method.
  • the foregoing storage medium may be located on at least one of the multiple network devices in the network shown in the foregoing embodiment.
  • the computer-readable storage medium is configured to store program code for executing the following steps:
  • obtain the recognition request of the terminal, where the recognition request is used to request recognition of the expression type of the subject's face in the first image; extract the first feature from the color information of the pixels in the first image; extract the second feature of the facial key points from the first image; fuse the first feature and the second feature to obtain a fused feature; determine the first expression type of the subject's face in the first image through the fused feature; and, in response to the recognition request, return the recognized first expression type to the terminal.
  • the computer-readable storage medium is also configured to store program code for executing the following steps:
  • obtain a training set, where the training images in the training set are labeled with expression types and have the same color coding type as the first image;
  • take the training images in the training set as the input of the neural network model and train the neural network model to obtain an initial neural network model, where the initial neural network model is obtained by initializing the weights in the network layers of the neural network model, with the training images in the training set as the input and the expression types labeled on the training images as the expected output;
  • obtain the second expression type output by the initial neural network model when the test images in the test set are used as its input, where the test images in the test set are labeled with expression types and have the same color coding type as the first image;
  • when the matching accuracy between the second expression type output by the initial neural network model and the expression type labeled on the test images in the test set reaches the target threshold, use the initial neural network model as the trained neural network model;
  • when the matching accuracy is less than the target threshold, take the training images in the training set as the input of the initial neural network model and continue training the initial neural network model until the matching accuracy between the second expression type output by the initial neural network model and the expression type labeled on the test images in the test set reaches the target threshold.
  • the above computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.
  • if the integrated unit in the foregoing embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the foregoing computer-readable storage medium.
  • based on such understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for enabling one or more computer devices (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the disclosed client can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other division manners in actual implementation.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • in the embodiments of the present application, the electronic device extracts the first feature from the color information of the pixels in the first image, extracts the second feature of the facial key points from the first image, and determines the first expression type of the subject's face in the first image according to the fused feature obtained by fusing the first feature and the second feature. In this way, the accuracy of recognizing facial expression types is improved, thereby achieving the purpose of accurately recognizing facial expressions.
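The fusion itself can be realized in either of the two ways the description mentions, a weighted sum of the two features or a concatenation after a linear/non-linear mapping; the sketch below shows both, with the weights and dimensions chosen arbitrarily for illustration.

```python
import torch
import torch.nn as nn

def fuse_by_weighted_sum(rgb_feat, lmk_feat, w_rgb=0.4, w_lmk=0.6):
    # Weighted sum: both features must share the same dimension; the weights here
    # are illustrative (a larger landmark weight if that modality contributes more).
    return w_rgb * rgb_feat + w_lmk * lmk_feat

class ConcatFusion(nn.Module):
    # Concatenation after a linear/non-linear mapping of each modality.
    def __init__(self, rgb_dim, lmk_dim, out_dim=128):
        super().__init__()
        self.map_rgb = nn.Sequential(nn.Linear(rgb_dim, out_dim), nn.ReLU())
        self.map_lmk = nn.Sequential(nn.Linear(lmk_dim, out_dim), nn.ReLU())

    def forward(self, rgb_feat, lmk_feat):
        return torch.cat([self.map_rgb(rgb_feat), self.map_lmk(lmk_feat)], dim=-1)
```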

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

This application discloses a facial expression recognition method and apparatus, an electronic device, and a storage medium. The method includes: extracting a first feature from the color information of pixels in a first image; extracting a second feature of facial key points from the first image; fusing the first feature and the second feature to obtain a fused feature; and determining a first expression type of a subject's face in the first image through the fused feature.

Description

面部表情的识别方法、装置、电子设备及存储介质
相关申请的交叉引用
本申请基于申请号为201910478195.3、申请日为2019年06月03日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及人工智能领域,尤其涉及一种面部表情的识别方法、装置、电子设备及计算机可读存储介质。
背景技术
人工智能(Artificial Intelligence,AI)是计算机科学的一个综合技术,通过研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能技术是一门综合学科,涉及领域广泛,例如自然语言处理技术以及机器学习/深度学习等几大方向,随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
人的情感的产生是一个很复杂的心理过程,情感的表达也伴随多种表现方式,常被计算机学家用于研究的表达方式主要有三种:表情、语音、动作。在这三种情感表达方式中,表情所贡献的情感比例高达55%,随着人机交互技术的应用日益广泛,在人机交互、自动驾驶和医疗健康等领域中,人脸表情识别技术具有非常重要的意义。
将人工智能技术应用于人脸表情识别技术,在人脸表情识别技术中,可以识别出人脸表情。但是人脸表情识别的准确性比较低。
发明内容
本申请实施例提供了一种面部表情的识别方法、装置、电子设备及计算机可读存储介质,能够提高识别人脸表情类型的准确度。
本申请实施例提供了一种面部表情的识别方法,所述方法由电子设备执行,所述方法包括:
从第一图像中像素点的颜色信息中提取第一特征;
从所述第一图像中提取面部关键点的第二特征;
将所述第一特征和所述第二特征进行融合处理,得到融合特征;
通过所述融合特征确定所述第一图像中对象面部的第一表情类型。
本申请实施例提供了一种面部表情的识别装置,包括:识别单元,配置为从第一图像中像素点的颜色信息中提取第一特征;
从所述第一图像中提取面部关键点的第二特征;
将所述第一特征和所述第二特征进行融合处理,得到融合特征;
通过所述融合特征确定所述第一图像中对象面部的第一表情类型。
本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质包括存储的程序,程序运行时执行上述的面部表情的识别方法。
本申请实施例提供了一种电子设备,包括存储器以及处理器;其中,所述存储器用于存储计算机程序;所述处理器用于运行存储器中的计算机程序,通过计算机程序执行上述的面部表情的识别方法。
在本申请实施例中,根据第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征以及利用第一特征和第二特征确定第一图像中对象面部的第一表情类型,通过融合图像像素点特征和面部关键点特征,可以提高识别人脸表情类型的准确度,进而达到准确识别面部表情的技术效果。
附图说明
图1是本申请实施例的面部表情的识别方法的硬件环境的示意图;
图2A-2C是本申请实施例的面部表情的识别方法的流程图;
图3是本申请实施例的面部表情的识别方法的应用场景的示意图;
图4是本申请实施例的面部表情的识别方法的应用场景的示意图;
图5是本申请实施例的面部表情的识别方法的应用场景的示意图;
图6A是本申请实施例的神经网络模型的结构示意图;
图6B是本申请实施例的人脸表情识别框架的示意图;
图7是本申请实施例的面部关键点的示意图;
图8是本申请实施例的面部图网络结构的示意图;
图9是本申请实施例的面部表情的识别装置的示意图;
图10是本申请实施例的一种终端的结构框图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本申请保护的范围。
需要说明的是,本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这 些过程、方法、产品或设备固有的其它步骤或单元。
首先,在对本申请实施例进行描述的过程中出现的部分名词或者术语适用于如下解释:
1)RGB色彩模式,工业界的一种颜色标准,是通过对红(R)、绿(G)、蓝(B)三个颜色通道的变化以及它们相互之间的叠加来得到各式各样的颜色的,RGB即是代表红、绿、蓝三个通道的颜色。
2)YUV,一种颜色编码方法,适用于各个视频处理组件中,YUV在对照片或视频编码时,考虑到人类的感知能力,允许降低色度的带宽。“Y”表示明亮度(Luminance,Luma),“U”和“V”则表示色度、浓度(Chrominance,Chroma)。
3)印刷四色模式CMYK,彩色印刷时采用的一种套色模式,利用色料的三原色混色原理,加上黑色油墨,共计四种颜色混合叠加,形成“全彩印刷”,四种标准颜色分别是:C:Cyan=青色,又称为“天蓝色”或“湛蓝”;M:Magenta=品红色,又称为“洋红色”;Y:Yellow=黄色;K:blacK=黑色。
在本申请实施例的实施过程中发现,由于以下原因的存在,从而影响到人脸表情识别的准确性:其一是不同的人表情变化不同,人脸表情会根据不同的人的表现方式的区别而产生差异性;其二是同一个人的表情在现实生活中具有实时性,同一人在不同的时间段、不同的场景等条件下产生的表情也不同;其三是受外界的条件的影响,如背景、光照、角度、距离等对表情识别影响较大。
为了解决上述问题,本申请实施例提供一种面部表情的识别方法的实施例。
下面说明本申请实施例提供的用于面部表情识别的电子设备的示例性应用。本申请实施例提供的用于面部表情识别的电子设备可以是各种类型 的终端设备或服务器,以服务器为例,例如可以是部署在云端的服务器集群,向用户开放云服务,其中封装有面部表情识别的程序。用户在开放的云服务中输入第一图像后,部署在云端的服务器调用封装的面部表情识别的程序,从第一图像中准确地识别出面部图像的表情(本申请实施例并不局限于人的脸部表情,也可以是动物、卡通人物等的脸部表情),并将识别出的脸部图像的表情应用于人机交互、自动驾驶、公共安全监控和医疗健康等领域中,以优化人们的生活品质,例如,在人机交互领域中,机器在识别出人的面部图像的表情后,可以根据该表情进行相应的应答,实现人与机器的无障碍沟通。
为了便于理解本申请实施例提供的技术方案,下面结合用于面部表情识别的电子设备,对本申请实施例提供的面部表情的识别方法的应用场景进行介绍。例如,上述面部表情的识别方法可以应用于如图1所示的由终端101和服务器103所构成的硬件环境中。如图1所示,服务器103通过网络与终端101进行连接,可用于为终端或终端上安装的客户端提供服务(如游戏服务、应用服务等),可在服务器上或独立于服务器设置数据库105,用于为服务器103提供数据存储服务,上述网络包括但不限于:广域网、城域网或局域网,终端101并不限定于PC、手机、平板电脑等。本申请实施例的面部表情的识别方法可以由服务器103来执行,也可以由终端101来执行,还可以是由服务器103和终端101共同执行,即用于面部表情识别的电子设备可以为终端101或者服务器103。其中,终端101执行本申请实施例的面部表情的识别方法也可以是由安装在其上的客户端来执行。
在一些实施例中,用于面部表情识别的电子设备执行本申请实施例提供的面部表情的识别方法,作为终端101(用于面部表情识别的电子设备)识别面部表情的应用场景示例。终端101本地执行本申请实施例提供的面部表情的识别方法,来完成识别第一图像的面部表情,例如,在终端101 上安装表情识别应用(Application,APP),用户在表情识别APP中输入第一图像后,终端101通过神经网络模型从第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征,根据第一特征和第二特征的融合特征确定第一图像中对象面部的表情类型,并将第一图像的表情类型显示在终端101的显示界面上。
在一些实施例中,终端101也可以通过网络向云端的服务器103发送用户在终端101上输入的第一图像,并调用服务器103提供的面部表情识别功能(封装的面部表情识别的程序),服务器103通过本申请实施例提供的面部表情的识别方法,识别第一图像的面部表情,例如,在终端101上安装表情识别应用,用户在表情识别应用中,输入第一图像,终端101通过网络向服务器103发送该第一图像,服务器103接收到该第一图像后,调用封装的面部表情识别的程序,通过神经网络模型从第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征,根据第一特征和第二特征的融合特征确定第一图像中对象面部的表情类型,并将第一图像的表情类型反馈至终端101的表情识别应用中,或者,服务器103直接给出第一图像的表情类型。
图2A是根据本申请实施例的一种面部表情的识别方法的流程图。举例来说,是以服务器为执行主体为例进行描述。如图2A所示,该方法可以包括以下步骤(其中,步骤S202和步骤S206为适应性的选用步骤):
步骤S202,服务器获取终端的识别请求,识别请求用于请求识别第一图像中对象面部的表情类型。
此处的对象为具备表情展现的对象,如人类、猩猩等,为了描述的统一,后续统一以人类为例进行描述。
人脸表情识别在人机交互、自动驾驶和医疗健康等领域都得到了越来越多的发展和应用,为了实现人脸表情识,用于实现人机交互、自动驾驶、 医疗健康检测的终端可以采集目标对象(如用户、驾驶员、路人、病人等)的第一图像,并发起识别表情类型的识别请求,此处的表情类型可以为生气、悲伤、厌恶、害怕、吃惊、高兴、正常等表情。
步骤S204,服务器从第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征,将第一特征和第二特征进行融合处理,得到融合特征,通过融合特征确定第一图像中对象面部的第一表情类型。
其中,本申请实施例并不局限于神经网络模型,其他机器学习模型也适用于本申请实施例。
上述第一图像中像素点的颜色的编码可以为RGB、YUV、CMYK等颜色编码模式中的一种,为了描述统一,后续以RGB为例进行说明,其余颜色编码模式与此类似,不再赘述。
上述第一特征为提取的与表情相关的纹理特征,第二特征为面部组成部分(如人脸五官中的至少之一)、面部轮廓的特征,面部关键点即为描述面部组成部分和/或面部轮廓的特征点。
例如,通过调用神经网络模型的学习,可以学习到不同对象和相同对象在面部表情表达时的共性,通过第一特征(可以准确表示对象的面部纹理)和第二特征(可用来表示面部各个部位的联动、轮廓的变化等)可以利用事先学习到的第一特征和第二特征与面部表情分类之间的关系,准确识别出当前对象的面部表情,同时,即使存在光照变化、遮挡等不理因素,也可以通过第二特征与第一特征的融合避免单独使用第一特征(前述不利因素会导致第一特征的提取不准确)造成的识别不正确。
其中,为了根据第一特征和第二特征的融合特征,识别第一图像的表情类型,可以将第一特征和第二特征进行融合处理,得到融合特征,并通过融合特征确定第一图像中对象面部的第一表情类型。其中,融合过程可 以是基于第一特征和第二特征的权重,对第一特征和第二特征进行加权求和,并将加权求和的结果作为融合特征,以实现第一特征和第二特征的特征融合;或者,将第一特征和第二特征进行线性/非线性映射,对线性/非线性映射后的第一特征和第二特征进行拼接,以实现第一特征和第二特征的特征融合。
通过上述步骤S204,利用神经网络模型从第一图像中识别出第一表情类型,神经网络模型用于根据第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征以及利用第一特征和第二特征确定第一图像中对象面部的第一表情类型,通过融合图像特征和面部关键点特征,考虑了更多的特征以及特征之间的关联,可以提高识别人脸表情类型的准确度,进而达到准确识别面部表情的技术效果。
步骤S206,响应于识别请求,服务器向终端返回识别出的第一表情类型。
通过上述步骤S202至步骤S206,已说明服务器通过神经网络模型可以准确地从第一图像中识别出面部表情,并将面部表情返回至终端的方案。下面对神经网络模型的训练进行说明,具体如下:
本申请实施例提供了一种基于编码图像(如RGB图像)和面部关键点Landmark(如人脸关键点)的多模态人脸表情识别方案,如图6A所示,该方案中的神经网络模型包括依次连接的用于进行图像的第一特征(纹理特征)提取的卷积神经网络(Convolution Neural Networks、CNN)、用于进行面部关键点的第二特征提取的图神经网络(Graph Neural Networks、GNN)、融合层和分类网络(可包括全连接层和分类层),该方案利用卷积神经网络对RGB图像进行建模学习,利用图神经网络对人脸关键点进行建模学习,并通过融合层融合两个模态(RGB图像和人脸关键点)的特征,以得到融合特征,并通过分类网络根据融合特征进行表情识别,得到人脸表情,该 方案通过对RGB图像和人脸关键点之间的相关性和互补性进行建模,可以实现更加鲁棒的人脸表情识别,通过图神经网络可更为灵活而高效的刻画人脸关键点之间的关联,能够提取判别能力更强的人脸关键点特征。其中,本申请实施例并不局限于卷积神经网络和图神经网络,也可以采用其他模型以实现RGB图像和人脸关键点的特征提取。下面结合图2A所示的步骤详述本申请的技术方案。
参见图2B,图2B示出图2A在执行步骤S202之前,可以按照如下方式预先训练好神经网络模型,即将训练集输入到神经网络模型,神经网络模型输出预计结果,由于神经网络模型的预计结果与实际结果有误差,则计算预计结果与实际结果之间的误差,并将该误差在神经网络模型中进行反向传播,以调整神经网络模型中所有层的参数的值,该所有层包括卷积神经网络、图神经网络、融合层和分类网络;不断迭代上述过程,直至收敛,以完成神经网络模型的训练:
步骤S11,获取训练集,其中,训练集中的训练图像标识有表情类型且训练图像的颜色编码类型与第一图像相同。
例如,可以预先获取一个数据集(如AffectNet人脸表情数据集),将该数据集中的图像划分为训练集和测试集,所划分的方式可以为随机划分,以便于训练集和测试集中的图像的特征保持相同或者基本相同的分布,在图片所占比例上,一般训练集的图片数量大于测试集的图片,例如训练集中图片占了数据集的80%,测试集占了其中20%。
步骤S12,将训练集中的训练图像作为神经网络模型的输入,对神经网络模型进行训练得到初始神经网络模型,初始神经网络模型是以训练集中的训练图像为输入,并以训练图像标识的表情类型为预计输出时,初始化神经网络模型的网络层中的权重后得到的。
在神经网络模型中,每个神经元有输入连接和输出连接,这些连接模 拟了大脑中突触的行为,与大脑中突触传递信号的方式类似,信号从一个神经元传递到另一个神经元,每一个连接都有权重,即发送到每个连接的值要乘以这个权重,权重实际上相当于生物神经元之间传递的神经递质的数量,如果某个连接重要,那么它将具有比那些不重要的连接更大的权重值。而训练过程就是赋予这些权重的过程。该技术方案中,可以采用监督学习实现,训练集包括输入(如图像的RGB编码和采用图数据结构的面部图)和期望的输出(即面部表情类型),通过这种方式,网络可以检查它的计算结果和期望输出的差异,并据此采取适当的处理。
训练集中的每个训练图像包括输入值和期望的输出,一旦网络计算出其中一个输入的输出(初始时可随机赋予权重数值),根据误差函数便可计算出对应的误差,这个误差表明模型的实际输出与期望的输出有多接近。此处使用的误差函数是均方误差函数,如公式(1)所示:
C(w, b) = (1/2n) ∑_x ‖y(x) − a‖²        (1)
其中,x表示训练集中的输入,y(x)表示神经网络模型产生的输出,a表示期望的输出,可以看到这个均方误差函数是关于w和b的函数,w表示权重,b表示偏差(biases),在每次得到输出后,对应的误差被返回神经网络模型,并且相应地调整权重,从而使得神经网络模型通过该算法完成一次对所有权重的调整,循环往复,直至训练的图像量达到一定的值。
步骤S13,获取以测试集中的测试图像为初始神经网络模型的输入时,初始神经网络模型输出的第二表情类型,测试集中的测试图像标识有表情类型且颜色编码类型与第一图像相同。
步骤S14,当初始神经网络模型输出的第二表情类型、与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值时,将初始神经网络模型作为训练好的神经网络模型。
上述的匹配正确率是通过计算初始神经网络模型对多个测试图像的输 出得到的,如对100张测试图像,能够正确识别其中的95张,则匹配正确率为95%,若目标阈值是98%,由于实际正确匹配率小于目标阈值,说明该模型欠拟合,那么还需继续对初始神经网络模型进行训练,若能够正确识别100张测试图像中的99张,那么说明模型已经比较成熟,可以投入实际应用中了。
步骤S15,当初始神经网络模型输出的第二表情类型、与测试集中的测试图像标识的表情类型之间的匹配正确率小于目标阈值时,将训练集中的训练图像作为初始神经网络模型的输入,继续对初始神经网络模型进行训练,直至初始神经网络模型输出的第二表情类型、与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值。
在使用上述方法训练好神经网络模型之后,即可使用该模型进行面部表情类型的识别,在步骤S202提供的技术方案中,服务器获取终端的识别请求,识别请求用于指示识别第一图像中对象面部的表情类型。该识别请求可以直接为第一图像,或者携带有第一图像的指示信息(如第一图像的图像标志、存储地址等)的请求消息。
在步骤S204提供的技术方案中,服务器从第一图像中确定出第一图像中对象面部的第一表情类型。见图2C,图2C示出图2A中的步骤204包括步骤S2042-步骤S2044:
步骤S2042,在卷积神经网络中,利用第一图像中像素点的颜色信息提取用于表示第一图像中纹理的第一特征,并在图神经网络中,提取用于表示面部关键点之间关联的第二特征,其中,面部关键点用于表示对象面部的组成部分和/或面部轮廓。
例如,利用第一图像中像素点的颜色信息提取用于表示第一图像中纹理的第一特征包括:将第一图像中像素点的颜色编码数据(如第一图像的RGB编码数据)作为卷积神经网络的输入,卷积神经网络用于对第一图像 中像素点的颜色编码执行卷积操作,得到第一特征,如描述笑容时眼角上扬的特征;获取卷积神经网络输出的第一特征。
在一些实施例中,为了提高识别的准确率,可以对第一图像进行预处理,以使其分辨率、长宽、参考点符合要求,在将第一图像中像素点的颜色编码作为卷积神经网络的输入时,当第一图像中的参考点在第一图像中的位置、与图片模板中的参考点在图片模板中的位置不同时,执行对第一图像的裁剪操作和/或缩放操作,如执行以下操作:移动第一图像以使其参考点与模板的参考点在位置上重合,然后以参考点为原点进行缩放以使其分辨率与模板相同,再对其进行裁剪以使其长宽与模板相同,从而得到第二图像,第二图像中的参考点在第二图像中的位置与图片模板中的参考点在图片模板中的位置相同;将第二图像中像素点的颜色编码作为卷积神经网络的输入。
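A minimal sketch of the preprocessing described in the paragraph above (shifting the image so that its reference point coincides with the template's reference point, scaling, and cropping to the template size); the use of OpenCV, the single affine step, and the externally supplied scale factor are assumptions for illustration, not the embodiment's actual procedure.

```python
import cv2
import numpy as np

def align_to_template(image, ref_point, template_ref_point, template_size, scale=1.0):
    """Map `image` onto the template canvas so that `ref_point` lands on
    `template_ref_point`; `scale` would normally be derived from the face size
    (assumed to be given here). `template_size` is (width, height)."""
    tw, th = template_size
    # Affine transform: scale about the origin, then translate so that the
    # reference point ends up at the template's reference point
    M = np.float32([
        [scale, 0, template_ref_point[0] - scale * ref_point[0]],
        [0, scale, template_ref_point[1] - scale * ref_point[1]],
    ])
    # warpAffine crops/pads to the template width and height in one step
    return cv2.warpAffine(image, M, (tw, th))
```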
例如,在根据第一面部图确定第二面部图之前,可利用多张第三图像确定面部关键点、关键点之间的关联关系以及关键点之间的关联权重,其中,第三图像为标识有表情类型的图像,可以对多张第三图像(均为具有明显面部表情的图像)进行分析,以确定五官周围、面部轮廓上与表情直接最相关关键点(即受表情影响时有运动幅度或者能体现该表情的点),而具备关联关系的关键点是指在同一表情下能够产生联动的点,而关联权重是对两个关键点之间的关联程度(例如,可以根据对不同人群的研究后取经验值)进行处理(如归一化)后得到的;以面部关键点为节点、连接位于节点之间的用于表示面部关键点之间存在关联关系的边,并将存在关联关系的关键点之间的关联权重作为边的权重,以得到第一面部图。
在提取用于表示面部关键点之间关联的第二特征时,可根据第一面部图确定第二面部图,其中,第一面部图包括表示面部关键点的节点、位于节点之间的表示面部关键点之间存在关联关系的边以及边的关联权重,第 二面部图为在第一面部图中增加节点对应的面部关键点在第一图像中的位置之后得到的;对第二面部图进行特征提取得到第二特征。
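The face graph described above (key points as nodes, association edges with association weights, plus the per-image key-point positions) can be represented as in the following sketch; the listed edges, weights, and landmark indices are hypothetical placeholders, not values actually derived from the third images mentioned in the text.

```python
import numpy as np

def build_face_graph(num_landmarks, weighted_edges, landmark_xy):
    """Return (A, W, node_features) for one image.
    `weighted_edges`: iterable of (i, j, weight) pairs of associated key points (placeholders).
    `landmark_xy`: (num_landmarks, 2) array of key-point positions in this image."""
    A = np.zeros((num_landmarks, num_landmarks), dtype=np.float32)  # 0/1 adjacency
    W = np.zeros_like(A)                                            # association weights
    for i, j, w in weighted_edges:
        A[i, j] = A[j, i] = 1.0
        W[i, j] = W[j, i] = w
    node_features = np.asarray(landmark_xy, dtype=np.float32)       # positions as node attributes
    return A, W, node_features

# Hypothetical usage: a few eyebrow/eye key points linked with normalized weights
A, W, X = build_face_graph(
    num_landmarks=68,
    weighted_edges=[(17, 36, 0.8), (19, 37, 0.6), (21, 39, 0.9)],
    landmark_xy=np.random.rand(68, 2),
)
```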
步骤S2044,在分类网络中,通过预先的训练学习到了不同的第一特征、第二特征二者与不同的表情类型之间的对应关系,从多个表情类型中识别出与第一特征和第二特征对应的第一表情类型。
在一些实施例中,使用了基于RGB图像的人脸表情识别方案和基于人脸关键点的表情识别方案,其中基于RGB图像的表情识别方案主要是从人脸图像中提取与表情相关的特征(即第一特征)并进行分类,但由于RGB图像受光照变化和遮挡等因素的影响很大,仅仅依靠RGB图像数据的人脸表情识别系统的鲁棒性较差;基于人脸关键点的表情识别方案中,人脸关键点主要指人脸的五官和轮廓所在的点,这些点的位置信息与人脸表情密切相关,随着人脸关键点预测越来越准确,基于关键点的人脸表情识别也越来越准确,在基于人脸关键点的表情识别中可利用手工设计(hand-craft)的特征,并利用浅层模型进行分类,如利用支持向量机(Support Vector Machine,SVM)模型进行表情分类,由于人脸关键点具有丰富的结构信息,并且不同关键点之间具有密切的关联,所以采用该方案能够准确识别人脸面部表情,但是若采用手工设计的特征则无法灵活而有效地对不同关键点之间的丰富而复杂的关联进行建模,导致基于关键点的人脸表情识别性能较差。
考虑到基于RGB图像的表情识别可以获取更加丰富的人脸纹理信息,但其对光照变化等不具备很好的鲁棒性,而基于人脸关键点的表情识别对光照等变化更加鲁棒,但其丢失了大部分的纹理信息,融合RGB图像和人脸关键点对人脸表情识别很有帮助。本申请实施例提供了一种基于RGB图像和人脸关键点的多模态人脸表情识别方案,该方案利用RGB图像和人脸关键点的互补性,实现更加鲁棒的人脸表情识别,针对手工设计特征无法 高效刻画人脸关键点的关联,该方案利用图神经网络灵活而高效的对人脸关键点进行建模,图神经网络能够自适应的学习关键点之间的关联,显著提升基于关键点的人脸表情识别性能。
在步骤S206提供的技术方案中,响应于识别请求,服务器向终端返回识别出的第一表情类型。
例如,在向终端返回识别出的第一表情类型之后,可获取终端的反馈信息,反馈信息用于指示识别出的第一表情类型是否正确;在反馈信息指示识别出的第一表情类型不正确的情况下,使用与第一图像具备相同的图像特征的第四图像对神经网络模型进行训练,第四图像可以为与第一图像的面部表情类型相同的图像、或者背景类型与之相同的图像。采用该技术方案,相当于可以针对神经网络模型的识别薄弱环节进行针对性的提高。
人脸表情识别在人机交互、自动驾驶和医疗健康等领域都得到了越来越多的发展和应用,例如,本申请实施例可以用于辅助机器人识别人的情绪和心理,提升人机交互产品中的用户体验,如图3所示,如识别到人做出生气的表情时,机器人301可以通过讲笑话等缓解人的情绪,提升用户体验;本申请实施例也可以用于商场、银行等客户满意度分析,如图4所示,如通过银行服务窗口401中的监控器拍摄顾客在交易过程中人脸表情,并分析监控视频中的人脸表情判断顾客在银行中的交易满意度等;本申请实施例还可以用于动画表情模拟和生成,如识别真实人脸的表情并将其自然的迁移到动画形象上,如图5所示,当识别到人做出忧伤的表情时,动画形象501也将呈现相应的忧伤表情。下面结合实施方式详述本申请的技术方案。
本申请实施例提供了一种基于RGB图像和人脸关键点的多模态人脸表情识别系统,图6B所示为多模态人脸表情识别框架,可给定一张待识别的图像,首先进行人脸检测和人脸对齐,并提取人脸关键点信息;然后利用 卷积神经网络自适应的对RGB图像进行特征学习,利用图神经网络自适应的建模人脸关键点之间的关联并进行关键点特征学习,所得到的RGB特征和关键点特征融合起来用于最后的分类;整个识别系统可以实现端到端的训练和预测。
在人脸关键点特征学习中,可给定对齐后的人脸图像,模型先从图像中提取人脸关键点信息,如图7所示,例如关键点701-702为表示脸部轮廓的点。人脸关键点(如编号1-68所示的点)定位人脸面部的关键区域位置,如脸部轮廓、眉毛、眼睛、鼻子和嘴巴等;同一个人做不同表情时,人脸关键点位置通常不同,因而可以利用人脸关键点信息辅助人脸表情识别,人脸关键点之间通常存在复杂的关联,如做“惊讶”表情时,眉毛和眼睛附近的关键点位置通常会一起变化等。考虑到采用基于关键点信息的人脸表情识别时若使用手工设计的特征,如关键点位置信息的堆叠或不同关键点之间的距离等,这些手工设计的特征无法有效的建模关键点之间的关联,所得到的关键点特征判别能力较差,在大规模人脸表情识别数据集上的识别准确率很差。
为更好的建模和利用人脸关键点之间的关联,本申请实施例采用图神经网络对人脸关键点进行高效的建模,将人脸图像作为一个高度结构化的数据,根据人脸五官分布将人脸关键点组成一个图网络结构,如图8所示,图中每一个顶点表示一个人脸关键点,每一条边表示人脸关键点之间的关联,例如边801表示轮廓关键点701和轮廓关键点702之间的关联。令X landmark表示输入的人脸关键点信息,A表示图邻接矩阵,A ij=1表示第i个关键点和第j个关键点之间存在边,A ij=0表示第i个关键点和第j个关键点之间不存在边。基于图神经网络的人脸关键点特征学习表示如公式(2)所示:
Y_landmark = f_gcn(X_landmark, A)        (2)
其中,f gcn表示图神经网络,由于图邻接矩阵A是0-1矩阵,其只能表 示关键点之间有无关联,而无法衡量不同边的权重,不同关键点之间的相关关系的强弱互不相同,为更好的衡量不同关键点之间的相关关系,本申请实施例引入可学习参数W,基于图神经网络的人脸关键点特征学习表示如公式(3)所示:
Y_landmark = f_gcn(X_landmark, Â)        (3)
其中，Â为修正的带权重的邻接矩阵，并且权重W是自适应学习，Y_landmark表示从人脸关键点得到的特征。
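As one possible reading of formulas (2) and (3), the sketch below implements a single graph-convolution step in which a learnable weight matrix is masked by the fixed 0-1 adjacency A, so edge strengths are learned adaptively; the exact layer form, normalization, and dimensions are assumptions rather than the embodiment's specification.

```python
import torch
import torch.nn as nn

class LandmarkGCNLayer(nn.Module):
    """One graph-convolution step Y = f_gcn(X, A_hat): the 0-1 adjacency A acts as a
    mask and a per-edge weight W is learned (one plausible realization of formula (3))."""
    def __init__(self, num_landmarks, in_dim, out_dim):
        super().__init__()
        self.register_buffer("A", torch.zeros(num_landmarks, num_landmarks))
        self.edge_weight = nn.Parameter(torch.ones(num_landmarks, num_landmarks))
        self.linear = nn.Linear(in_dim, out_dim)

    def set_adjacency(self, adjacency):
        self.A.copy_(adjacency)             # fixed 0-1 structure from the face graph

    def forward(self, x):                   # x: (batch, num_landmarks, in_dim)
        a_hat = self.A * self.edge_weight   # learnable weights only where an edge exists
        a_hat = a_hat / a_hat.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize
        return torch.relu(self.linear(torch.matmul(a_hat, x)))
```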
在进行RGB图像特征学习时,RGB图像特征由RGB图像提取得到,RGB图像是经过人脸检测和对齐处理后的人脸图像直接得到的,由于卷积神经网络在图像特征学习和图像识别等领域取得显著的效果,本申请实施例采用卷积神经网络提取RGB图像特征,令X rgb表示图像的原始RGB输入,本申请实施例得到的RGB图像特征表示如公式(4)所示:
Y_rgb = f_cnn(X_rgb)        (4)
其中,f cnn为基于RGB图像的卷积神经网络,Y rgb表示学习到的RGB图像特征。
在通过融合层进行多模态特征融合时,RGB图像信息和人脸关键点信息相互补充,本方法将学习到的人脸关键点特征Y landmark和RGB图像特征Y rgb融合起来,得到整体特征Y如公式(5)所示:
Y = g(X_landmark, X_rgb)        (5)
其中,g表示特征融合,基于融合后的特征Y利用全连接网络进行表情分类。整个网络结构包括人脸关键点特征提取分支f gcn、RGB图像特征提取分支f cnn以及全连接分类网络可以实现端到端的训练,在网络训练过程中采取加权损失函数最小化,缓解人脸表情识别中严重的类别不均衡现象。
其中,对人脸关键点特征Y landmark和RGB图像特征Y rgb进行加权求和,将加权求和的结果作为融合后的特征Y,以实现特征融合,并通过全连接网络 对融合后的特征Y进行预测,以获得人脸图像的表情。其中,当人脸关键点特征相对于表情识别的贡献比较大时,相对于RGB图像特征Y rgb的权重,人脸关键点特征的权重比较大。通过全连接网络学习到的人脸关键点特征、RGB图像特征的融合后的特征与面部表情分类之间的关系,准确识别出当前对象的面部表情,同时,即使存在光照变化、遮挡等不理因素,也可以通过融合后的特征避免单独使用RGB图像特征(前述不利因素会导致RGB图像特征的提取不准确)造成的识别不正确。或者,将人脸关键点特征Y landmark和RGB图像特征Y rgb进行线性/非线性映射,并对线性/非线性映射后的人脸关键点特征Y landmark和RGB图像特征Y rgb进行拼接,将拼接后的结果作为融合后的特征Y,以实现特征融合,并通过全连接网络对融合后的特征Y进行预测,以获得人脸图像的表情,其中,线性/非线性映射为各种变形的计算方法,并不限定于一种计算方法。
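The weighted loss mentioned above for mitigating class imbalance is not spelled out in the text; one common choice, shown here purely as an assumption, is a cross-entropy loss weighted by inverse class frequency computed from the Table 1 training counts.

```python
import torch
import torch.nn as nn

# Training-set counts from Table 1 (anger, disgust, fear, happy, neutral, sad, surprise)
class_counts = torch.tensor([25382, 4303, 6878, 134915, 75374, 25959, 14590],
                            dtype=torch.float32)

# Inverse-frequency weights, normalized so they average to 1 (an assumed weighting scheme)
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)
# loss = criterion(logits, labels)   # used in place of the unweighted loss during training
```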
为验证本申请的方法的有效性,本方法采用AffectNet人脸表情数据集,其包含七类基本人脸表情:愤怒,厌恶,恐惧,高兴,自然,悲伤,惊讶等。其数据(包括训练集和验证集,验证集也称测试集)分布如下表1所示:
表1
  愤怒 厌恶 恐惧 高兴 自然 悲伤 惊讶
训练集 25382 4303 6878 134915 75374 25959 14590
验证集 500 500 500 500 500 500 500
其中，AffectNet数据集中7种基本表情的数据分布如表1所示。
表2
  愤怒 厌恶 恐惧 高兴 自然 悲伤 惊讶 平均
Landmark-Linear 11.4 28.4 9.6 67.6 10.2 35.6 39.3 28.9
Landmark-SVM 20.7 0.0 0.0 100.0 3.3 2.9 9.8 19.5
Landmark-GCN 46.1 47.2 47.4 80.3 47.5 43.0 47.5 51.3
其中，不同人脸关键点模型在七种表情下的识别准确率和平均识别准确率如表2所示。
表3
（表3：基于RGB图像特征的表情识别、基于图神经网络的人脸关键点特征的表情识别以及基于多模态融合的表情识别在AffectNet七种表情下的识别准确率和平均识别准确率；表中具体数值以图像形式给出，此处从略。）
由于采用了基于图神经网络的表情识别,为验证本申请提出的基于图神经网络的人脸关键点特征提取的有效性,表2给出了基于图神经网络的关键点特征的表情识别模型(Landmark-GCN)在AffectNet七种表情下的识别准确率,最后一列为平均识别准确率,表2中同时给出了基于手工设计特征的关键点人脸表情识别模型:线性分类模型(Landmark-Linear)和SVM分类模型(Landmark-SVM)的分类准确率,可以看出,表3中本方法提出的图神经网络所提取的人脸关键点特征具备很好的判别性,其识别效果显著优于基于手工设计特征的模型。
在采用基于多模特信息融合的人脸表情识别方案时,表3给出了基于RGB图像特征的表情识别,基于图神经网络的人脸关键点的表情识别和基于多模态融合的表情识别在AffectNet七种表情下的识别准确率和平均识别准确率。可以看出,本申请实施例提出的基于RGB图像和人脸关键点的多模态表情识别方法取得了最高的平均识别准确率。
本申请提出了一种多模态信息融合的人脸表情识别方法。该方法同时考虑了RGB图像和人脸关键点的互补信息,可以显著提升人脸表情识别的准确率。该申请适用于提升人机交互产品中的用户体验,辅助商场、银行等分析顾客的满意度以及辅助动画表情模拟和生成等。
本申请实施例依据人脸结构信息构建人脸关键点图网络结构,人脸关键点的个数和位置不限于图7所示,人脸关键点的图网络结构不限于图8所示,可以是任意个数的关键点和任意的图网络结构。本申请实施例采取 卷积神经网络和图神经网络分别对RGB图像和人脸关键点进行建模,不限定某一种卷积神经网络或图神经网络。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个计算机可读存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
本申请实施例提供了一种用于实施上述面部表情的识别方法的面部表情的识别装置。图9是本申请实施例的一种面部表情的识别装置的示意图,如图9所示,该装置可以包括:第一获取单元901、识别单元903以及响应单元905(其中,本申请实施例可对第一获取单元901和响应单元905进行适应的选用)。
第一获取单元901,配置为获取终端的识别请求,其中,识别请求用于请求识别第一图像中对象面部的表情类型。
识别单元903,配置为从第一图像中像素点的颜色信息中提取第一特征;从第一图像中提取面部关键点的第二特征;将第一特征和第二特征进行融合处理,得到融合特征,通过融合特征确定第一图像中对象面部的第 一表情类型。
响应单元905,配置为响应于识别请求,向终端返回识别出的第一表情类型。
需要说明的是,该实施例中的第一获取单元901可以配置为执行本申请实施例中的步骤S202,该实施例中的识别单元903可以配置为执行本申请实施例中的步骤S204,该实施例中的响应单元905可以配置为执行本申请实施例中的步骤S206。
此处需要说明的是,上述模块与对应的步骤所实现的示例和应用场景相同,但不限于上述实施例所公开的内容。需要说明的是,上述模块作为装置的一部分可以运行在如图1所示的硬件环境中,可以通过软件实现,也可以通过硬件实现。
通过上述模块,利用神经网络模型从第一图像中识别出第一表情类型,神经网络模型用于根据第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征以及利用第一特征和第二特征确定第一图像中对象面部的第一表情类型;通过融合图像特征和面部关键点特征,可以提高识别人脸表情类型的准确度,进而达到准确识别面部表情的技术效果。
在一些实施例中,识别单元可包括:处理模块,配置为通过卷积神经网络,从第一图像中像素点的颜色信息提取用于表示第一图像中纹理的第一特征,并通过图神经网络,提取用于表示面部关键点之间关联的第二特征,其中,面部关键点用于表示对象面部的组成部分和/或面部轮廓;通过融合层,对第一特征和第二特征进行特征融合处理,以得到融合特征;识别模块,配置为通过分类网络,从多个表情类型中识别出与融合特征对应的第一表情类型。
在一些实施例中,处理模块还可配置为:基于第一特征和第二特征的 权重,对第一特征和第二特征进行加权求和,并将加权求和的结果作为融合特征;或者,对第一特征和第二特征进行拼接处理,以得到融合特征。
在一些实施例中,处理模块还可配置为:将第一图像中像素点的颜色编码作为卷积神经网络的输入,其中,卷积神经网络用于对第一图像中像素点的颜色编码执行卷积操作,得到第一特征;获取卷积神经网络输出的第一特征。
在一些实施例中,处理模块在将第一图像中像素点的颜色编码作为卷积神经网络的输入时,还可配置为:当第一图像中的参考点在第一图像中的位置、与图片模板中的参考点在图片模板中的位置不同时,对第一图像进行裁剪操作和/或缩放操作,得到第二图像,以使第二图像中的参考点在第二图像中的位置、与图片模板中的参考点在图片模板中的位置相同;将第二图像中像素点的颜色编码作为卷积神经网络的输入。
在一些实施例中,处理模块还可配置为:在第一面部图中增加第一图像中节点对应的面部关键点的位置,以得到第二面部图,其中,第一面部图包括表示面部关键点的节点、位于节点之间的表示面部关键点之间存在关联关系的边以及边的关联权重;对第二面部图进行特征提取得到第二特征。
在一些实施例中,处理模块还可配置为根据多张第三图像确定面部关键点、关键点之间的关联关系以及关键点之间的关联权重,其中,第三图像为标识有表情类型的图像;以面部关键点为节点,连接位于节点之间的用于表示面部关键点之间存在关联关系的边,并将存在关联关系的关键点之间的关联权重作为边的权重,以得到第一面部图。
在一些实施例中,上述装置还可包括:第二获取单元,配置为获取训练集,其中,训练集中的训练图像标识有表情类型且颜色编码类型与第一图像相同;训练单元,配置为将训练集中的训练图像作为神经网络模型的 输入,对神经网络模型进行训练得到初始神经网络模型,其中,初始神经网络模型是以训练集中的训练图像为输入,并以训练图像标识的表情类型为预计输出时,初始化神经网络模型的网络层中的权重后得到的;第三获取单元,配置为获取以测试集中的测试图像为初始神经网络模型的输入时初始神经网络模型输出的第二表情类型,其中,测试集中的测试图像标识有表情类型且颜色编码类型与第一图像相同;确定单元,配置为配置为当初始神经网络模型输出的第二表情类型、与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值时,将初始神经网络模型作为训练好的神经网络模型;其中,训练单元还配置为当初始神经网络模型输出的第二表情类型、与测试集中的测试图像标识的表情类型之间的匹配正确率小于目标阈值时,将训练集中的训练图像作为初始神经网络模型的输入,继续对初始神经网络模型进行训练,直至初始神经网络模型输出的第二表情类型、与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值。
在一些实施例中,上述装置还可包括:反馈单元,配置为获取反馈信息,其中,反馈信息用于指示识别出的第一表情类型是否正确;当反馈信息指示识别出的第一表情类型不正确时,使用与第一图像具备相同的图像特征的第四图像对神经网络模型进行训练。
此处需要说明的是,上述模块与对应的步骤所实现的示例和应用场景相同,但不限于上述实施例所公开的内容。需要说明的是,上述模块作为装置的一部分可以运行在如图1所示的硬件环境中,可以通过软件实现,也可以通过硬件实现,其中,硬件环境包括网络环境。
本申请实施例提供了一种用于实施上述面部表情的识别方法的服务器或终端。
图10是本申请实施例的一种终端的结构框图,如图10所示,该终端 可以包括:一个或多个(图10中仅示出一个)处理器1001、存储器1003、以及传输装置1005,如图10所示,该终端还可以包括输入输出设备1007。
其中,存储器1003可用于存储软件程序以及模块,如本申请实施例中的面部表情的识别方法和装置对应的程序指令/模块,处理器1001通过运行存储在存储器1003内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的面部表情的识别方法。存储器1003可包括高速随机存储器,还可以包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器1003可包括相对于处理器1001远程设置的存储器,这些远程存储器可以通过网络连接至终端。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
上述的传输装置1005用于经由一个网络接收或者发送数据,还可以用于处理器与存储器之间的数据传输。上述的网络实例可包括有线网络及无线网络。在一个实例中,传输装置1005包括一个网络适配器(Network Interface Controller,NIC),其可通过网线与其他网络设备与路由器相连从而可与互联网或局域网进行通讯。在一个实例中,传输装置1005为射频(Radio Frequency,RF)模块,其用于通过无线方式与互联网进行通讯。
其中,存储器1003用于存储应用程序。
处理器1001可以通过传输装置1005调用存储器1003存储的应用程序,以执行下述步骤:
获取终端的识别请求,其中,识别请求用于请求识别第一图像中对象面部的表情类型;
从第一图像中像素点的颜色信息中提取第一特征;
从第一图像中提取面部关键点的第二特征;
将第一特征和第二特征进行融合处理,得到融合特征;
通过融合特征确定第一图像中对象面部的第一表情类型;
响应于识别请求,向终端返回识别出的第一表情类型。
处理器1001还用于执行下述步骤:
获取训练集,其中,训练集中的训练图像标识有表情类型且颜色编码类型与第一图像相同;
将训练集中的训练图像作为神经网络模型的输入,对神经网络模型进行训练得到初始神经网络模型,其中,初始神经网络模型是以训练集中的训练图像为输入并以训练图像标识的表情类型为预计输出时,初始化神经网络模型的网络层中的权重后得到的;
获取以测试集中的测试图像为初始神经网络模型的输入时初始神经网络模型输出的第二表情类型,其中,测试集中的测试图像标识有表情类型且颜色编码类型与第一图像相同;
当初始神经网络模型输出的第二表情类型与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值时,将初始神经网络模型作为训练好的神经网络模型;
当初始神经网络模型输出的第二表情类型与测试集中的测试图像标识的表情类型之间的匹配正确率小于目标阈值时,将训练集中的训练图像作为初始神经网络模型的输入,继续对初始神经网络模型进行训练,直至初始神经网络模型输出的第二表情类型与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值。
采用本申请实施例,获取终端的识别请求,识别请求用于请求识别第一图像中对象面部的表情类型;利用神经网络模型从第一图像中识别出第一表情类型,神经网络模型用于根据第一图像中像素点的颜色信息提取第一特征、从第一图像中提取面部关键点的第二特征以及利用第一特征和第二特征确定第一图像中对象面部的第一表情类型;响应于识别请求,向终 端返回识别出的第一表情类型,通过融合图像特征和面部关键点特征,可以提高识别人脸表情类型的准确度,进而达到准确识别面部表情的技术效果。
例如,本实施例中的示例可以参考上述实施例中所描述的示例,本实施例在此不再赘述。
本领域普通技术人员可以理解,图10所示的结构仅为示意,终端可以是智能手机(如Android手机、iOS手机等)、平板电脑、掌上电脑以及移动互联网设备(Mobile Internet Devices,MID)、PAD等终端设备。图10其并不对上述电子设备的结构造成限定。例如,终端还可包括比图10中所示更多或者更少的组件(如网络接口、显示装置等),或者具有与图10所示不同的配置。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令终端设备相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,计算机可读存储介质可以包括:闪存盘、只读存储器(Read-Only Memory,ROM)、随机存取器(Random Access Memory,RAM)、磁盘或光盘等。
本申请实施例还提供了一种计算机可读存储介质。例如,上述计算机可读存储介质可以用于执行面部表情的识别方法的程序代码。
例如,在本实施例中,上述存储介质可以位于上述实施例所示的网络中的多个网络设备中的至少一个网络设备上。
例如,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:
获取终端的识别请求,其中,识别请求用于请求识别第一图像中对象面部的表情类型;
从第一图像中像素点的颜色信息中提取第一特征;
从第一图像中提取面部关键点的第二特征;
将第一特征和第二特征进行融合处理,得到融合特征;
通过融合特征确定第一图像中对象面部的第一表情类型;
响应于识别请求,向终端返回识别出的第一表情类型。
例如,计算机可读存储介质还被设置为存储用于执行以下步骤的程序代码:
获取训练集,其中,训练集中的训练图像标识有表情类型且颜色编码类型与第一图像相同;
将训练集中的训练图像作为神经网络模型的输入,对神经网络模型进行训练得到初始神经网络模型,其中,初始神经网络模型是以训练集中的训练图像为输入并以训练图像标识的表情类型为预计输出时,初始化神经网络模型的网络层中的权重后得到的;
获取以测试集中的测试图像为初始神经网络模型的输入时初始神经网络模型输出的第二表情类型,其中,测试集中的测试图像标识有表情类型且颜色编码类型与第一图像相同;
在初始神经网络模型输出的第二表情类型与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值的情况下,将初始神经网络模型作为训练好的神经网络模型;
在初始神经网络模型输出的第二表情类型与测试集中的测试图像标识的表情类型之间的匹配正确率小于目标阈值的情况下,将训练集中的训练图像作为初始神经网络模型的输入,继续对初始神经网络模型进行训练,直至初始神经网络模型输出的第二表情类型与测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值。
例如,本申请实施例中的示例可以参考上述实施例中所描述的示例,本申请实施例在此不再赘述。
例如,在一些实施例中,上述计算机可读存储介质可以包括但不限于:U盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
上述实施例中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在上述计算机可读取的存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在存储介质中,包括若干指令用以使得一台或多台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。
在本申请的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的客户端,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
以上所述仅是本申请的实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。
工业实用性
本申请实施例中通过电子设备从第一图像中像素点的颜色信息中提取第一特征,从第一图像中提取面部关键点的第二特征,并根据第一特征和第二特征进行融合所得到的融合特征,确定出第一图像中对象面部的第一表情类型。如此,提高识别人脸表情类型的准确度,进而达到准确识别面部表情的目的。

Claims (16)

  1. 一种面部表情的识别方法,所述方法由电子设备执行,所述方法包括:
    从第一图像中像素点的颜色信息中提取第一特征;
    从所述第一图像中提取面部关键点的第二特征;
    将所述第一特征和所述第二特征进行融合处理,得到融合特征;
    通过所述融合特征确定所述第一图像中对象面部的第一表情类型。
  2. 根据权利要求1所述的方法,其中,
    所述从第一图像中像素点的颜色信息中提取第一特征,包括:
    通过卷积神经网络执行以下处理:
    从所述第一图像中像素点的颜色信息中提取用于表示所述第一图像中纹理的所述第一特征;
    所述从所述第一图像中提取面部关键点的第二特征,包括:
    通过图神经网络执行以下处理:
    提取用于表示所述面部关键点之间关联的所述第二特征,其中,所述面部关键点用于表示对象面部的组成部分和/或面部轮廓;
    所述将所述第一特征和所述第二特征进行融合处理,得到融合特征,包括:
    通过融合层执行以下处理:
    对所述第一特征和所述第二特征进行特征融合处理,以得到融合特征;
    所述通过所述融合特征确定所述第一图像中对象面部的第一表情类型,包括:
    通过分类网络执行以下处理:
    从多个表情类型中识别出与所述融合特征对应的所述第一表情类型。
  3. 根据权利要求2所述的方法,其中,所述从所述第一图像中像素点的颜色信息提取用于表示所述第一图像中纹理的所述第一特征,包括:
    将所述第一图像中像素点的颜色编码作为所述卷积神经网络的输入,其中,所述卷积神经网络用于对所述第一图像中像素点的颜色编码执行卷积操作,得到所述第一特征;
    获取所述卷积神经网络输出的所述第一特征。
  4. 根据权利要求3所述的方法,其中,所述将所述第一图像中像素点的颜色编码作为所述卷积神经网络的输入,包括:
    当所述第一图像中的参考点在所述第一图像中的位置、与图片模板中的参考点在所述图片模板中的位置不同时,对所述第一图像进行裁剪操作和/或缩放操作,得到第二图像,以使所述第二图像中的参考点在所述第二图像中的位置、与所述图片模板中的参考点在所述图片模板中的位置相同;
    将所述第二图像中像素点的颜色编码作为所述卷积神经网络的输入。
  5. 根据权利要求2所述的方法,其中,所述提取用于表示所述面部关键点之间关联的所述第二特征,包括:
    在第一面部图中增加所述第一图像中节点对应的所述面部关键点的位置,以得到第二面部图,其中,所述第一面部图包括表示所述面部关键点的节点、位于节点之间的表示所述面部关键点之间存在关联关系的边以及边的关联权重;
    对所述第二面部图进行特征提取得到所述第二特征。
  6. 根据权利要求5所述的方法,其中,所述在得到第二面部图之前,所述方法包括:
    根据多张第三图像确定所述面部关键点、所述关键点之间的关联关系以及所述关键点之间的关联权重,其中,所述第三图像为标识有表情类型的图像;
    以所述面部关键点为节点,连接位于节点之间的用于表示所述面部关键点之间存在关联关系的边,并将存在关联关系的所述关键点之间的关联 权重作为边的权重,以得到所述第一面部图。
  7. 根据权利要求1所述的方法,其中,所述将所述第一特征和所述第二特征进行融合处理,得到融合特征,包括:
    基于所述第一特征和所述第二特征的权重,对所述第一特征和所述第二特征进行加权求和,并将加权求和的结果作为所述融合特征;或者,
    对所述第一特征和所述第二特征进行拼接处理,以得到所述融合特征。
  8. 根据权利要求1所述的方法,其中,所述在确定所述第一图像中对象面部的第一表情类型之前,所述方法包括:
    获取训练集,其中,所述训练集中的训练图像标识有表情类型且颜色编码类型与所述第一图像相同;
    将所述训练集中的训练图像作为神经网络模型的输入,对所述神经网络模型进行训练得到初始神经网络模型,其中,所述初始神经网络模型是以所述训练集中的训练图像为输入,并以所述训练图像标识的表情类型为预计输出时,初始化所述神经网络模型的网络层中的权重后得到的;
    获取以测试集中的测试图像为所述初始神经网络模型的输入时,所述初始神经网络模型输出的第二表情类型,其中,所述测试集中的测试图像标识有表情类型且颜色编码类型与所述第一图像相同;
    当所述初始神经网络模型输出的第二表情类型、与所述测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值时,将所述初始神经网络模型作为所述训练好的神经网络模型;
    当所述初始神经网络模型输出的第二表情类型、与所述测试集中的测试图像标识的表情类型之间的匹配正确率小于所述目标阈值时,将所述训练集中的训练图像作为所述初始神经网络模型的输入,继续对所述初始神经网络模型进行训练,直至所述初始神经网络模型输出的第二表情类型、与所述测试集中的测试图像标识的表情类型之间的匹配正确率达到所述目 标阈值。
  9. 根据权利要求1至8中任意一项所述的方法,其中,所述方法还包括:
    向终端返回识别出的所述第一表情类型;
    获取所述终端的反馈信息,其中,所述反馈信息用于指示识别出的所述第一表情类型是否正确;
    当所述反馈信息指示识别出的所述第一表情类型不正确时,使用与所述第一图像具备相同的图像特征的第四图像对所述神经网络模型进行训练。
  10. 一种面部表情的识别装置,包括:
    识别单元,配置为从第一图像中像素点的颜色信息中提取第一特征;
    从所述第一图像中提取面部关键点的第二特征;
    将所述第一特征和所述第二特征进行融合处理,得到融合特征,通过所述融合特征确定所述第一图像中对象面部的第一表情类型。
  11. 根据权利要求10所述的装置,其中,
    所述识别单元包括:
    处理模块,配置为通过卷积神经网络,从所述第一图像中像素点的颜色信息提取用于表示所述第一图像中纹理的所述第一特征,并通过图神经网络,提取用于表示所述面部关键点之间关联的所述第二特征,其中,所述面部关键点用于表示对象面部的组成部分和/或面部轮廓;通过融合层,对所述第一特征和所述第二特征进行特征融合处理,以得到融合特征;
    识别模块,配置为通过分类网络,从多个表情类型中识别出与所述第一特征和所述第二特征对应的所述第一表情类型。
  12. 根据权利要求11所述的装置,其中,所述处理模块还配置为:
    将所述第一图像中像素点的颜色编码作为所述卷积神经网络的输入,其中,所述卷积神经网络用于对所述第一图像中像素点的颜色编码执行卷 积操作,得到所述第一特征;
    获取所述卷积神经网络输出的所述第一特征。
  13. 根据权利要求11所述的装置,其中,所述处理模块还配置为:
    在第一面部图中增加所述第一图像中节点对应的所述面部关键点的位置,以得到第二面部图,其中,所述第一面部图包括表示所述面部关键点的节点、位于节点之间的表示所述面部关键点之间存在关联关系的边以及边的关联权重;
    对所述第二面部图进行特征提取得到所述第二特征。
  14. 根据权利要求10所述的装置,其中,所述装置包括:
    第二获取单元,配置为获取训练集,其中,所述训练集中的训练图像标识有表情类型且颜色编码类型与所述第一图像相同;
    训练单元,配置为将所述训练集中的训练图像作为神经网络模型的输入,对所述神经网络模型进行训练得到初始神经网络模型,其中,所述初始神经网络模型是以所述训练集中的训练图像为输入,并以所述训练图像标识的表情类型为预计输出时,初始化所述神经网络模型的网络层中的权重后得到的;
    第三获取单元,配置为获取以测试集中的测试图像为所述初始神经网络模型的输入时所述初始神经网络模型输出的第二表情类型,其中,所述测试集中的测试图像标识有表情类型且颜色编码类型与所述第一图像相同;
    确定单元,配置为当所述初始神经网络模型输出的第二表情类型、与所述测试集中的测试图像标识的表情类型之间的匹配正确率达到目标阈值时,将所述初始神经网络模型作为所述神经网络模型;
    其中,所述训练单元还配置为当所述初始神经网络模型输出的第二表情类型、与所述测试集中的测试图像标识的表情类型之间的匹配正确率小 于所述目标阈值时,将所述训练集中的训练图像作为所述初始神经网络模型的输入,继续对所述初始神经网络模型进行训练,直至所述初始神经网络模型输出的第二表情类型、与所述测试集中的测试图像标识的表情类型之间的匹配正确率达到所述目标阈值。
  15. 一种计算机可读存储介质,所述计算机可读存储介质包括存储的程序,所述程序运行时执行上述权利要求1至9任一项中所述的方法。
  16. 一种电子设备,包括存储器以及处理器;
    其中,所述存储器用于存储计算机程序;
    所述处理器用于运行所述存储器中的计算机程序,通过所述计算机程序执行上述权利要求1至9任一项中所述的方法。
PCT/CN2020/092593 2019-06-03 2020-05-27 面部表情的识别方法、装置、电子设备及存储介质 WO2020244434A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/473,887 US20210406525A1 (en) 2019-06-03 2021-09-13 Facial expression recognition method and apparatus, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910478195.3 2019-06-03
CN201910478195.3A CN110263681B (zh) 2019-06-03 2019-06-03 面部表情的识别方法和装置、存储介质、电子装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/473,887 Continuation US20210406525A1 (en) 2019-06-03 2021-09-13 Facial expression recognition method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2020244434A1 true WO2020244434A1 (zh) 2020-12-10

Family

ID=67916517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092593 WO2020244434A1 (zh) 2019-06-03 2020-05-27 面部表情的识别方法、装置、电子设备及存储介质

Country Status (3)

Country Link
US (1) US20210406525A1 (zh)
CN (1) CN110263681B (zh)
WO (1) WO2020244434A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560730A (zh) * 2020-12-22 2021-03-26 电子科技大学中山学院 一种基于Dlib与人工神经网络的人脸表情识别方法
CN113239875A (zh) * 2021-06-01 2021-08-10 恒睿(重庆)人工智能技术研究院有限公司 人脸特征的获取方法、系统、装置及计算机可读存储介质
CN113688715A (zh) * 2021-08-18 2021-11-23 山东海量信息技术研究院 面部表情识别方法及系统
CN114612901A (zh) * 2022-03-15 2022-06-10 腾讯科技(深圳)有限公司 图像变化识别方法、装置、设备和存储介质
CN116934754A (zh) * 2023-09-18 2023-10-24 四川大学华西第二医院 基于图神经网络的肝脏影像识别方法及装置

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020214661A1 (en) * 2019-04-15 2020-10-22 Ohio State Innovation Foundation Material identification through image capture of raman scattering
CN110263681B (zh) * 2019-06-03 2021-07-27 腾讯科技(深圳)有限公司 面部表情的识别方法和装置、存储介质、电子装置
CN110660074B (zh) * 2019-10-10 2021-04-16 北京同创信通科技有限公司 一种建立废钢等级划分神经网络模型方法
CN110766061B (zh) * 2019-10-15 2022-05-31 武汉中海庭数据技术有限公司 一种道路场景匹配方法及装置
CN110827129B (zh) * 2019-11-27 2022-11-11 中国联合网络通信集团有限公司 一种商品推荐方法及装置
CN111242178A (zh) * 2020-01-02 2020-06-05 杭州睿琪软件有限公司 对象识别方法、装置及设备
CN111504300B (zh) * 2020-04-13 2022-04-08 五邑大学 感知机器人的服务质量评价方法、装置和存储介质
CN111709428B (zh) * 2020-05-29 2023-09-15 北京百度网讯科技有限公司 图像中关键点位置的识别方法、装置、电子设备及介质
CN113177472B (zh) * 2021-04-28 2024-03-29 北京百度网讯科技有限公司 动态手势识别方法、装置、设备以及存储介质
CN113255543B (zh) * 2021-06-02 2023-04-07 西安电子科技大学 基于图卷积网络的面部表情识别方法
CN113505716B (zh) * 2021-07-16 2022-07-01 重庆工商大学 静脉识别模型的训练方法、静脉图像的识别方法及装置
CN115187705B (zh) * 2022-09-13 2023-01-24 之江实验室 一种语音驱动人脸关键点序列生成方法及装置
CN116543445B (zh) * 2023-06-29 2023-09-26 新励成教育科技股份有限公司 一种演讲者面部表情分析方法、系统、设备及存储介质
CN117315745B (zh) * 2023-09-19 2024-05-28 中影年年(北京)科技有限公司 基于机器学习的面部表情捕捉方法及系统

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620669A (zh) * 2008-07-01 2010-01-06 邹采荣 一种人脸身份和表情的同步识别方法
US20170116467A1 (en) * 2015-03-18 2017-04-27 Adobe Systems Incorporated Facial Expression Capture for Character Animation
CN107169413A (zh) * 2017-04-12 2017-09-15 上海大学 一种基于特征块权重化的面部表情识别方法
CN108256469A (zh) * 2018-01-16 2018-07-06 华中师范大学 脸部表情识别方法及装置
CN109117795A (zh) * 2018-08-17 2019-01-01 西南大学 基于图结构的神经网络表情识别方法
CN109446980A (zh) * 2018-10-25 2019-03-08 华中师范大学 表情识别方法及装置
CN109711378A (zh) * 2019-01-02 2019-05-03 河北工业大学 人脸表情自动识别方法
CN109766840A (zh) * 2019-01-10 2019-05-17 腾讯科技(深圳)有限公司 人脸表情识别方法、装置、终端及存储介质
CN110263681A (zh) * 2019-06-03 2019-09-20 腾讯科技(深圳)有限公司 面部表情的识别方法和装置、存储介质、电子装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016084071A1 (en) * 2014-11-24 2016-06-02 Isityou Ltd. Systems and methods for recognition of faces e.g. from mobile-device-generated images of faces
US10339369B2 (en) * 2015-09-16 2019-07-02 Intel Corporation Facial expression recognition using relations determined by class-to-class comparisons
CN107657204A (zh) * 2016-07-25 2018-02-02 中国科学院声学研究所 深层网络模型的构建方法及人脸表情识别方法和系统
KR20180057096A (ko) * 2016-11-21 2018-05-30 삼성전자주식회사 표정 인식과 트레이닝을 수행하는 방법 및 장치
US10417483B2 (en) * 2017-01-25 2019-09-17 Imam Abdulrahman Bin Faisal University Facial expression recognition
CN107358169A (zh) * 2017-06-21 2017-11-17 厦门中控智慧信息技术有限公司 一种人脸表情识别方法及人脸表情识别装置
CN108038456B (zh) * 2017-12-19 2024-01-26 中科视拓(北京)科技有限公司 一种人脸识别系统中的防欺骗方法
CN108090460B (zh) * 2017-12-29 2021-06-08 天津科技大学 基于韦伯多方向描述子的人脸表情识别特征提取方法
CN109684911B (zh) * 2018-10-30 2021-05-11 百度在线网络技术(北京)有限公司 表情识别方法、装置、电子设备及存储介质
CN109815924B (zh) * 2019-01-29 2021-05-04 成都旷视金智科技有限公司 表情识别方法、装置及系统
CN110245573B (zh) * 2019-05-21 2023-05-26 平安科技(深圳)有限公司 一种基于人脸识别的签到方法、装置及终端设备

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620669A (zh) * 2008-07-01 2010-01-06 邹采荣 一种人脸身份和表情的同步识别方法
US20170116467A1 (en) * 2015-03-18 2017-04-27 Adobe Systems Incorporated Facial Expression Capture for Character Animation
CN107169413A (zh) * 2017-04-12 2017-09-15 上海大学 一种基于特征块权重化的面部表情识别方法
CN108256469A (zh) * 2018-01-16 2018-07-06 华中师范大学 脸部表情识别方法及装置
CN109117795A (zh) * 2018-08-17 2019-01-01 西南大学 基于图结构的神经网络表情识别方法
CN109446980A (zh) * 2018-10-25 2019-03-08 华中师范大学 表情识别方法及装置
CN109711378A (zh) * 2019-01-02 2019-05-03 河北工业大学 人脸表情自动识别方法
CN109766840A (zh) * 2019-01-10 2019-05-17 腾讯科技(深圳)有限公司 人脸表情识别方法、装置、终端及存储介质
CN110263681A (zh) * 2019-06-03 2019-09-20 腾讯科技(深圳)有限公司 面部表情的识别方法和装置、存储介质、电子装置

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560730A (zh) * 2020-12-22 2021-03-26 电子科技大学中山学院 一种基于Dlib与人工神经网络的人脸表情识别方法
CN113239875A (zh) * 2021-06-01 2021-08-10 恒睿(重庆)人工智能技术研究院有限公司 人脸特征的获取方法、系统、装置及计算机可读存储介质
CN113239875B (zh) * 2021-06-01 2023-10-17 恒睿(重庆)人工智能技术研究院有限公司 人脸特征的获取方法、系统、装置及计算机可读存储介质
CN113688715A (zh) * 2021-08-18 2021-11-23 山东海量信息技术研究院 面部表情识别方法及系统
CN114612901A (zh) * 2022-03-15 2022-06-10 腾讯科技(深圳)有限公司 图像变化识别方法、装置、设备和存储介质
CN116934754A (zh) * 2023-09-18 2023-10-24 四川大学华西第二医院 基于图神经网络的肝脏影像识别方法及装置
CN116934754B (zh) * 2023-09-18 2023-12-01 四川大学华西第二医院 基于图神经网络的肝脏影像识别方法及装置

Also Published As

Publication number Publication date
CN110263681A (zh) 2019-09-20
US20210406525A1 (en) 2021-12-30
CN110263681B (zh) 2021-07-27

Similar Documents

Publication Publication Date Title
WO2020244434A1 (zh) 面部表情的识别方法、装置、电子设备及存储介质
WO2018188453A1 (zh) 人脸区域的确定方法、存储介质、计算机设备
WO2019174439A1 (zh) 图像识别方法、装置、终端和存储介质
CN110991380B (zh) 人体属性识别方法、装置、电子设备以及存储介质
WO2021259005A1 (zh) 视频微表情识别方法、装置、计算机设备及存储介质
CN113221663B (zh) 一种实时手语智能识别方法、装置及系统
CN111401216A (zh) 图像处理、模型训练方法、装置、计算机设备和存储介质
CN110070484B (zh) 图像处理、图像美化方法、装置和存储介质
WO2022073282A1 (zh) 一种基于特征交互学习的动作识别方法及终端设备
WO2021184754A1 (zh) 视频对比方法、装置、计算机设备和存储介质
CN111832592A (zh) Rgbd显著性检测方法以及相关装置
EP3839768A1 (en) Mediating apparatus and method, and computer-readable recording medium thereof
US12039732B2 (en) Digital imaging and learning systems and methods for analyzing pixel data of a scalp region of a users scalp to generate one or more user-specific scalp classifications
CN110046574A (zh) 基于深度学习的安全帽佩戴识别方法及设备
US20200372639A1 (en) Method and system for identifying skin texture and skin lesion using artificial intelligence cloud-based platform
CN112395979A (zh) 基于图像的健康状态识别方法、装置、设备及存储介质
US20220067888A1 (en) Image processing method and apparatus, storage medium, and electronic device
CN110210344B (zh) 视频动作识别方法及装置、电子设备、存储介质
CN113191479A (zh) 联合学习的方法、系统、节点及存储介质
CN112862828A (zh) 一种语义分割方法、模型训练方法及装置
CN110163861A (zh) 图像处理方法、装置、存储介质和计算机设备
CN104794444A (zh) 一种即时视频中的表情识别方法和电子设备
CN113822790A (zh) 一种图像处理方法、装置、设备及计算机可读存储介质
CN111612090B (zh) 基于内容颜色交叉相关的图像情感分类方法
JPWO2020105146A1 (ja) 情報処理装置、制御方法、及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20818211

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20818211

Country of ref document: EP

Kind code of ref document: A1