CN112115860A - Face key point positioning method and device, computer equipment and storage medium - Google Patents

Face key point positioning method and device, computer equipment and storage medium

Info

Publication number
CN112115860A
Authority
CN
China
Prior art keywords
layer
feature map
face
network
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010985328.9A
Other languages
Chinese (zh)
Inventor
张少林
宁欣
段鹏飞
石园
孙琳钧
王镇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wave Kingdom Co ltd
Original Assignee
Shenzhen Wave Kingdom Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wave Kingdom Co ltd filed Critical Shenzhen Wave Kingdom Co ltd
Priority to CN202010985328.9A
Publication of CN112115860A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G06V 10/245 Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning

Abstract

The application relates to a method and a device for positioning key points of a human face, computer equipment and a storage medium. The method comprises the following steps: performing prediction operation on a face image to be processed through a coding sub-network of the trained face key point positioning model, outputting a first feature map, and inputting the first feature map and the face image to be processed into a decoding processing layer of a decoding sub-network in the face key point positioning model; carrying out deconvolution processing on the face image to be processed through a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, carrying out feature screening on the second feature map, splicing the screened feature map with the second feature map, outputting a third feature map, inputting the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by the last decoding processing layer of the decoding sub-network; and calculating to obtain the feature information of the key points of the human face according to the texture position map and the preset index. By adopting the method, the positioning efficiency of the key points of the human face can be improved.

Description

Face key point positioning method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image recognition, and in particular, to a method and an apparatus for locating key points of a human face, a computer device, and a storage medium.
Background
With the development of internet technology, face key point positioning technology has been widely applied in fields such as face recognition, face tracking, face pose and expression analysis, and face animation. The face key points may be facial organs such as the eyes, ears, nose, and so on. Positioning the face key points refers to determining the position information of the face key points in a face image to be processed. Conventionally, face key point positioning technologies based on deep learning have emerged; for example, a trained convolutional neural network model is used to position the key points of a face image, which improves the positioning accuracy of the face key points. However, in the conventional method, all features in the face image need to be extracted to perform face key point positioning, which makes feature processing time-consuming, resulting in low face key point positioning efficiency that fails to meet real-time requirements.
Disclosure of Invention
In view of the foregoing, there is a need to provide a method, an apparatus, a computer device and a storage medium for locating face key points, which can improve the efficiency of locating face key points.
A face key point positioning method, the method comprising:
acquiring a face image to be processed;
inputting the face image to be processed into a trained face key point positioning model, wherein the face key point positioning model comprises a coding sub-network and a decoding sub-network;
performing prediction operation on the face image to be processed through the coding sub-network, outputting a first feature map, and inputting the first feature map and the face image to be processed to a decoding processing layer of the decoding sub-network;
carrying out deconvolution processing on the face image to be processed through a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, carrying out feature screening on the second feature map, carrying out splicing processing on the screened feature map and the second feature map, outputting a third feature map, and inputting the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by a last decoding processing layer of the decoding subnetwork;
and calculating to obtain the feature information of the key points of the face corresponding to the face image to be processed according to the texture position image and a preset index.
In one embodiment, the first preset deconvolution layer includes an attention layer and a connection layer, the feature screening of the second feature map, the splicing of the screened features with the second feature map, and the outputting of the third feature map includes:
extracting the weight parameters corresponding to the channel features in the second feature map through the attention layer to generate an extraction result;
inputting the extraction result and the second feature map into a connecting layer, splicing the extraction result and the second feature map through the connecting layer, and outputting a third feature map.
In one embodiment, the performing, by the coding sub-network, a prediction operation on the image to be processed and outputting a first feature map includes:
performing prediction operation on the image to be processed through a coding processing layer of the coding subnetwork to obtain an output characteristic diagram;
performing fusion processing on the output characteristic diagram of a previous coding processing layer in the coding sub-network and the input characteristic diagram of the previous coding processing layer;
and inputting the fused feature map into the next coding processing layer until the last coding processing layer of the coding sub-network outputs the first feature map.
In one embodiment, the decoding processing layers include a fusion layer, and the inputting the third feature map and the first feature map into a next decoding processing layer until a last decoding processing layer of the decoding sub-network outputs a texture position map includes:
inputting the third feature map and the first feature map into a fusion layer, and performing fusion processing on the third feature map and the first feature map through the fusion layer to obtain a fusion feature map;
and taking the fused feature map as the input of the next decoding processing layer.
In one embodiment, before the deconvolving the face image to be processed by the first preset deconvolution layer in the decoding processing layers, the method further includes:
performing feature extraction on the face image to be processed through an input layer in the decoding processing layer;
inputting the extracted features into a first preset deconvolution layer, and performing feature deletion processing on the extracted features through the first preset deconvolution layer.
In one embodiment, before the acquiring the face image to be processed, the method further includes:
acquiring a sample face image set;
inputting the sample face image set into a preset face key point positioning model, and outputting a first feature map through a coding sub-network of the preset face key point positioning model;
inputting the first feature map and the sample face image set into a decoding sub-network of a preset face key point positioning model, and outputting a texture position map and a loss function value of the preset face key point positioning model;
and when the loss function value meets a preset condition, stopping training the preset face key point positioning model to obtain a trained face key point positioning model.
In one embodiment, the acquiring the sample face image set includes:
acquiring an original face image set;
acquiring position labels corresponding to the face images in the original face image set;
performing three-dimensional face reconstruction according to the position label, performing space rendering on the reconstructed face to obtain a sample texture position map, and taking the sample texture position map as a training label;
and carrying out normalization processing on the original face image set and the training labels to obtain a sample face image set.
A face keypoint locating apparatus, the apparatus comprising:
the communication module is used for acquiring a face image to be processed;
the first output module is used for inputting the face image to be processed into a trained face key point positioning model, and the face key point positioning model comprises a coding sub-network and a decoding sub-network; performing prediction operation on the face image to be processed through the coding sub-network, outputting a first feature map, and inputting the first feature map and the face image to be processed to a decoding processing layer of the decoding sub-network;
a second output module, configured to perform deconvolution processing on the face image to be processed by using a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, perform feature screening on the second feature map, perform splicing processing on the screened feature map and the second feature map, output a third feature map, and input the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by a last decoding processing layer of the decoding subnetwork;
and the computing module is used for computing the face key point characteristic information corresponding to the face image to be processed according to the texture position image and a preset index.
A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, the processor implementing the steps in the various method embodiments described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the respective method embodiment described above.
According to the face key point positioning method and apparatus, the computer device and the storage medium, a face image to be processed is acquired and input into a trained face key point positioning model that comprises a coding sub-network and a decoding sub-network. A prediction operation is performed on the face image to be processed by the coding sub-network, a first feature map is output, and the first feature map and the face image to be processed are input to a decoding processing layer of the decoding sub-network. Deconvolution processing is then performed on the face image to be processed by a first preset deconvolution layer in the decoding processing layer to obtain a second feature map, feature screening is performed on the second feature map, the screened feature map is spliced with the second feature map, a third feature map is output, and the third feature map and the first feature map are input to the next decoding processing layer until a texture position map is output by the last decoding processing layer of the decoding sub-network. The server then calculates the face key point feature information corresponding to the face image to be processed according to the texture position map and a preset index. By performing deconvolution processing and feature screening on the face image to be processed through the first preset deconvolution layer in the decoding sub-network, the effective features with larger weight parameters can be determined, so that attention is focused on the effective features and the time wasted on processing ineffective features is reduced. This improves the prediction efficiency of the face key point positioning model and thus the face key point positioning efficiency, so the method can be applied in scenarios with higher real-time requirements. Meanwhile, performing deconvolution processing and feature screening on the face image to be processed through the decoding sub-network generates richer features than the conventional approach, improving the accuracy of face key point positioning.
Drawings
FIG. 1 is a diagram of an exemplary embodiment of an application environment for a method for locating key points in a human face;
FIG. 2 is a schematic flow chart illustrating a method for locating key points of a human face according to an embodiment;
FIG. 3 is a flowchart illustrating the steps of performing feature screening on the second feature map, splicing the screened features with the second feature map, and outputting a third feature map in one embodiment;
FIG. 4 is a diagram of a first predetermined deconvolution layer in a decoding subnetwork of the face keypoint location model in an embodiment;
FIG. 5 is a diagram illustrating the decoding of residual blocks in a sub-network of the face keypoint localization model in one embodiment;
FIG. 6 is a diagram of a face keypoint localization model in an embodiment;
FIG. 7 is a flowchart illustrating the training steps of the face keypoint localization model in one embodiment;
FIG. 8 is a flowchart illustrating the steps of obtaining a sample face image set according to an embodiment;
FIG. 9 is a block diagram of an embodiment of a face keypoint locating apparatus;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for positioning the key points of the human face provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 and the server 104 communicate via a network. The face image to be processed may be obtained in various ways: the terminal 102 may send a face key point positioning request to the server 104, and the server 104 parses the face key point positioning request to obtain the face image to be processed; or the terminal sends a face key point positioning request to the server 104, the server 104 parses the request to obtain image data, and the face image to be processed is extracted from the image data. The server 104 inputs the face image to be processed into the trained face key point positioning model and parses the face key point positioning model to obtain a coding sub-network and a decoding sub-network. The server 104 performs a prediction operation on the face image to be processed through the coding sub-network, outputs a first feature map, and inputs the first feature map and the face image to be processed to a decoding processing layer of the decoding sub-network. The server 104 performs deconvolution processing on the first feature map through a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, performs feature screening on the second feature map, splices the screened feature map with the second feature map, outputs a third feature map, and inputs the third feature map and the first feature map to the next decoding processing layer until a texture position map is output by the last decoding processing layer of the decoding sub-network. The server 104 then calculates the feature information of the key points of the face corresponding to the face image to be processed according to the texture position map and a preset index. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In an embodiment, as shown in fig. 2, a method for locating face key points is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, obtaining a face image to be processed.
The face image to be processed is an image including a face target, and the face image to be processed is an image needing face key point positioning. The face key points may be facial organs of the face object, such as eyes, ears, nose, etc. The positioning of the face key points refers to determining the position information of the face key points in the face image to be processed.
In various application fields of the face key point positioning technology, such as face recognition, face tracking, face pose expression analysis, face animation and the like, a server can receive a face key point positioning request sent by a terminal. In one embodiment, the face keypoint location request may carry a face image to be processed. When the server receives the face key point positioning request, the face key point positioning request is analyzed, and a face image to be processed is obtained. The face image to be processed may be obtained by the terminal performing face detection on pre-stored image data. In one embodiment, the face key point positioning request may carry image data pre-stored in the terminal, and when the server receives the face key point positioning request, the server parses the face key point positioning request to obtain the image data. The server extracts a face image to be processed from the image data. When the face image cannot be extracted from the image data, it indicates that no face exists in the image data, and the image data may be deleted or ignored. Further, the server can call a face detection model, and the face detection model is used for carrying out face detection on the image data to determine a face area corresponding to the face target. The face region may be a detection frame corresponding to the face target. And the server extracts the image corresponding to the face area as the face image to be processed.
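As an illustrative, non-limiting sketch of the face extraction described above, the following Python example crops a face region from image data before key point positioning. The patent does not prescribe a particular face detection model; the off-the-shelf OpenCV Haar cascade used here, and the 128-pixel output size, are assumptions made only for illustration.

    # Hedged sketch: extract the face region from image data before key point positioning.
    # The detector below is only a stand-in for the "face detection model" mentioned above.
    import cv2

    def extract_face_image(image_path, output_size=128):
        image = cv2.imread(image_path)
        if image is None:
            return None                          # no usable image data
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None                          # no face in the image data; it may be ignored
        x, y, w, h = faces[0]                    # first detected face region (detection frame)
        face = image[y:y + h, x:x + w]
        return cv2.resize(face, (output_size, output_size))  # size expected by the positioning model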
Step 204, inputting the face image to be processed into the trained face key point positioning model, wherein the face key point positioning model comprises a coding sub-network and a decoding sub-network.
And step 206, performing prediction operation on the face image to be processed through the coding sub-network, outputting a first feature map, and inputting the first feature map and the face image to be processed into a decoding processing layer of the decoding sub-network.
The server is pre-configured with a trained face key point positioning model. The face key point positioning model is obtained by training on a sample image data set, where the sample image data set comprises a large number of face images marked with position labels. After the server acquires the face image to be processed, it calls the trained face key point positioning model. The face key point positioning model is used to process the face image to be processed, where the processing comprises an encoding stage and a decoding stage, so as to achieve face key point positioning. The server parses the face key point positioning model to obtain a coding sub-network and a decoding sub-network. The coding sub-network is the network used for the encoding-stage processing and the decoding sub-network is the network used for the decoding stage.
The coding sub-network may comprise a plurality of coding processing layers, specifically an input layer, a convolutional layer, a plurality of residual blocks, and the like. The decoding sub-network may include a plurality of decoding processing layers, specifically a plurality of preset deconvolution layers, Add layers, a plurality of common deconvolution layers, convolution layers, an output layer, and the like. The preset deconvolution layer is an efficient deconvolution layer, which offers higher feature processing efficiency than a common deconvolution layer. The common deconvolution layer may also be referred to as a transposed convolution layer.
The server inputs the face image to be processed into the trained face key point positioning model; the image to be processed serves as the input of the coding sub-network, a prediction operation is carried out on it through the plurality of coding processing layers in the coding sub-network, and the first feature map is output through the last coding processing layer of the coding sub-network. The first feature map is an image, output by the coding sub-network in the encoding stage, used for representing the positions of the key points of the human face. The first feature map and the face image to be processed are then input into a decoding processing layer of the decoding sub-network.
Step 208, performing deconvolution processing on the first feature map through a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, performing feature screening on the second feature map, performing splicing processing on the screened feature map and the second feature map, outputting a third feature map, and inputting the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by a last decoding processing layer of the decoding sub-network.
The server takes the first feature map output by the coding sub-network and the face image to be processed as the input of the decoding sub-network. The decoding sub-network may include a plurality of decoding processing layers, specifically a plurality of preset deconvolution layers, Add layers, a plurality of common deconvolution layers, a convolution layer, and the like. The preset deconvolution layer is an efficient deconvolution layer, which may comprise a common deconvolution layer, an attention layer and a connecting layer. Deconvolution processing is performed on the face image to be processed through the common deconvolution layer in the first preset deconvolution layer, and a second feature map is output. The feature map output by the common deconvolution layer in the first preset deconvolution layer is, on one hand, retained and, on the other hand, input into the attention layer; the retained feature map and the feature map input to the attention layer are the same, namely the second feature map. Feature screening is performed on the second feature map through the attention layer, the screened feature map is output, the screened feature map and the retained feature map output by the common deconvolution layer (that is, the second feature map) are input into the connecting layer, the screened feature map and the second feature map are spliced through the connecting layer, and a third feature map is output. The third feature map is an image, output by the first preset deconvolution layer of the decoding sub-network in the decoding stage, used for representing the positions of the key points of the human face.
The decoding sub-network inputs the third feature map output by the first preset deconvolution layer into the next decoding processing layer until the texture position map is output by the last decoding processing layer of the decoding sub-network. The texture position map refers to the face texture image corresponding to the face of the face target in the face image to be processed. The texture image, which may also be referred to as a UV position map, is an image of the three-dimensional face surface unfolded into a plane. UV is short for UV texture mapping coordinates, which define the position information of each point on the image; U and V are the coordinates of the image in the horizontal and vertical directions of the display respectively, with values generally between 0 and 1. Each point in the UV position map corresponds to the three-dimensional face model, so the position of the surface texture mapping can be determined; that is, each point in the UV position map can accurately correspond to a point on the surface of the face model, enabling three-dimensional face key point prediction.
And step 210, calculating to obtain the face key point characteristic information corresponding to the face image to be processed according to the texture position image and the preset index.
The server may obtain the preset index. The preset index is used for calculating the feature information of the key points of the face corresponding to the texture position image. The face key point feature information refers to position coordinates of key points of the face. And the server determines face key points of the face in the face image to be processed according to the preset index and the texture position image, and further acquires the position coordinates of each face key point in the face.
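As an illustrative sketch only, the calculation of face key point feature information from the texture position map and the preset index may look like the following. The format assumed for the preset index (one row/column pair per key point) is not specified by the patent and is a hypothetical choice made here for illustration.

    # Hedged sketch: read key point coordinates out of the texture position (UV) map
    # using a preset index; the index format is an assumption for illustration.
    import numpy as np

    def keypoints_from_uv_map(uv_position_map, preset_index):
        # uv_position_map: H x W x 3 array, each pixel stores an (x, y, z) position
        # preset_index: N x 2 integer array of (row, col) locations of the N key points
        rows, cols = preset_index[:, 0], preset_index[:, 1]
        return uv_position_map[rows, cols, :]    # N x 3 key point position coordinates

    uv_map = np.random.rand(256, 256, 3).astype(np.float32)      # stand-in texture position map
    index = np.array([[100, 120], [130, 125], [160, 128]])       # hypothetical key point entries
    print(keypoints_from_uv_map(uv_map, index).shape)            # (3, 3)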
In this embodiment, the server acquires a face image to be processed and inputs it into a trained face key point positioning model comprising a coding sub-network and a decoding sub-network. A prediction operation is performed on the face image to be processed by the coding sub-network, a first feature map is output, and the first feature map and the face image to be processed are input to a decoding processing layer of the decoding sub-network. Deconvolution processing is then performed on the face image to be processed through a first preset deconvolution layer in the decoding processing layer to obtain a second feature map, feature screening is performed on the second feature map, the screened feature map is spliced with the second feature map, a third feature map is output, and the third feature map and the first feature map are input to the next decoding processing layer until a texture position map is output by the last decoding processing layer of the decoding sub-network. The server then calculates the feature information of the key points of the face corresponding to the face image to be processed according to the texture position map and a preset index. By performing deconvolution processing and feature screening on the face image to be processed through the first preset deconvolution layer in the decoding sub-network, the effective features with larger weight parameters can be determined, so that attention is focused on the effective features and the time wasted on processing ineffective features is reduced. This improves the prediction efficiency of the face key point positioning model and thus the face key point positioning efficiency, so the method can be applied in scenarios with higher real-time requirements. Meanwhile, performing deconvolution processing and feature screening on the face image to be processed through the decoding sub-network generates richer features than the conventional approach, improving the accuracy of face key point positioning.
In one embodiment, before performing deconvolution processing on the face image to be processed by a first preset deconvolution layer in the decoding processing layer, the method further includes: extracting the features of the face image to be processed through an input layer in the decoding processing layer; and inputting the extracted features into a first preset deconvolution layer, and performing feature deletion processing on the extracted features through the first preset deconvolution layer.
The first preset deconvolution layer is an efficient deconvolution layer that can perform a feature deletion operation. The server inputs the first feature map and the face image to be processed into a decoding processing layer of the decoding sub-network; the decoding processing layer comprises an input layer, feature extraction is performed on the face image to be processed through the input layer, and a feature extraction result is generated. The feature extraction result is input into the first preset deconvolution layer in the decoding processing layer, the first preset deconvolution layer selects the first half of the feature channels in the feature extraction result through the feature deletion operation, and the selected features are input into the subsequent common deconvolution layer for deconvolution processing, which roughly halves the time consumed by the common deconvolution layer and accelerates the network forward pass.
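A minimal sketch of the feature deletion operation described above, assuming a PyTorch tensor layout of (batch, channels, height, width), is given below; keeping only the first half of the feature channels roughly halves the cost of the subsequent common deconvolution layer.

    # Hedged sketch: feature deletion keeps the first half of the feature channels.
    import torch

    def feature_delete(x: torch.Tensor) -> torch.Tensor:
        half = x.shape[1] // 2
        return x[:, :half, :, :]                 # first half of the feature channels

    x = torch.randn(1, 64, 16, 16)
    print(feature_delete(x).shape)               # torch.Size([1, 32, 16, 16])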
In one embodiment, the decoding process layers include a fusion layer, and inputting the third feature map and the first feature map into a next decoding process layer until a last decoding process layer of the decoding sub-network outputs the texture position map includes: inputting the third feature map and the first feature map into the fusion layer, and performing fusion processing on the third feature map and the first feature map through the fusion layer to obtain a fusion feature map; and taking the fused feature map as the input of the next decoding processing layer.
The server may retain the first feature map output by the coding sub-network. After the first preset deconvolution layer outputs the third feature map, the third feature map output by the first preset deconvolution layer and the first feature map output by the coding sub-network are fused through a fusion layer. For example, the fusion layer may be an Add layer, and the fusion process may be feature map addition. The server then takes the fused feature map resulting from the fusion process as the input of the next decoding processing layer.
In this embodiment, feature enhancement is realized by fusing the third feature map output by the first preset deconvolution layer with the first feature map output by the coding subnetwork, so that more information-rich features are obtained, and the accuracy of face key point positioning is effectively improved.
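A minimal sketch of the fusion (Add) step follows; it assumes the third feature map and the first feature map have the same shape, which the patent implies but does not state explicitly.

    # Hedged sketch: the fusion layer adds the two feature maps element-wise.
    import torch

    def fuse(third_feature_map: torch.Tensor, first_feature_map: torch.Tensor) -> torch.Tensor:
        return third_feature_map + first_feature_map   # feature map addition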
In one embodiment, as shown in fig. 3, the first predetermined deconvolution layer includes an attention layer and a connection layer, the second feature map is subjected to feature screening, the screened features are spliced with the second feature map, and the step of outputting the third feature map includes:
and step 302, extracting the weight parameters corresponding to the channel features in the second feature map through the attention layer, and generating an extraction result.
And step 304, inputting the extraction result and the second feature map into a connecting layer, splicing the extraction result and the second feature map through the connecting layer, and outputting a third feature map.
The first preset deconvolution layer is an efficient deconvolution layer, which may include a common deconvolution layer, an attention layer, and a connection layer. For example, the attention layer may be an SE (Squeeze-and-Excitation) layer. The server performs deconvolution processing on the input of the first preset deconvolution layer through its common deconvolution layer and outputs a second feature map. The feature map output by the common deconvolution layer is, on one hand, retained and, on the other hand, input to the attention layer; the retained feature map and the feature map input to the attention layer are the same, namely the second feature map. The attention layer extracts the weight parameters corresponding to the channel features in the second feature map by means of an attention mechanism, increases the weight parameters corresponding to useful features, and generates an extraction result according to the weight parameters corresponding to the channel features. The extraction result and the second feature map are input into the connecting layer, the extraction result and the second feature map are spliced by channel or by dimension through the connecting layer, and a third feature map is output through the connecting layer. For example, the connection layer may be a Concat connection layer. The decoding sub-network then inputs the third feature map output by the connecting layer of the first preset deconvolution layer into the next decoding processing layer until the texture position map is output by the last decoding processing layer of the decoding sub-network.
In this embodiment, the attention layer extracts the weight parameter corresponding to each channel feature in the second feature map and generates an extraction result. The extraction result and the second feature map are input into the connecting layer, the extraction result and the second feature map are spliced through the connecting layer, and a third feature map is output. Each channel feature can be assigned a learnable parameter, namely a weight parameter, so that channel features are weighted by importance: the larger the weight parameter, the more important the feature. Effective features can therefore be determined and focused on, reducing the time wasted on processing ineffective features, which further improves the prediction efficiency of the face key point positioning model and thus effectively improves the face key point positioning efficiency.
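As a hedged sketch of the attention layer and connection layer described above, the following uses a standard Squeeze-and-Excitation block as the attention layer and a channel-wise concatenation as the connection layer; the reduction ratio of 4 is an assumption, since the patent does not give one.

    # Hedged sketch: SE attention assigns a learnable weight to each channel feature,
    # and the connection layer splices the reweighted map with the second feature map.
    import torch
    import torch.nn as nn

    class SEAttention(nn.Module):
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling
            self.fc = nn.Sequential(                         # excitation: learn channel weights
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            n, c, _, _ = x.shape
            weights = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
            return x * weights                               # extraction result: reweighted channels

    second_feature_map = torch.randn(1, 32, 16, 16)
    extraction_result = SEAttention(32)(second_feature_map)
    third_feature_map = torch.cat([extraction_result, second_feature_map], dim=1)   # connection layer
    print(third_feature_map.shape)                           # torch.Size([1, 64, 16, 16])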
In one embodiment, as shown in fig. 4, the first preset deconvolution layer of the decoding sub-network may include a common deconvolution layer, a BN layer, and a Relu layer connected in sequence, the Relu layer being connected to the attention layer and the connection layer respectively, and the attention layer being connected to the connection layer. The first preset deconvolution layer may include a feature deletion (selection) operation. The face image to be processed is input into a decoding processing layer of the decoding sub-network, and feature extraction is performed on the face image to be processed through the input layer of the decoding processing layer to obtain a feature extraction result. When the feature extraction result is input to the first preset deconvolution layer in the decoding processing layer, feature deletion is performed through the feature deletion operation, and the selected features are output. The selected features are input into the common deconvolution layer, which performs deconvolution processing on them and outputs a second feature map. The second feature map is input to the BN layer, i.e., the BatchNorm layer, through which the second feature map can be normalized to the same distribution, such as a distribution with a mean of 0 and a variance of 1, so that the data distribution is easier for the network to learn and less prone to overfitting. The output of the BN layer is activated through the Relu layer, which introduces sparsity into the network, alleviates overfitting, brings the model closer to a real neuron activation model, makes the model easier to converge, and avoids the vanishing gradient problem. The output of the Relu layer is, on one hand, retained and, on the other hand, input to the attention layer; the output of the attention layer and the retained output of the Relu layer are input to the connection layer, through which the third feature map is output.
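The components of fig. 4 may be assembled as in the following hedged sketch of one possible efficient deconvolution layer: feature deletion, a common deconvolution layer, BN, Relu, an attention layer and a connection layer. The kernel size, stride and channel split are assumptions, and the SEAttention class is reused from the sketch above (the two snippets must be pasted together to run).

    # Hedged sketch: one possible efficient deconvolution layer, not the patent's exact layer.
    import torch
    import torch.nn as nn

    class EfficientDeconvLayer(nn.Module):
        def __init__(self, in_channels: int, out_channels: int):
            super().__init__()
            kept = in_channels // 2                          # feature deletion keeps half the channels
            self.deconv = nn.ConvTranspose2d(kept, out_channels // 2,
                                             kernel_size=4, stride=2, padding=1)
            self.bn = nn.BatchNorm2d(out_channels // 2)
            self.relu = nn.ReLU(inplace=True)
            self.attention = SEAttention(out_channels // 2)  # attention layer from the previous sketch

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = x[:, : x.shape[1] // 2]                      # feature deletion (selection) operation
            x = self.relu(self.bn(self.deconv(x)))           # second feature map
            screened = self.attention(x)                     # feature screening (extraction result)
            return torch.cat([screened, x], dim=1)           # connection layer: third feature map

    layer = EfficientDeconvLayer(in_channels=128, out_channels=128)
    print(layer(torch.randn(1, 128, 8, 8)).shape)            # torch.Size([1, 128, 16, 16])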
In one embodiment, performing a prediction operation on the image to be processed by the coding sub-network, and outputting the first feature map includes: performing prediction operation on an image to be processed through a coding processing layer of a coding sub-network to obtain an output characteristic diagram; carrying out fusion processing on the output characteristic diagram of the previous coding processing layer in the coding sub-network and the input characteristic diagram of the previous coding processing layer; and inputting the fused feature map into the next coding processing layer until the last coding processing layer of the coding sub-network outputs the first feature map.
The coding sub-network may comprise a plurality of coding processing layers, specifically a convolutional layer, a plurality of residual blocks, and the like. The plurality of coding processing layers of the coding sub-network perform preset operations on the image to be processed, and the output of each coding processing layer may be collectively referred to as an output feature map. During the encoding process of the coding sub-network, the output feature map of the previous coding processing layer and the input feature map of the previous coding processing layer may be fused to obtain a fused feature map, which is used as the input of the next coding processing layer, until the last coding processing layer of the coding sub-network outputs the first feature map.
In one embodiment, before the fused feature map is input to the next coding processing layer, the fused feature map may be further normalized to obtain a normalized feature map. Activation processing is then performed on the normalized feature map, and the activated feature map is taken as the input of the next coding processing layer. The activation function employed by the activation processing may be the Relu function.
In this embodiment, the output feature map of the previous coding processing layer in the coding sub-network and the input feature map of that layer are fused, and the fused feature map is input to the next coding processing layer until the last coding processing layer of the coding sub-network outputs the first feature map. In this way, higher-resolution features can be extracted from the face image to be processed, and through the subsequent decoding process of the decoding sub-network these high-resolution features can be fused with the lower-resolution features extracted during decoding, so that richer features and a more accurate texture position map are obtained, further improving the accuracy of face key point positioning.
In one embodiment, as shown in fig. 5, a residual block of the coding sub-network may include a plurality of convolutional layers, an Add layer, a BN layer, and a Relu layer; for example, the number of convolutional layers may be 4. The convolutional layers are connected in sequence, the last convolutional layer is connected to the Add layer, and the Add layer, the BN layer and the Relu layer are connected in sequence. The Add layer adds the output features of the last convolutional layer of the residual block to the input features of the residual block to realize feature fusion, and the output of the Add layer then passes through the BN layer and the Relu layer to give the output features of the residual block.
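A hedged sketch of such a residual block follows. Four 3 x 3 convolution layers are assumed, following the later example; the 1 x 1 projection on the shortcut is an added assumption needed whenever the channel count or stride changes, a case the patent does not discuss.

    # Hedged sketch of the residual block in fig. 5: convolutions, Add, BN, Relu.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
            super().__init__()
            layers, c = [], in_channels
            for i in range(4):                                        # 4 convolution layers
                layers.append(nn.Conv2d(c, out_channels, kernel_size=3,
                                        stride=stride if i == 0 else 1, padding=1))
                c = out_channels
            self.convs = nn.Sequential(*layers)
            self.project = None                                       # assumed shortcut projection
            if stride != 1 or in_channels != out_channels:
                self.project = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
            self.bn = nn.BatchNorm2d(out_channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            identity = x if self.project is None else self.project(x)
            out = self.convs(x) + identity                            # Add layer: feature fusion
            return self.relu(self.bn(out))                            # BN layer then Relu layer

    block = ResidualBlock(8, 16, stride=2)
    print(block(torch.randn(1, 8, 128, 128)).shape)                   # torch.Size([1, 16, 64, 64])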
In one embodiment, as shown in fig. 6, the network structure of the face key point positioning model may include: an input layer, a first convolutional layer, a plurality of residual blocks, a first Add layer, a plurality of efficient deconvolution layers, a first deconvolution layer, a second convolutional layer, a second deconvolution layer, and an output layer. The number of feature maps output by the next residual block is twice the number of feature maps output by the previous residual block. Sufficiently rich features can therefore be extracted, the number of feature channels is gradually increased, and the nonlinear learning capability of the model is enhanced.
The efficient deconvolution layers are the preset deconvolution layers, and the first deconvolution layer and the second deconvolution layer are common deconvolution layers. The input layer, the first convolutional layer and the plurality of residual blocks are connected in sequence; the first Add layer is connected to the last residual block and to the first efficient deconvolution layer respectively, the output of the last residual block and the output of the first efficient deconvolution layer are fused by the first Add layer, and the output of the first Add layer is used as the input of the second efficient deconvolution layer. The second efficient deconvolution layer, the subsequent efficient deconvolution layers, the first deconvolution layer, the second convolutional layer, the second deconvolution layer and the output layer are connected in sequence.
In this embodiment, the efficient deconvolution layers are used in place of common deconvolution layers, which increases the feature processing speed and addresses the difficulty conventional methods have in achieving real-time performance on mobile terminals. By adding the first Add layer, which is connected to the last residual block and to the first efficient deconvolution layer respectively, the output of the last residual block and the output of the first efficient deconvolution layer can be fused, realizing feature enhancement and improving the prediction accuracy of the model.
Furthermore, the face key point positioning model can also comprise a Euclidean distance layer connected to the second deconvolution layer, the Euclidean distance layer receives the texture position image output by the second deconvolution layer, the loss function value of the face key point positioning model in the training process is calculated, and the network parameters of the model are updated according to the loss function value by adopting a gradient descent algorithm.
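As a hedged sketch, the Euclidean distance layer may be read as computing a mean squared (Euclidean) distance between the predicted and ground-truth texture position maps; a weighted variant that emphasises key point regions would also fit the description but is not spelled out in the patent.

    # Hedged sketch: Euclidean-distance loss between predicted and label UV position maps.
    import torch

    def uv_position_loss(predicted_uv: torch.Tensor, target_uv: torch.Tensor) -> torch.Tensor:
        # both tensors: (batch, 3, H, W) texture position maps
        return torch.mean(torch.sum((predicted_uv - target_uv) ** 2, dim=1))

    pred, target = torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256)
    print(uv_position_loss(pred, target).item())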
In an embodiment, before obtaining the face image to be processed, the method further includes a training step of the face key point location model, specifically including:
step 702, a sample face image set is obtained.
Step 704, inputting the sample face image set to a preset face key point positioning model, and outputting a first feature map through a coding sub-network of the preset face key point positioning model.
Step 706, inputting the first feature map and the sample face image set into a decoding sub-network of the preset face key point positioning model, and outputting a texture position map and a loss function value of the preset face key point positioning model.
Step 708, stopping training the preset face key point positioning model when the loss function value meets a preset condition to obtain the trained face key point positioning model.
The sample face image set refers to the training set used for training the face key point positioning model; the images in it are single-channel gray-scale images. It should be understood that the face images in the sample face image set may also include images taken directly of a human face. The sample face image data set may also include a validation set, used for determining the parameters that control the complexity of the network structure or model, and a test set, used for evaluating the performance of the network or model.
The preset face key point positioning model is the face key point positioning model that needs to be trained. The preset face key point positioning model comprises a coding sub-network and a decoding sub-network; the coding sub-network is the network used for the encoding-stage processing and the decoding sub-network is the network used for the decoding stage. The coding sub-network may comprise a plurality of coding processing layers, specifically an input layer, a convolutional layer, a plurality of residual blocks, and the like. The decoding sub-network may include a plurality of decoding processing layers, specifically a plurality of preset deconvolution layers, Add layers, a plurality of common deconvolution layers, convolution layers, an output layer, and the like. The preset deconvolution layer is an efficient deconvolution layer, which offers higher feature processing efficiency than a common deconvolution layer. The common deconvolution layer may also be referred to as a transposed convolution layer.
The texture position image is a face texture image corresponding to the face of the face target in the face image to be processed. In the process of model training, a loss function is used for measuring the error between a predicted value and a true value, the loss function value is generally a non-negative number, the smaller the loss function value is, the smaller the error is, and a square function is generally adopted. The server inputs the sample face image set to a preset face key point positioning model, the sample face image set is used as the input of a coding sub-network, the sample face image set is subjected to prediction operation through a plurality of coding processing layers in the coding sub-network, and a first feature map is output through the last coding processing layer of the coding sub-network. And inputting the first feature map and the sample face image set into a decoding sub-network of the preset face key point positioning model, and outputting the texture position map and the loss function value of the preset face key point positioning model. Further, in the decoding stage, the decoding sub-network may calculate a loss function value of the face key point location model according to the texture location map and the location label of the sample face image set, and update a network parameter of the model according to the loss function value by using a gradient descent algorithm. The prediction process of the coding sub-network and the decoding sub-network in the preset human face key point positioning model is the same as the positioning process of the human face key point positioning model trained in the human face key point positioning method, and the description is omitted here.
The loss function value is used for evaluating the preset face key point positioning model. During the training of the preset face key point positioning model, the loss function value decreases continuously. When the loss function value meets the preset condition, that is, when it tends to be stable and no longer changes, the loss function value has converged and the training of the preset face key point positioning model is complete; training can then be stopped, and the model at that point is taken as the trained face key point positioning model.
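A hedged training-loop sketch consistent with the description above is given below; the optimiser, learning rate and convergence tolerance are placeholders, since the patent only requires a gradient descent update and a stopping condition in which the loss function value no longer changes.

    # Hedged sketch: train until the loss function value stabilises, then stop.
    import torch

    def train(model, data_loader, loss_fn, epochs=100, tolerance=1e-4):
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)     # gradient descent update
        previous_loss = float("inf")
        for epoch in range(epochs):
            epoch_loss = 0.0
            for images, uv_labels in data_loader:
                optimizer.zero_grad()
                loss = loss_fn(model(images), uv_labels)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            epoch_loss /= max(len(data_loader), 1)
            if abs(previous_loss - epoch_loss) < tolerance:           # loss value tends to be stable
                break                                                 # stop training the preset model
            previous_loss = epoch_loss
        return model                                                  # trained positioning model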
In this embodiment, deconvolution processing and feature screening are performed on the first feature map through the first preset deconvolution layer in the decoding sub-network, so the effective features with larger weight parameters can be determined, attention is focused on the effective features, and the time wasted on processing ineffective features is reduced. This improves the prediction efficiency of the face key point positioning model and thus the face key point positioning efficiency, so the method can be applied in scenarios with higher real-time requirements. Meanwhile, performing deconvolution processing and feature screening on the first feature map through the decoding sub-network generates richer features than the conventional approach, improving the accuracy of face key point positioning.
In one embodiment, as shown in fig. 8, acquiring the sample face image set includes:
step 802, an original face image set is obtained.
And step 804, acquiring a position label corresponding to each face image in the original face image set.
Step 808, performing three-dimensional face reconstruction according to the position label, performing space rendering on the reconstructed face to obtain a sample texture position map, and taking the sample texture position map as a training label.
Step 810, carrying out normalization processing on the original face image set and the training labels to obtain a sample face image set.
The original face image set contains face images obtained by detecting faces in images and cropping out the detected face regions. The original face image set can be stored locally in advance or obtained from a server. The original face image set comprises a plurality of face images; the server obtains the position label corresponding to each face image, where the position label is the real annotation label of the face image. The server then performs three-dimensional face reconstruction according to the position label; for example, the three-dimensional face reconstruction may be performed with a BFM (Basel Face Model). The server renders the reconstructed face in UV space to obtain a sample texture position map, which is a UV position map expressing the face information, and the sample texture position map is then used as the training label. The server performs normalization processing on the original face image set and the training labels to obtain the sample face image set.
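The exact normalization used for the sample face image set is not specified; the sketch below follows one common convention (pixel values divided by 255, UV position values divided by the image size) and is only an assumption.

    # Hedged sketch: normalize a face image and its sample texture position map to [0, 1].
    import numpy as np

    def normalize_sample(face_image, uv_position_map, image_size=256):
        image = face_image.astype(np.float32) / 255.0                 # pixel values to [0, 1]
        uv_map = uv_position_map.astype(np.float32) / image_size      # positions to roughly [0, 1]
        return image, uv_map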
In this embodiment, three-dimensional face reconstruction is performed according to the position label corresponding to each face image in the original face image set, spatial rendering is performed on the reconstructed face to obtain a sample texture position map, the sample texture position map is used as the training label, and normalization processing is performed on the original face image set and the training labels to obtain the sample face image set. Three-dimensional face key point prediction is performed by using a deep learning method to directly learn and regress three-dimensional face information; richer face information can be obtained, and the problem that two-dimensional key point positioning algorithms have larger errors when the face is in a large-angle pose is alleviated.
The following illustrates a network structure of the preset face key point positioning model and how it is trained. The size of the images in the sample face image set is 128 × 128 × 3. The network structure of the preset face key point positioning model comprises a coding sub-network and a decoding sub-network, which are connected to each other. The coding sub-network may include 1 convolutional layer, 8 residual blocks, and the like, and the decoding sub-network may include 8 efficient deconvolution layers, 1 Add layer, 2 deconvolution layers, 1 convolutional layer, and the like.
In the coding sub-network, the first convolutional layer has a plurality of filters. The first convolutional layer is connected to the first BN layer and the first Relu layer, and the output of the first Relu layer is directly input to the first residual block. Specifically, the first convolutional layer includes 8 filters, each filter corresponding to 1 convolution kernel; the size of each convolution kernel is 3 × 3 and the step size of the convolution operation is 1, so the feature size output by the first convolutional layer is 128 × 128 × 8, that is, 8 feature maps of 128 × 128 pixels are obtained. The first convolutional layer is sequentially connected to the first BN layer and the first Relu layer: the output of the first convolutional layer is normalized by the first BN layer, and the output of the first BN layer is activated by the first Relu layer; the processing of the first BN layer and the first Relu layer does not change the size of the feature maps output by the first convolutional layer.
The residual blocks are grouped according to the number of filters, with two residual blocks in each group. The step size of the convolution operation in the first residual block of each group is set to 2 to reduce the size of the output feature map, and the step size of the convolution operation in the second residual block is set to 1 to increase the depth of the network. The grouped residual blocks, in connection order, gradually increase the number of feature channels so as to gradually increase the number of feature maps. Each residual block includes a plurality of convolution layers, an Add layer, a BN layer, and a Relu layer, and a convolutional layer may include a plurality of convolution kernels. The output of the Relu layer in the previous residual block is directly input to the next residual block.
Specifically, the first residual block includes 4 convolutional layers. The convolutional layers are connected in sequence, the last convolutional layer is connected to the corresponding Add layer, and the Add layer, the BN layer and the Relu layer are connected in sequence. The 4 convolutional layers may include 16 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2, so as to reduce the size of the output feature map. The last convolutional layer of the first residual block outputs 16 feature maps of 64 × 64 pixels. The last convolutional layer of the first residual block is connected to the Add layer, which adds the output of the last convolutional layer of the residual block to the input of the residual block to realize feature fusion; the Add layer is sequentially connected to the BN layer and the Relu layer, the output of the Add layer is normalized by the BN layer, and the output of the BN layer is activated by the Relu layer. The processing of the Add layer, the BN layer and the Relu layer does not change the size of the feature maps output by the last convolutional layer. The output of the Relu layer in the first residual block is directly input to the second residual block.
The second residual block comprises 16 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1, which increases the depth of the network. The output of the last convolutional layer in the second residual block passes through the corresponding Add layer, BN layer and Relu layer, and the output of the Relu layer in the second residual block is directly input to the third residual block.

The third residual block comprises 32 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2, which reduces the size of the output feature map. After the output of the last convolutional layer in the third residual block passes through the corresponding Add layer, BN layer and Relu layer, 32 feature maps of 32 × 32 pixels are output. The output of the Relu layer in the third residual block is directly input to the fourth residual block.

The fourth residual block comprises 32 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1, which increases the depth of the network. After the output of the last convolutional layer in the fourth residual block passes through the corresponding Add layer, BN layer and Relu layer, 32 feature maps of 32 × 32 pixels are output. The output of the Relu layer in the fourth residual block is directly input to the fifth residual block.

The fifth residual block comprises 64 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2, which reduces the size of the output feature map. After the output of the last convolutional layer in the fifth residual block passes through the corresponding Add layer, BN layer and Relu layer, 64 feature maps of 16 × 16 pixels are output. The output of the Relu layer in the fifth residual block is directly input to the sixth residual block.

The sixth residual block comprises 64 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1, which increases the depth of the network. After the output of the last convolutional layer in the sixth residual block passes through the corresponding Add layer, BN layer and Relu layer, 64 feature maps of 16 × 16 pixels are output. The output of the Relu layer in the sixth residual block is directly input to the seventh residual block.

The seventh residual block comprises 128 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2, which reduces the size of the output feature map. After the output of the last convolutional layer in the seventh residual block passes through the corresponding Add layer, BN layer and Relu layer, 128 feature maps of 8 × 8 pixels are output. The output of the Relu layer in the seventh residual block is directly input to the eighth residual block.

The eighth residual block comprises 128 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1, which increases the depth of the network. After the output of the last convolutional layer in the eighth residual block passes through the corresponding Add layer, BN layer and Relu layer, 128 feature maps of 8 × 8 pixels are output. The output of the Relu layer in the eighth residual block is directly input to the first Add layer of the decoding sub-network.
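Taken together, the eight residual blocks alternate stride 2 (halving the spatial size) and stride 1 (deepening the network) while the number of kernels grows 16 → 32 → 64 → 128. Continuing the illustrative sketches above (and reusing their `stem` and `ResidualBlock`, which are assumptions of the sketch, not of the description), the encoding sub-network could be assembled roughly as follows:

```python
import torch
import torch.nn as nn

# (out_channels, stride) for the eight residual blocks described above.
schedule = [(16, 2), (16, 1), (32, 2), (32, 1),
            (64, 2), (64, 1), (128, 2), (128, 1)]

blocks, in_ch = [], 8                 # 8 channels come from the stem
for out_ch, stride in schedule:
    blocks.append(ResidualBlock(in_ch, out_ch, stride))
    in_ch = out_ch

encoder = nn.Sequential(stem, *blocks)

x = torch.randn(1, 3, 128, 128)       # 128 x 128 x 3 sample face image
print(encoder(x).shape)               # torch.Size([1, 128, 8, 8]): 128 feature maps of 8 x 8 pixels
```

The 8 × 8 × 128 output is exactly the feature map that the eighth residual block passes to the first Add layer of the decoding sub-network.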
In the decoding sub-network, the first efficient deconvolution layer comprises 128 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1. The sample face image set is input to the first efficient deconvolution layer, the output of the first efficient deconvolution layer is directly input to the first Add layer, the first Add layer fuses the output of the first efficient deconvolution layer with the feature map output by the eighth residual block, and the fused feature map is input to the second efficient deconvolution layer.

The second efficient deconvolution layer includes 128 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2, which enlarges the size of the output feature map. The second efficient deconvolution layer outputs 128 feature maps of 16 × 16 pixels, and its output is input to the third efficient deconvolution layer.
The third efficient deconvolution layer includes 64 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1. After passing through the corresponding BN layer and Relu layer, the output of the third efficient deconvolution layer is 64 feature maps of 16 × 16 pixels. The output of the Relu layer is directly input to the fourth efficient deconvolution layer.
The fourth efficient deconvolution layer includes 64 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2, which enlarges the size of the output feature map. After passing through the corresponding BN layer and Relu layer, the output of the fourth efficient deconvolution layer is 64 feature maps of 32 × 32 pixels. The output of the Relu layer is directly input to the fifth efficient deconvolution layer.

The fifth efficient deconvolution layer includes 32 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1. After passing through the corresponding BN layer and Relu layer, the output of the fifth efficient deconvolution layer is 32 feature maps of 32 × 32 pixels. The output of the Relu layer is directly input to the sixth efficient deconvolution layer.

The sixth efficient deconvolution layer includes 32 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2. After passing through the corresponding BN layer and Relu layer, the output of the sixth efficient deconvolution layer is 32 feature maps of 64 × 64 pixels. The output of the Relu layer is directly input to the seventh efficient deconvolution layer.

The seventh efficient deconvolution layer includes 16 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1. After passing through the corresponding BN layer and Relu layer, the output of the seventh efficient deconvolution layer is 16 feature maps of 64 × 64 pixels. The output of the Relu layer is directly input to the eighth efficient deconvolution layer.

The eighth efficient deconvolution layer includes 16 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 2. After passing through the corresponding BN layer and Relu layer, the output of the eighth efficient deconvolution layer is 16 feature maps of 128 × 128 pixels. The output of the Relu layer is directly input to the first ordinary deconvolution layer.
The first ordinary deconvolution layer comprises 3 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1. The output of the first ordinary deconvolution layer is input to the second convolutional layer, and BN normalization and Relu activation are carried out on the output of the second convolutional layer. The second convolutional layer comprises 3 convolution kernels, each with a size of 3 × 3, and the step size of the convolution operation is 1. The output of the second convolutional layer is input to the last ordinary deconvolution layer, i.e. the second ordinary deconvolution layer, which likewise comprises 3 convolution kernels of size 3 × 3 with a convolution step size of 1. BN normalization and Relu activation are carried out on the output of the second ordinary deconvolution layer to obtain the texture position map.
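By way of illustration and not limitation, the upsampling path from the fused 8 × 8 × 128 feature map to the 128 × 128 × 3 texture position map may be sketched as follows. A transposed convolution is used here only as a stand-in for an efficient deconvolution layer (whose internal feature screening and splicing are not reproduced), the padding values are assumptions, and the first efficient deconvolution layer and first Add layer that fuse the face image with the encoder output are omitted.

```python
import torch
import torch.nn as nn

def up(in_ch, out_ch, stride):
    """Stand-in for one efficient deconvolution layer followed by BN and Relu.
    stride=2 doubles the spatial size; stride=1 keeps it (assumed padding)."""
    out_pad = 1 if stride == 2 else 0
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=stride, padding=1, output_padding=out_pad),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(
    up(128, 128, 2),                                    # second efficient deconvolution layer: 8x8 -> 16x16
    up(128, 64, 1),                                     # third:   16x16, 128 -> 64 channels
    up(64, 64, 2),                                      # fourth:  16x16 -> 32x32
    up(64, 32, 1),                                      # fifth:   32x32, 64 -> 32 channels
    up(32, 32, 2),                                      # sixth:   32x32 -> 64x64
    up(32, 16, 1),                                      # seventh: 64x64, 32 -> 16 channels
    up(16, 16, 2),                                      # eighth:  64x64 -> 128x128
    nn.ConvTranspose2d(16, 3, 3, stride=1, padding=1),  # first ordinary deconvolution layer
    nn.Conv2d(3, 3, 3, stride=1, padding=1),            # second convolutional layer
    nn.BatchNorm2d(3), nn.ReLU(inplace=True),           # BN normalization and Relu activation
    nn.ConvTranspose2d(3, 3, 3, stride=1, padding=1),   # second ordinary deconvolution layer
    nn.BatchNorm2d(3), nn.ReLU(inplace=True),           # BN normalization and Relu activation
)

fused = torch.randn(1, 128, 8, 8)      # fused output of the first Add layer
print(decoder(fused).shape)            # torch.Size([1, 3, 128, 128]): the texture position map
```

The final 128 × 128 × 3 output matches the spatial size of the input face image, which is what allows the texture position map to be indexed for key points later on.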
The network structure of the preset face key point positioning model may further comprise a Euclidean distance layer connected to the second ordinary deconvolution layer. The Euclidean distance layer computes, from the texture position map and the position labels of the sample face image set used for training, the loss function value of the preset face key point positioning model, and a gradient descent algorithm updates the network parameters of the model according to the loss function value.
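By way of illustration and not limitation, the loss computation and gradient-descent update described above could look roughly like the following. The choice of mean squared (Euclidean) distance and of plain SGD, and all names below, are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, label_uv_maps):
    """One training iteration of the preset face key point positioning model.
    images:        batch of 128 x 128 x 3 sample face images
    label_uv_maps: the sample texture position maps used as training labels
    """
    pred_uv_maps = model(images)                     # forward pass outputs texture position maps
    loss = F.mse_loss(pred_uv_maps, label_uv_maps)   # Euclidean-distance-style loss function value
    optimizer.zero_grad()
    loss.backward()                                  # back-propagate the loss
    optimizer.step()                                 # gradient descent updates the network parameters
    return loss.item()

# Example wiring (names assumed): optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```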
It should be understood that although the steps in the flowcharts of fig. 2, 3, 7 and 8 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 3, 7 and 8 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a face keypoint locating device, including: a communication module 902, a first output module 904, a second output module 906, and a calculation module 910, wherein:
The communication module 902 is configured to acquire a face image to be processed.
A first output module 904, configured to input a face image to be processed into a trained face key point location model, where the face key point location model includes a coding sub-network and a decoding sub-network; and performing prediction operation on the face image to be processed through the coding sub-network, outputting a first feature map, and inputting the first feature map and the face image to be processed into a decoding processing layer of the decoding sub-network.
A second output module 906, configured to perform deconvolution processing on the face image to be processed through a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, perform feature screening on the second feature map, perform splicing processing on the screened feature map and the second feature map, output a third feature map, and input the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by a last decoding processing layer of the decoding subnetwork.
The calculation module 910 is configured to calculate, according to the texture position map and a preset index, the face key point feature information corresponding to the face image to be processed.
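By way of illustration and not limitation, the calculation performed by the calculation module 910 could be sketched as a simple index look-up into the texture position map; the array shapes, the (row, column) form of the preset index, and the NumPy framework are assumptions of the sketch, not asserted above.

```python
import numpy as np

def keypoints_from_uv_map(uv_position_map, preset_index):
    """Read face key point feature information out of a texture position map.
    uv_position_map: (H, W, 3) array holding a position at every pixel.
    preset_index:    (N, 2) array of fixed (row, col) positions, one per key point.
    Returns an (N, 3) array of key point coordinates."""
    rows, cols = preset_index[:, 0], preset_index[:, 1]
    return uv_position_map[rows, cols]

# e.g. 68 key points from a 128 x 128 texture position map
uv_map = np.random.rand(128, 128, 3).astype(np.float32)
idx = np.random.randint(0, 128, size=(68, 2))
print(keypoints_from_uv_map(uv_map, idx).shape)   # (68, 3)
```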
In one embodiment, the first preset deconvolution layer includes an attention layer and a connection layer, and the second output module 906 is further configured to extract a weight parameter corresponding to each channel feature in the second feature map through the attention layer, and generate an extraction result; and inputting the extraction result and the second feature map into the connecting layer, splicing the extraction result and the second feature map through the connecting layer, and outputting a third feature map.
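By way of illustration and not limitation, the attention layer and connection layer described above might be sketched as follows; the squeeze-and-excitation style of computing one weight per channel, the reduction factor, and all names are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AttentionConcat(nn.Module):
    """Sketch: extract a weight for each channel of the second feature map
    (attention layer), apply it to obtain the extraction result, then splice
    the extraction result with the second feature map (connection layer)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze spatial dimensions
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # one weight per channel feature
        )

    def forward(self, second_feature_map):
        weights = self.attn(second_feature_map)
        extraction = second_feature_map * weights                    # extraction result
        return torch.cat([extraction, second_feature_map], dim=1)    # spliced third feature map

third = AttentionConcat(16)(torch.randn(1, 16, 64, 64))
print(third.shape)   # torch.Size([1, 32, 64, 64])
```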
In one embodiment, the first output module 904 is further configured to perform a prediction operation on the image to be processed through a coding processing layer of the coding sub-network, so as to obtain an output feature map; carrying out fusion processing on the output characteristic diagram of the previous coding processing layer in the coding sub-network and the input characteristic diagram of the previous coding processing layer; and inputting the fused feature map into the next coding processing layer until the last coding processing layer of the coding sub-network outputs the first feature map.
In one embodiment, the decoding processing layer includes a fusion layer, and the first output module 904 is further configured to input the third feature map and the first feature map into the fusion layer, and perform fusion processing on the third feature map and the first feature map through the fusion layer to obtain a fusion feature map; and taking the fused feature map as the input of the next decoding processing layer.
In one embodiment, the first output module 904 is further configured to perform feature extraction on the face image to be processed through an input layer in the decoding processing layer; and inputting the extracted features into a first preset deconvolution layer, and performing feature deletion processing on the extracted features through the first preset deconvolution layer.
In one embodiment, the above apparatus further comprises:
the image set acquisition module is used for acquiring a sample face image set;
the first output module 904 is further configured to input the sample face image set to the preset face key point positioning model, and output a first feature map through a coding sub-network of the preset face key point positioning model;
a second output module 906, further configured to input the first feature map and the sample face image set into a decoding subnetwork of the preset face key point location model, and output a texture location map and a loss function value of the preset face key point location model;
and the condition judgment module is used for stopping training the preset face key point positioning model when the loss function value meets the preset condition to obtain the trained face key point positioning model.
In one embodiment, the image set acquisition module is further configured to acquire an original face image set; acquiring position labels corresponding to all face images in an original face image set; performing three-dimensional face reconstruction according to the position label, performing space rendering on the reconstructed face to obtain a sample texture position map, and taking the sample texture position map as a training label; and carrying out normalization processing on the original face image set and the training labels to obtain a sample face image set.
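By way of illustration and not limitation, the final normalization step might look like the following; the specific scaling (pixel values to [0, 1], label coordinates by the image size) and all names are assumptions of the sketch, since the description above only states that normalization is performed.

```python
import numpy as np

def normalize_sample(face_image, sample_uv_label, image_size=128):
    """Normalize one original face image and its training label
    (the sample texture position map) before adding them to the
    sample face image set."""
    face_image = face_image.astype(np.float32) / 255.0
    sample_uv_label = sample_uv_label.astype(np.float32) / float(image_size)
    return face_image, sample_uv_label
```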
For the specific definition of the face keypoint locating device, reference may be made to the above definition of the face keypoint locating method, which is not described herein again. All or part of the modules in the face key point positioning device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing the trained face key point positioning model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for locating key points of a human face.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the various embodiments described above when the processor executes the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the respective embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM) or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that, for a person of ordinary skill in the art, several variations and modifications can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for locating key points of a human face is characterized by comprising the following steps:
acquiring a face image to be processed;
inputting the face image to be processed into a trained face key point positioning model, wherein the face key point positioning model comprises a coding sub-network and a decoding sub-network;
performing prediction operation on the face image to be processed through the coding sub-network, outputting a first feature map, and inputting the first feature map and the face image to be processed to a decoding processing layer of the decoding sub-network;
carrying out deconvolution processing on the face image to be processed through a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, carrying out feature screening on the second feature map, carrying out splicing processing on the screened feature map and the second feature map, outputting a third feature map, and inputting the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by a last decoding processing layer of the decoding subnetwork;
and calculating to obtain the feature information of the key points of the face corresponding to the face image to be processed according to the texture position image and a preset index.
2. The method according to claim 1, wherein the first preset deconvolution layer includes an attention layer and a connection layer, and the performing feature screening on the second feature map, performing splicing processing on the screened feature map and the second feature map, and outputting a third feature map comprises:
extracting the weight parameters corresponding to the channel features in the second feature map through the attention layer to generate an extraction result;
inputting the extraction result and the second feature map into a connecting layer, splicing the extraction result and the second feature map through the connecting layer, and outputting a third feature map.
3. The method of claim 1, wherein the performing a prediction operation on the image to be processed by the coding sub-network to output a first feature map comprises:
performing prediction operation on the image to be processed through a coding processing layer of the coding subnetwork to obtain an output characteristic diagram;
performing fusion processing on the output feature map of the previous coding processing layer in the coding sub-network and the input feature map of the previous coding processing layer;
and inputting the fused feature map into the next coding processing layer until the last coding processing layer of the coding sub-network outputs the first feature map.
4. The method of any of claims 1 to 3, wherein the decoding process layers comprise a fusion layer, and wherein inputting the third feature map and the first feature map into a next decoding process layer until a last decoding process layer of the decoding sub-network outputs a texture location map comprises:
inputting the third feature map and the first feature map into a fusion layer, and performing fusion processing on the third feature map and the first feature map through the fusion layer to obtain a fusion feature map;
and taking the fused feature map as the input of the next decoding processing layer.
5. The method according to any one of claims 1 to 3, wherein before the deconvolution processing is performed on the face image to be processed by a first preset deconvolution layer in the decoding processing layers, the method further comprises:
performing feature extraction on the face image to be processed through an input layer in the decoding processing layer;
inputting the extracted features into a first preset deconvolution layer, and performing feature deletion processing on the extracted features through the first preset deconvolution layer.
6. The method according to claim 1, wherein before the obtaining of the face image to be processed, the method further comprises:
acquiring a sample face image set;
inputting the sample face image set into a preset face key point positioning model, and outputting a first feature map through a coding sub-network of the preset face key point positioning model;
inputting the first feature map and the sample face image set into a decoding sub-network of a preset face key point positioning model, and outputting a texture position map and a loss function value of the preset face key point positioning model;
and when the loss function value meets a preset condition, stopping training the preset face key point positioning model to obtain a trained face key point positioning model.
7. The method of claim 6, wherein obtaining the sample set of face images comprises:
acquiring an original face image set;
acquiring position labels corresponding to the face images in the original face image set;
performing three-dimensional face reconstruction according to the position label, performing space rendering on the reconstructed face to obtain a sample texture position map, and taking the sample texture position map as a training label;
and carrying out normalization processing on the original face image set and the training labels to obtain a sample face image set.
8. A face keypoint locating apparatus, the apparatus comprising:
the communication module is used for acquiring a face image to be processed;
the first output module is used for inputting the face image to be processed into a trained face key point positioning model, and the face key point positioning model comprises a coding sub-network and a decoding sub-network; performing prediction operation on the face image to be processed through the coding sub-network, outputting a first feature map, and inputting the first feature map and the face image to be processed to a decoding processing layer of the decoding sub-network;
a second output module, configured to perform deconvolution processing on the face image to be processed by using a first preset deconvolution layer in the decoding processing layers to obtain a second feature map, perform feature screening on the second feature map, perform splicing processing on the screened feature map and the second feature map, output a third feature map, and input the third feature map and the first feature map to a next decoding processing layer until a texture position map is output by a last decoding processing layer of the decoding subnetwork;
and the computing module is used for computing the face key point characteristic information corresponding to the face image to be processed according to the texture position image and a preset index.
9. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010985328.9A 2020-09-18 2020-09-18 Face key point positioning method and device, computer equipment and storage medium Pending CN112115860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010985328.9A CN112115860A (en) 2020-09-18 2020-09-18 Face key point positioning method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112115860A true CN112115860A (en) 2020-12-22

Family

ID=73800295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010985328.9A Pending CN112115860A (en) 2020-09-18 2020-09-18 Face key point positioning method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112115860A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096538A (en) * 2016-06-08 2016-11-09 中国科学院自动化研究所 Face identification method based on sequencing neural network model and device
US20190164341A1 (en) * 2017-11-27 2019-05-30 Fotonation Limited Systems and Methods for 3D Facial Modeling
CN108304765A (en) * 2017-12-11 2018-07-20 中国科学院自动化研究所 Multitask detection device for face key point location and semantic segmentation
US20200234034A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Systems and methods for face reenactment
CN110309706A (en) * 2019-05-06 2019-10-08 深圳市华付信息技术有限公司 Face critical point detection method, apparatus, computer equipment and storage medium
CN110334587A (en) * 2019-05-23 2019-10-15 北京市威富安防科技有限公司 Training method, device and the crucial independent positioning method of face key point location model
CN110113616A (en) * 2019-06-05 2019-08-09 杭州电子科技大学 A kind of multi-layer monitor video Efficient Compression coding, decoding apparatus and method
CN111274977A (en) * 2020-01-22 2020-06-12 中能国际建筑投资集团有限公司 Multitask convolution neural network model, using method, device and storage medium
CN111639517A (en) * 2020-02-27 2020-09-08 北京迈格威科技有限公司 Face image screening method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Wei, QIAN Yuntao, et al., "Compact face key point detection network incorporating global constraints", Signal Processing (《信号处理》), vol. 35, pp. 507-515 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112907708A (en) * 2021-02-05 2021-06-04 深圳瀚维智能医疗科技有限公司 Human face cartoon method, equipment and computer storage medium
CN112907708B (en) * 2021-02-05 2023-09-19 深圳瀚维智能医疗科技有限公司 Face cartoon method, equipment and computer storage medium
CN113609900A (en) * 2021-06-25 2021-11-05 南京信息工程大学 Local generation face positioning method and device, computer equipment and storage medium
CN113609900B (en) * 2021-06-25 2023-09-12 南京信息工程大学 Face positioning method and device for local generation, computer equipment and storage medium
CN114119923A (en) * 2021-11-29 2022-03-01 浙江大学 Three-dimensional face reconstruction method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN109389030B (en) Face characteristic point detection method and device, computer equipment and storage medium
CN110889325B (en) Multitasking facial motion recognition model training and multitasking facial motion recognition method
CN109543627B (en) Method and device for judging driving behavior category and computer equipment
CN110135406B (en) Image recognition method and device, computer equipment and storage medium
CN109344742B (en) Feature point positioning method and device, storage medium and computer equipment
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
CN111950329A (en) Target detection and model training method and device, computer equipment and storage medium
CN111814794B (en) Text detection method and device, electronic equipment and storage medium
CN112115860A (en) Face key point positioning method and device, computer equipment and storage medium
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
CN111680672B (en) Face living body detection method, system, device, computer equipment and storage medium
CN111192278B (en) Semantic segmentation method, semantic segmentation device, computer equipment and computer readable storage medium
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN112183295A (en) Pedestrian re-identification method and device, computer equipment and storage medium
CN111062324A (en) Face detection method and device, computer equipment and storage medium
CN111191533A (en) Pedestrian re-identification processing method and device, computer equipment and storage medium
CN113435330A (en) Micro-expression identification method, device, equipment and storage medium based on video
CN113192175A (en) Model training method and device, computer equipment and readable storage medium
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
CN110807463B (en) Image segmentation method and device, computer equipment and storage medium
CN111985340A (en) Face recognition method and device based on neural network model and computer equipment
CN114821736A (en) Multi-modal face recognition method, device, equipment and medium based on contrast learning
CN111709415A (en) Target detection method, target detection device, computer equipment and storage medium
CN113449586A (en) Target detection method, target detection device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination