CN112613447A - Key point detection method and device, electronic equipment and storage medium - Google Patents

Key point detection method and device, electronic equipment and storage medium

Info

Publication number
CN112613447A
Authority
CN
China
Legal status
Pending
Application number
CN202011596254.6A
Other languages
Chinese (zh)
Inventor
陈祖凯
李思颖
王权
钱晨
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202011596254.6A
Publication of CN112613447A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Abstract

The disclosure relates to a keypoint detection method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a face image; and processing the face and at least one facial organ in the face image with at least two neural network branches of a target neural network to obtain a face keypoint information set, where the set includes keypoint information of the face and keypoint information of the facial organs. The at least two neural network branches include a first network branch for detecting the face and at least one second network branch for detecting facial organs. A first detection result output by the first network branch contains global information of the face; the face keypoint information and/or the face-image feature information contained in that global information is passed to the second network branches, and each second network branch determines the keypoint information of its corresponding facial organ in combination with the global information of the face.

Description

Key point detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a method and an apparatus for detecting a keypoint, an electronic device, and a storage medium.
Background
Face keypoint detection underpins many face-related applications: it provides position alignment for technologies such as face recognition, and supplies facial semantic information for scenarios such as augmented reality and virtual makeup effects. Detecting face keypoints accurately and efficiently is therefore a pressing problem.
In related methods, after the whole-face keypoints are obtained, they are used to crop each facial organ from the original image one at a time, and each cropped organ is then fed into a separate organ keypoint detection model.
Disclosure of Invention
The present disclosure presents a key point detection scheme.
According to an aspect of the present disclosure, there is provided a keypoint detection method, including:
acquiring a face image; and processing the face and at least one facial organ in the face image with at least two neural network branches of a target neural network to obtain a face keypoint information set, where the face keypoint information set includes keypoint information of the face and keypoint information of the facial organs; the at least two neural network branches include a first network branch for detecting the face and at least one second network branch for detecting facial organs, a first detection result output by the first network branch includes global information of the face, the face keypoint information and/or face-image feature information contained in the global information is passed to each second network branch, and each second network branch determines the keypoint information of its corresponding facial organ in combination with the global information of the face.
In one possible implementation, the second network branch includes a region-of-interest alignment layer and a feature extraction layer; and processing the face and the at least one facial organ in the face image with the at least two neural network branches of the target neural network to obtain the face keypoint information set includes: detecting the face through the first network branch to obtain the first detection result; extracting a detection region of a facial organ through the region-of-interest alignment layer of each second network branch; extracting image features of the detection region corresponding to the at least one facial organ through the feature extraction layer of each second network branch; and obtaining the keypoint information of the at least one facial organ from the image features of the detection region and the global information of the face.
In one possible implementation, obtaining the keypoint information of the at least one facial organ from the image features of the detection region and the keypoint information of the face includes: performing at least one fusion of the image features of the detection region with the corresponding face keypoint information and/or face-image feature information to obtain fused feature information; and obtaining the keypoint information of the facial organ from the fused feature information.
In one possible implementation, the first detection result further includes detection-box information of at least one facial organ; and extracting the detection region of a facial organ through the region-of-interest alignment layer of each second network branch includes: determining the position coordinates of the at least one facial organ in the face image from the detection-box information of the at least one facial organ; and extracting, through the region-of-interest alignment layer of each second network branch and at the precision of those position coordinates, the detection region of the facial organ that matches the detection-box information of the at least one facial organ.
In one possible implementation, the detection region of the facial organ includes: a facial-organ region extracted from the face image and/or feature information of the facial-organ region extracted from the face-image feature information output by the first network branch.
In one possible implementation, the first detection result further includes detection-box information of at least one facial organ; and before extracting the detection region of the facial organ through the region-of-interest alignment layer of each second network branch, the method further includes: performing enhancement processing on the detection-box information of the at least one facial organ, the enhancement processing including a scaling transform and/or a translation transform.
In one possible implementation, the keypoint information of the face contains 68 to 128 face keypoints; and/or the keypoint information of the facial organs contains 40 to 80 mouth keypoints; and/or 16 to 32 left-eye keypoints; and/or 16 to 32 right-eye keypoints; and/or 10 to 20 left-eyebrow keypoints; and/or 10 to 20 right-eyebrow keypoints.
In one possible implementation, the face image carries keypoint annotations, and the method further includes: determining an error loss of the target neural network from the keypoint annotations and the face keypoint information set; and jointly updating the parameters of the at least two neural network branches of the target neural network according to the error loss.
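A minimal joint-training sketch of this step in PyTorch, assuming a target_net that returns the whole-face keypoints and per-organ keypoints as a dict and annotations stored under the same keys; the module interface, tensor layout, and the L2 loss are illustrative assumptions, not requirements of the disclosure.
```python
import torch
import torch.nn.functional as F

def joint_training_step(target_net, optimizer, face_images, annotations):
    """One joint update of all branches. Assumes target_net(face_images)
    returns a dict of predicted keypoints per branch, e.g.
    {"face": (B, 106, 2), "mouth": (B, 64, 2), ...}, and `annotations`
    holds ground-truth tensors under the same keys (illustrative layout)."""
    predictions = target_net(face_images)
    # Error loss over the whole face keypoint information set:
    # a sum of per-branch regression losses (L2 here as one plausible choice).
    loss = sum(F.mse_loss(predictions[name], annotations[name])
               for name in predictions)
    optimizer.zero_grad()
    loss.backward()   # gradients flow into the first branch and all second branches
    optimizer.step()  # parameters of all branches are updated jointly
    return loss.item()
```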
According to an aspect of the present disclosure, there is provided a keypoint detection apparatus, comprising:
an image acquisition module configured to acquire a face image; and a keypoint detection module configured to process the face and at least one facial organ in the face image with at least two neural network branches of a target neural network to obtain a face keypoint information set, where the face keypoint information set includes keypoint information of the face and keypoint information of the facial organs; the at least two neural network branches include a first network branch for detecting the face and at least one second network branch for detecting facial organs, a first detection result output by the first network branch includes global information of the face, the face keypoint information and/or face-image feature information contained in the global information is passed to each second network branch, and each second network branch determines the keypoint information of its corresponding facial organ in combination with the global information of the face.
In one possible implementation, the second network branch includes a region-of-interest alignment layer and a feature extraction layer, and the keypoint detection module is configured to: detect the face through the first network branch to obtain the first detection result; extract a detection region of a facial organ through the region-of-interest alignment layer of each second network branch; extract image features of the detection region corresponding to the at least one facial organ through the feature extraction layer of each second network branch; and obtain the keypoint information of the at least one facial organ from the image features of the detection region and the global information of the face.
In one possible implementation, the keypoint detection module is further configured to: perform at least one fusion of the image features of the detection region with the corresponding face keypoint information and/or face-image feature information to obtain fused feature information; and obtain the keypoint information of the facial organ from the fused feature information.
In one possible implementation, the first detection result further includes detection-box information of at least one facial organ, and the keypoint detection module is further configured to: determine the position coordinates of the at least one facial organ in the face image from the detection-box information of the at least one facial organ; and extract, through the region-of-interest alignment layer of each second network branch and at the precision of those position coordinates, the detection region of the facial organ that matches the detection-box information of the at least one facial organ.
In one possible implementation, the detection region of the facial organ includes: a facial-organ region extracted from the face image and/or feature information of the facial-organ region extracted from the face-image feature information output by the first network branch.
In one possible implementation, the first detection result further includes detection-box information of at least one facial organ, and the keypoint detection module is further configured to: perform enhancement processing on the detection-box information of the at least one facial organ, the enhancement processing including a scaling transform and/or a translation transform.
In one possible implementation, the keypoint information of the face contains 68 to 128 face keypoints; and/or the keypoint information of the facial organs contains 40 to 80 mouth keypoints; and/or 16 to 32 left-eye keypoints; and/or 16 to 32 right-eye keypoints; and/or 10 to 20 left-eyebrow keypoints; and/or 10 to 20 right-eyebrow keypoints.
In one possible implementation, the face image carries keypoint annotations, and the apparatus is further configured to: determine an error loss of the target neural network from the keypoint annotations and the face keypoint information set; and jointly update the parameters of the at least two neural network branches of the target neural network according to the error loss.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the keypoint detection method described above.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described keypoint detection method.
In the embodiments of the disclosure, a face image is acquired; a target neural network comprising a first network branch for detecting the face and at least one second network branch for detecting facial organs detects the face in the face image to obtain a first detection result containing global information of the face; and the face keypoint information and/or face-image feature information in that global information is passed to each second network branch, which detects the at least one facial organ in combination with the global information, yielding a face keypoint information set that contains both the face keypoints and the facial-organ keypoints. Through this process, the face keypoints and the facial-organ keypoints are obtained simultaneously with a single target neural network. On one hand, end-to-end face keypoint recognition is realized, and the speed and efficiency of keypoint recognition are improved; on the other hand, when the face image is partially occluded, each second network branch can still determine the facial-organ keypoint information using the face keypoint information and/or face-image feature information output by the first network branch, which reduces the influence of occlusion on facial-organ keypoint detection and improves the detection precision of the obtained facial-organ keypoints.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure.
Fig. 2 shows a flow chart of a keypoint detection method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a keypoint detection method according to an application example of the present disclosure.
Fig. 4 shows a schematic diagram of a keypoint detection method according to an application example of the present disclosure.
Fig. 5 shows a flow chart of a keypoint detection method according to an embodiment of the present disclosure.
Fig. 6 shows a block diagram of a keypoint detection apparatus according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a keypoint detection method according to an application example of the present disclosure.
FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 9 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure, which may be applied to a keypoint detection apparatus, where the keypoint detection apparatus may be a terminal device, a server, or other processing device. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In one example, the key point detection method can be applied to a cloud server or a local server, the cloud server can be a public cloud server or a private cloud server, and the method can be flexibly selected according to actual conditions.
In some possible implementations, the keypoint detection method may also be implemented by a processor calling computer-readable instructions stored in a memory.
As shown in fig. 1, in one possible implementation manner, the method for detecting a key point may include:
in step S11, a face image is acquired.
In step S12, the face and at least one facial organ in the face image are processed with at least two neural network branches of a target neural network to obtain a face keypoint information set, where the face keypoint information set includes keypoint information of the face and keypoint information of the facial organs.
The at least two neural network branches include a first network branch for detecting the face and at least one second network branch for detecting facial organs. A first detection result output by the first network branch includes global information of the face; the face keypoint information and/or face-image feature information contained in the global information is passed to the second network branches, and each second network branch determines the keypoint information of its corresponding facial organ in combination with the global information of the face.
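As one illustrative reading of this two-branch structure, the sketch below wires a first network branch and several second (organ) branches together in PyTorch; the module interfaces, the dict-of-boxes output of the first branch, and the choice of organs are assumptions made for the example, not features stated by the disclosure.
```python
import torch
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Illustrative skeleton: one face branch plus per-organ branches."""
    def __init__(self, face_branch: nn.Module, organ_branches: dict):
        super().__init__()
        self.face_branch = face_branch                        # first network branch
        self.organ_branches = nn.ModuleDict(organ_branches)   # second network branches

    def forward(self, face_image: torch.Tensor) -> dict:
        # First branch: global information of the face (face keypoints,
        # per-organ detection boxes keyed by organ name, face-image features).
        face_kpts, organ_boxes, face_feats = self.face_branch(face_image)
        results = {"face": face_kpts}
        # Each second branch works independently from the shared global information;
        # face_feats could also be passed here for feature-level fusion.
        for organ, branch in self.organ_branches.items():
            results[organ] = branch(face_image, organ_boxes[organ], face_kpts)
        return results
```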
The face image may be an image frame containing a face; its concrete form may be determined flexibly according to the actual situation and is not limited in the embodiments of the disclosure. The number of faces in the face image is likewise not limited: it may contain the face of a single subject or the faces of several subjects at once. When the face image contains multiple faces, the keypoint detection method provided in the embodiments of the disclosure can obtain the face keypoint information and/or facial-organ keypoint information for each face separately. The following embodiments are described using a face image containing the face of a single subject; the case of multiple faces can be handled by extending these embodiments accordingly and is not repeated.
The way the face image is acquired is not limited in the embodiments of the disclosure. For example, it may be read from a database storing face images, captured in some scene, or obtained by splitting or sampling a video containing a face; how it is acquired can be decided flexibly according to the actual situation.
In one possible implementation, the face image may be obtained from a face image frame sequence, which may be a frame sequence containing multiple face images; its form is not limited in the embodiments of the disclosure. Nor is the way the face image is selected from the sequence limited: it may be obtained by random sampling from the sequence, or selected from the sequence according to a preset requirement.
The number of face images acquired in step S11 is not limited in the embodiments of the disclosure and may be chosen flexibly according to the actual situation. The following embodiments are described using a single face image; the processing of multiple face images can be extended from these embodiments and is not described in detail.
The target neural network may be a neural network for processing the face image, and its concrete form may be determined flexibly according to the actual situation. As described in the above embodiments, in one possible implementation the target neural network may include at least two neural network branches; the number of branches, the form of each branch, and the connections between them may all be determined flexibly according to the actual situation.
In one possible implementation, the at least two neural network branches may include a first network branch for detecting the face and at least one second network branch for detecting facial organs.
The first network branch may be a network structure that detects the face in the face image to obtain the face keypoint information; its implementation may be determined flexibly according to the actual situation and is detailed in the embodiments below.
The first detection result may be the result of the first network branch detecting the face in the face image. In one possible implementation, the first detection result may include global information of the face, which reflects the overall condition of the face; the content of this global information may be determined flexibly according to the actual situation. In one possible implementation, the global information of the face may include the face keypoint information and/or feature information of the face image, where the feature information of the face image reflects the overall features of the face in the image; its concrete form is detailed in the embodiments below.
The second network branch may be a network structure that detects individual facial organs in the face image to obtain the facial-organ keypoint information; its implementation may likewise be determined flexibly according to the actual situation and is detailed in the embodiments below.
The second detection result may be the result of a second network branch detecting at least one facial organ in the face image on the basis of the first detection result. In one possible implementation, the second detection result may include the keypoint information of a facial organ. Because the first detection result reflects the overall condition of the face and the facial-organ keypoint information in the second detection result is determined on that basis, the facial-organ keypoints and the face keypoints in the first detection result are more consistent with one another, which makes the face keypoint information set more accurate. How a second network branch determines the facial-organ keypoint information, and the detection procedure it uses, may be determined flexibly according to the actual situation.
In one possible implementation, the facial-organ keypoint information of different organs of the face may be detected through multiple second network branches of the target neural network; for example, the mouth keypoints may be detected through the second network branch for the mouth, the eye keypoints through the second network branch for the eyes, and so on. Accordingly, the number of second network branches in the target neural network, and the organ each one detects, may be determined flexibly according to the facial-organ keypoint information actually required. When the target neural network contains multiple second network branches, these branches are independent of one another: the order and process in which each branch performs its detection are not affected by the detection processes or data of the other branches. In some possible implementations the structures of the second network branches may be identical or different, and the detection processes in the branches may also be identical or different; neither is limited in the embodiments of the disclosure.
As described above, in one possible implementation the global information in the first detection result output by the first network branch may be passed to each second network branch, so that each second network branch can determine the keypoint information of its corresponding facial organ in combination with the global information of the face. The position at which the global information is injected into each second network branch is not limited in the embodiments of the disclosure and may be determined flexibly according to the actual structure of each branch; how a second network branch determines the facial-organ keypoint information from the global information is detailed in the embodiments below.
As can be seen from the above, the face and at least one facial organ in the face image can be processed with the at least two neural network branches of the target neural network to obtain a face keypoint information set. The face keypoint information set may be the information about the face keypoints contained in the face image; its content may be determined flexibly according to the actual situation. In one possible implementation, the face keypoint information set may include keypoint information of the face and keypoint information of the facial organs.
The keypoint information of the face may include keypoints that locate each part or organ of the face and thus describe the face as a whole, for example keypoints on organs such as the eyes, mouth, nose, or eyebrows, and keypoints on parts such as the cheeks, forehead, or chin.
The number of face keypoints, and which keypoints of the face they cover, may be set flexibly according to the actual situation and are not limited in the embodiments of the disclosure. In one possible implementation, the number of face keypoints may lie in the interval 68 to 128, with the exact number chosen according to the actual situation. In one example, the number of face keypoints may be set to 106 (denoted Face 106); in this case the corresponding branch of the target neural network detects face keypoints on the input face image and outputs 106 face keypoints.
The keypoint information of the facial organs may include keypoints belonging to individual parts or organs of the face, used to describe the local condition of those parts or organs, for example keypoints on the eyes, mouth, nose, or eyebrows. The facial-organ keypoint information may describe the same parts or organs as the face keypoint information mentioned above; however, since the face keypoints describe the face as a whole while the facial-organ keypoints describe a local part or organ, the facial-organ keypoints for a given part or organ may be more numerous and more densely distributed than the face keypoints describing that same part or organ.
In one possible implementation, the number of facial-organ keypoints, and which organs or parts of the face they cover, may be set flexibly according to the actual situation and are not limited in the embodiments of the disclosure. In one possible implementation, the number of mouth keypoints may lie in the interval 40 to 80, the numbers of left-eye and right-eye keypoints each in the interval 16 to 32, and the numbers of left-eyebrow and right-eyebrow keypoints each in the interval 10 to 20; the exact number for each organ may be chosen according to the actual situation. In one example, the number of mouth keypoints may be 64 (denoted Mouth 64), the numbers of left-eye and right-eye keypoints 24 each (denoted Eye 24), and the numbers of left-eyebrow and right-eyebrow keypoints 13 each (denoted Eyebrow 13).
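A compact way to record the example configuration above, purely as an illustration of the Face 106 / Mouth 64 / Eye 24 / Eyebrow 13 counts; the dictionary layout is an assumption, not part of the disclosure.
```python
# Illustrative keypoint-count configuration matching the example above.
KEYPOINT_COUNTS = {
    "face": 106,         # whole-face keypoints (allowed range 68-128)
    "mouth": 64,         # mouth keypoints (allowed range 40-80)
    "left_eye": 24,      # left-eye keypoints (allowed range 16-32)
    "right_eye": 24,     # right-eye keypoints (allowed range 16-32)
    "left_eyebrow": 13,  # left-eyebrow keypoints (allowed range 10-20)
    "right_eyebrow": 13, # right-eyebrow keypoints (allowed range 10-20)
}
```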
In the embodiments of the disclosure, a face image is acquired; a target neural network comprising a first network branch for detecting the face and at least one second network branch for detecting facial organs detects the face in the face image to obtain a first detection result containing global information of the face; and the face keypoint information and/or face-image feature information in that global information is passed to each second network branch, which detects the at least one facial organ in combination with the global information, yielding a face keypoint information set that contains both the face keypoints and the facial-organ keypoints. Through this process, the face keypoints and the facial-organ keypoints are obtained simultaneously with a single target neural network. On one hand, end-to-end face keypoint recognition is realized, and the speed and efficiency of keypoint recognition are improved; on the other hand, when the face image is partially occluded, each second network branch can still determine the facial-organ keypoint information using the face keypoint information and/or face-image feature information output by the first network branch, which reduces the influence of occlusion on facial-organ keypoint detection and improves the detection precision of the obtained facial-organ keypoints.
In one possible implementation, the second network branch may include a region-of-interest alignment layer and a feature extraction layer.
Fig. 2 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure. As shown in the figure, in one possible implementation, step S12 may include:
Step S121, detecting the face through the first network branch to obtain a first detection result;
Step S122, extracting a detection region of a facial organ through the region-of-interest alignment layer of each second network branch;
Step S123, extracting image features of the detection region corresponding to the at least one facial organ through the feature extraction layer of each second network branch;
Step S124, obtaining the keypoint information of the at least one facial organ from the image features of the detection region and the global information of the face.
The region-of-interest alignment layer may be a network layer with a region extraction function; its implementation is not limited in the embodiments of the disclosure and is detailed in the embodiments below.
The detection region of a facial organ may be the region in which that organ is located, and its form may be determined flexibly according to the actual situation. In some possible implementations, the detection region of a facial organ may include: a facial-organ region extracted from the face image, and/or feature information of the facial-organ region extracted from the face-image feature information output by the first network branch.
The facial-organ region may be the region of the face image that contains the facial organ, and it can be extracted from the face image. The feature information of the facial-organ region may be the part of the face-image feature information that relates to the region of that organ, and it can be extracted from the face-image feature information. How the detection region of a facial organ is extracted through the region-of-interest alignment layer is described in the embodiments below.
The feature extraction layer may be a network layer with a feature extraction function; its implementation is likewise not limited in the embodiments of the disclosure. In one possible implementation, the feature extraction layer may be a single network layer with a feature extraction function; in another, it may be implemented jointly by several network layers, each of which may or may not have a feature extraction function on its own, as long as their combination performs feature extraction.
The feature extraction layer extracts image features from the detection region of the facial organ obtained by the region-of-interest alignment layer, yielding the image features of the detection region. The image features of the detection region may be obtained by a single feature extraction pass over the detection region, or by multiple extractions at different depths; how this is done may be chosen flexibly according to the actual situation and is not limited in the embodiments of the disclosure.
As can be seen from step S124, in one possible implementation the keypoint information of the at least one facial organ may be obtained from the extracted image features of the detection region together with the global information of the face. How the image features of the detection region and the global information of the face are processed to obtain the facial-organ keypoint information is detailed in the embodiments below.
Through steps S121 to S124, the face keypoint information and the facial-organ keypoint information can be obtained simultaneously with the target neural network. On one hand, end-to-end face keypoint recognition is realized, and the speed and efficiency of keypoint recognition are improved; on the other hand, the facial-organ keypoint information is determined on the basis of the global information in the first detection result, so the global information reflecting the whole face is used to localize the facial-organ keypoints. The resulting face keypoint information set is consistent between the whole face and the local organs, has high precision, and effectively improves the accuracy of the recognized keypoints.
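A minimal sketch of one second network branch following steps S121 to S124, assuming torchvision's roi_align as the region-of-interest alignment operation, a small convolutional stack as the feature extraction layer, and fusion by concatenation with a flattened 106-point face keypoint vector; all shapes and layer sizes are illustrative assumptions.
```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class OrganBranch(nn.Module):
    """Illustrative second network branch for one facial organ."""
    def __init__(self, num_keypoints: int, in_channels: int = 3, roi_size: int = 32):
        super().__init__()
        self.roi_size = roi_size
        # Feature extraction layer (S123): a small convolutional stack.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Regression head on the fused features (S124); 106 * 2 is the assumed
        # size of the flattened whole-face keypoint vector.
        self.head = nn.Linear(64 + 106 * 2, num_keypoints * 2)

    def forward(self, face_image, organ_box, face_keypoints):
        # S122: extract the detection region at floating-point precision.
        boxes = [organ_box]  # list of per-image (1, 4) boxes; single-image batch assumed
        region = roi_align(face_image, boxes,
                           output_size=self.roi_size, aligned=True)
        # S123: image features of the detection region.
        region_feat = self.features(region)
        # S124: fuse with the global face keypoints, then regress organ keypoints.
        fused = torch.cat([region_feat, face_keypoints.flatten(1)], dim=1)
        return self.head(fused).view(-1, self.head.out_features // 2, 2)
```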
In one possible implementation, step S124 may include:
performing at least one fusion of the image features of the detection region with the corresponding face keypoint information and/or face-image feature information to obtain fused feature information;
and obtaining the keypoint information of the facial organ from the fused feature information.
As described in the above embodiments, the image features of the detection region may be obtained by the feature extraction layer of the second network branch extracting image features from the detection region; their form has been described above and is not repeated here.
Since the detection region is obtained by extraction through the region-of-interest alignment layer, the extraction precision may affect the precision of the image features of the detection region, and thereby the precision of the facial-organ keypoint information determined from them. Therefore, in one possible implementation, to improve the precision of the facial-organ keypoint information, the face-image feature information and/or face keypoint information in the global information obtained by the first network branch may be fused with the features of the facial-organ region to obtain fused feature information. Because the face-image feature information and the face keypoint information are obtained from the complete face image, they describe the whole face, so the facial-organ keypoint information regressed from the fused feature information is positionally consistent with the obtained face keypoint information and therefore more precise.
In the feature fusion, which objects are fused is not limited in the embodiments of the disclosure.
Fig. 3 shows a schematic diagram of an application example of the present disclosure. As shown in the figure, in one example, after feature extraction by the feature extraction modules of the respective second branches (such as a mouth feature extraction module, a left-eye feature extraction module, and a right-eyebrow feature extraction module), the image features of the detection regions of the respective facial organs are obtained. These detection-region features may be fused with the 106 whole-face keypoints (i.e., the face keypoint information) output by the first branch, for example by concatenation or in another form, to obtain fused feature information (not shown in the figure). From the fused feature information, further computation yields the facial-organ keypoints output by each second branch (for example, 64 mouth keypoints, 24 left-eye keypoints, 13 right-eyebrow keypoints, and so on).
Fig. 4 shows a schematic diagram of an application example of the present disclosure. As shown in the figure, in one example, after feature extraction by the feature extraction modules of the respective second branches (such as a mouth feature extraction module, a left-eye feature extraction module, and a right-eyebrow feature extraction module), the image features of the detection regions of the respective facial organs are obtained. These detection-region features may be fused with the face-image feature information extracted by the shallow feature extraction module of the first branch, for example by addition or in another form, to obtain fused feature information (not shown in the figure). From the fused feature information, further computation yields the facial-organ keypoints output by each second branch (for example, 64 mouth keypoints, 24 left-eye keypoints, 13 right-eyebrow keypoints, and so on).
In some possible implementations, both the 106 whole-face keypoints of Fig. 3 (i.e., the face keypoint information) and the face-image feature information of Fig. 4 may be fused with the image features of the detection regions of the respective facial organs.
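The two fusion styles of Figs. 3 and 4 can be sketched as below; concatenation with the flattened 106 face keypoints and element-wise addition with the shallow face-image features are the examples named in the text, while the tensor shapes and the projection module are assumptions.
```python
import torch
import torch.nn as nn

def fuse_with_face_keypoints(region_feat: torch.Tensor,
                             face_keypoints: torch.Tensor) -> torch.Tensor:
    """Fig. 3 style: concatenate region features with the 106 whole-face
    keypoints, e.g. region_feat of shape (B, C) and face_keypoints (B, 106, 2)."""
    return torch.cat([region_feat, face_keypoints.flatten(1)], dim=1)

def fuse_with_shallow_features(region_feat: torch.Tensor,
                               shallow_feat: torch.Tensor,
                               proj: nn.Module) -> torch.Tensor:
    """Fig. 4 style: element-wise addition with the shallow face-image features;
    `proj` (an assumed helper) maps the shallow features to the same shape as
    region_feat before the addition."""
    return region_feat + proj(shallow_feat)
```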
In one possible implementation, the face-image feature information may include features extracted by one or more network layers of the first network branch. In that case, when the image features of the detection region are fused with the face-image feature information, they may be fused with the features extracted by any network layer of the first network branch; which layer's features are chosen, and whether they are deep or shallow features, may be determined flexibly according to the actual situation and is not limited in the embodiments of the disclosure. Likewise, how many fusions are performed when obtaining the fused feature information may be determined flexibly according to the objects being fused and is not limited in the embodiments of the disclosure.
The fusion manner may also vary with the objects being fused, as detailed in the embodiments below.
After the fused feature information is obtained, it may be processed within the second network branch to obtain the output facial-organ keypoint information. How the second network branch processes the fused feature information is not limited in the embodiments of the disclosure and may be chosen flexibly according to the actual situation. In one possible implementation, the fused feature information may be processed by a regression layer, a classification layer, or another network layer to obtain the output facial-organ keypoint information; in another, it may be processed by a network structure composed of several network layers.
At least one fusion of the image features of the detection region with the face-image feature information and/or face keypoint information yields fused feature information, from which the facial-organ keypoint information is obtained. Through this process, the face-image feature information and/or face keypoint information obtained by the first network branch, which reflect the whole face, are applied to the detection of the facial-organ keypoints, so the obtained facial-organ keypoint information is consistent with the face keypoint results and reaches higher precision.
In some possible implementations, the first detection result may further include other information, for example detection-box information of at least one facial organ. The detection-box information may be detection boxes indicating the positions of the organs in the face image; which organ positions are expressed may be set flexibly according to the actual situation and is not limited in the embodiments of the disclosure. The shape and form of the detection boxes may be determined flexibly, and the boxes of different organs may be the same or different. In one possible implementation, the detection boxes of the organs may all be rectangular, and the position of each box may be expressed by the coordinates of its vertices. In one possible implementation, the first network branch of the target neural network may output the detection-box information of at least one facial organ at the same time as it outputs the face keypoints.
Accordingly, in one possible implementation, step S121 may include:
obtaining the face keypoint information and the detection-box information of at least one facial organ from the output of the first network branch.
In some embodiments, step S121 may further include: obtaining the face-image feature information from the intermediate features output by at least one network layer of the first network branch.
As described in the above embodiments, the implementation of the first network branch may be determined flexibly according to the actual situation. In one possible implementation, the first network branch may be composed of several connected network layers, whose concrete forms and combination may be determined flexibly and are not limited to the embodiments below.
In one possible implementation, the first network branch may include a shallow feature extraction block (block 0) and a deep feature extraction block (block main) connected in sequence, each of which may contain several network layers such as convolutional layers, pooling layers, or classification layers. Through block 0, the first network branch performs a preliminary extraction of the face-image features; the preliminarily extracted features are fed into block main for further extraction, and regression on the further-extracted features produces the output of the first network branch, realizing the detection of the face keypoint information. Therefore, in one example, the preliminary features extracted by block 0 may serve as the face-image feature information; in another example, the deep features extracted by block main may serve as the face-image feature information; in yet another example, both may serve as the face-image feature information. Which features are chosen may be determined flexibly according to the actual situation and is not limited in the embodiments of the disclosure.
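A sketch of the first network branch with the block 0 / block main split described above; the layer sizes and the 106-keypoint and five-organ-box output heads are assumptions made for illustration only.
```python
import torch
import torch.nn as nn

class FaceBranch(nn.Module):
    """Illustrative first network branch: block 0 + block main + output heads."""
    def __init__(self, num_face_kpts: int = 106, num_organ_boxes: int = 5):
        super().__init__()
        self.num_face_kpts = num_face_kpts
        self.num_organ_boxes = num_organ_boxes
        # block 0: shallow feature extraction.
        self.block_0 = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU())
        # block main: deep feature extraction.
        self.block_main = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Regression heads: whole-face keypoints and per-organ detection boxes.
        self.kpt_head = nn.Linear(128, num_face_kpts * 2)
        self.box_head = nn.Linear(128, num_organ_boxes * 4)

    def forward(self, face_image: torch.Tensor):
        shallow = self.block_0(face_image)   # candidate face-image feature information
        deep = self.block_main(shallow)      # deep features, also usable as feature information
        face_kpts = self.kpt_head(deep).view(-1, self.num_face_kpts, 2)
        organ_boxes = self.box_head(deep).view(-1, self.num_organ_boxes, 4)  # x1, y1, x2, y2
        return face_kpts, organ_boxes, shallow
```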
It should be noted that, in the embodiments of the disclosure, the face keypoint information, the detection boxes of at least one organ, and the face-image feature information may be obtained simultaneously, or they may be obtained separately in a certain order.
In one possible implementation, step S122 may include:
determining the position coordinates of at least one face organ in the face image according to the detection frame information of at least one face organ;
and extracting the detection areas of the human face organs respectively matched with the detection frame information of at least one human face organ under the precision of the position coordinates through the interested area calibration layer of each second network branch.
As described in the foregoing embodiments, the detection frame information of the face part may be a detection frame indicating the position of each part in the face image, and therefore, based on the detection frame information, the position coordinates of the face part in the face image may be determined.
After the position coordinates of the face organ in the face image are determined, the face organ region can be extracted through the region-of-interest calibration layer of the second network branch under the precision of the position coordinates.
In an example, the vertex coordinates of the detection frame in the detection frame information may be floating point numbers, and the precision of the position coordinates determined based on the detection frame information may be consistent with the floating point number of the vertex coordinates of the detection frame; in one example, the vertex coordinates of the detection box may be integers, and the precision of the position coordinates may be integers.
As described in the foregoing disclosure, the Region Of Interest calibration layer ROI Align (Region Of Interest Align) may be a network layer having a Region extraction function, and in a possible implementation manner, the Region Of Interest calibration layer may be a network layer that implements the Region extraction function by clipping, and an implementation form Of the Region Of Interest calibration layer is not limited in the disclosure. In a possible implementation manner, in the process of extracting the face image by clipping, the clipping precision of the region-of-interest calibration layer may be consistent with the precision of the position coordinates of the face organ in the face image, and therefore, in a possible implementation manner, in the case that the precision of the position coordinates is a floating point number, the region-of-interest calibration layer may be a network layer capable of performing image clipping with the floating point precision, and therefore, any network layer having the function may be used as an implementation form of the region-of-interest calibration layer.
In summary, the position coordinates of the face organ in the face image are determined according to the detection frame information, and the face organ region matched with the detection frame information of the face organ is extracted, under the precision of the position coordinates, through the region-of-interest calibration layer of the second network branch.
In some possible implementation manners, in a case that the detection region of the face organ includes feature information of the face organ region, the manner of extracting the feature information of the face organ region from the feature information of the face image through the region-of-interest calibration layer may be implemented by referring to the manner of extracting the face organ region from the face image, and details are not repeated here.
In a possible implementation manner, the method for detecting a key point provided in the embodiment of the present disclosure may further include:
carrying out post-processing on the key point information of the face and/or the key point information of the face organ, wherein the post-processing comprises at least one of the following processing operations: key point number adjustment, key point numbering, and key point sequence adjustment.
The post-processing object can be flexibly determined according to actual conditions, and can only perform post-processing on key point information of the face, or can only perform post-processing on key point information of face organs, or perform post-processing on the key point information of the face and the key point information of the face organs together.
The specific operation content of the post-processing may be processing operations implemented according to the key point detection requirements. In a possible implementation manner, the post-processing may include some additional operations that do not change the key point positions, such as key point number adjustment, key point numbering, key point sequence adjustment, and the like. The key point number adjustment may be to perform corresponding addition or deletion operations on the obtained key points; for example, some of the obtained face key points or face organ key points may be deleted, or the positions of other key points in the face image may be obtained by calculation based on the positions of the obtained face key points or face organ key points. The key point numbering may be to number the obtained key points according to a certain preset sequence; the numbers may be continuous or discontinuous, and the quantity of numbers may be flexibly set according to the actual situation. The key point sequence adjustment may be to adjust the order of the numbered key points, and how to adjust may be flexibly determined according to the actual situation, which is not limited in the embodiments of the present disclosure.
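For illustration only, a minimal sketch of such post-processing is given below; the deletion indices, the added midpoints, and the reordering are hypothetical examples of the operations described above.

```python
# Illustrative sketch only; drop, extra_pairs and order are hypothetical inputs.
import numpy as np

def post_process(keypoints, drop=(), extra_pairs=(), order=None):
    """keypoints: (N, 2) array of (x, y) key point positions."""
    kpts = [p for i, p in enumerate(keypoints) if i not in set(drop)]  # delete points
    for i, j in extra_pairs:                                           # add new points
        kpts.append((keypoints[i] + keypoints[j]) / 2.0)               # e.g. midpoints
    kpts = np.asarray(kpts)
    if order is not None:                                              # sequence adjustment
        kpts = kpts[list(order)]
    numbering = list(range(len(kpts)))                                 # renumber 0..N-1
    return kpts, numbering
```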
By carrying out post-processing on the key point information of the face and/or the key point information of the face organ, the obtained key points can be correspondingly processed according to the actual requirement of key point detection, so that the application range of the identified key points is widened.
In a possible implementation manner, the method for detecting a key point provided in the embodiment of the present disclosure may further include:
and converting the positions of the key point information of the human face organ in the detection area of the human face organ to obtain the positions of the key point information of the human face organ in the human face image.
In a possible implementation manner, since the key point information of the face organ may be obtained by extraction based on the detection area of the face organ, the obtained positions of the face organ key points may be positions in the detection area of the face organ; and since the face key points are obtained by processing based on the face image, the positions of the face key points may be positions in the face image.
Therefore, in a possible implementation manner, the positions of the key point information of the face organ in the detection area of the face organ may be converted to obtain the positions of the key point information of the face organ in the face image. The conversion manner is not limited in the embodiments of the present disclosure. In a possible implementation manner, a position conversion relationship between the face image and the detection region of the face organ may be determined according to the vertex or center point coordinates of the face image and of the detection region of the face organ, and based on the position conversion relationship, position conversion is performed on the positions of the face organ key points in the detection region of the face organ, so as to obtain the positions of the face organ key points in the face image.
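As a non-limiting illustration, the following sketch converts organ key points from the detection-region coordinate system back into the face-image coordinate system, assuming the region was cropped from a box (x1, y1, x2, y2) of the face image and resampled to a fixed crop size.

```python
# Illustrative sketch only: map key points from region coordinates to image coordinates.
def region_to_image(organ_kpts, box, crop_size):
    """organ_kpts: (x, y) pairs in the cropped region; box: (x1, y1, x2, y2) in the image."""
    x1, y1, x2, y2 = box
    crop_w, crop_h = crop_size
    sx, sy = (x2 - x1) / crop_w, (y2 - y1) / crop_h   # scale of the position conversion
    return [(x1 + x * sx, y1 + y * sy) for (x, y) in organ_kpts]
```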
The positions of the key point information of the face organ in the detection area of the face organ are converted to obtain the positions of the key point information of the face organ in the face image, so that the positions of the key point information of the face and the positions of the key point information of the face organ can be unified, and the subsequent analysis and operation processing of each key point in the face are facilitated.
In some possible implementation manners, the position of the key point information of the face in the face image may also be converted into the detection region of the face organ, or the position of the key point information of a certain face organ in the detection region of the corresponding face organ is converted into the detection regions of other face organs, and so on, and the key point information of the face and the key point information of each face organ may also be converted into a certain preset image coordinate system, and so on, and how to convert may be flexibly selected according to the actual situation, and is not limited to the above disclosed embodiments.
In a possible implementation manner, the face image in the embodiment of the present disclosure may also be a face image including a key point annotation. As described in the foregoing disclosure embodiments, the keypoint detection method provided in the disclosure embodiments may be implemented based on a target neural network, and therefore, in a possible implementation manner, the method provided in the disclosure embodiments may also be used to train the target neural network based on a face image including keypoint labels.
Under the condition that the face image comprises key point labels, in order to achieve training, the key point labels can comprise key point information labels of the face and/or key point information labels of face organs, wherein the key point information labels of the face can be labels carried out on actual positions of the key point information of the face in the face image, the key point information labels of the face organs can be labels carried out on actual positions of the key point information of the face organs in the face image, and the labeling mode is not limited in the embodiment of the disclosure; in one example, the key point information of the human face and the key point information of the human face organ in the human face image may be automatically labeled by a machine.
In some possible implementations, in a case where the target neural network further includes at least one third network branch, the face image may further include an annotation of a face state corresponding to the third network branch, for example, in a case where the target neural network further includes a third network branch for detecting an eye opening and closing state in the face, the annotation of the eye opening and closing state may be performed on the face image according to a real opening and closing state of eyes in the face image.
Fig. 5 is a flowchart illustrating a method for detecting keypoints according to an embodiment of the present disclosure, and as shown in the drawing, in a possible implementation manner, in a case that a face image includes a keypoint label, the method for detecting keypoints proposed in the embodiment of the present disclosure may include:
in step S11, a face image is acquired.
Step S12, processing the face and at least one face organ in the face image by using at least two neural network branches included by the target neural network to obtain a face key point information set, wherein the face key point information set includes key point information of the face and key point information of the face organ.
And step S13, determining the error loss of the target neural network according to the key point labels and the face key point information set.
And step S14, jointly updating the parameters of at least two neural network branches in the target neural network according to the error loss.
The implementation forms of steps S11 to S12 may refer to the above disclosed embodiments, and are not described herein again.
In a possible implementation manner, in the case that the face image includes the key point annotation, the implementation form of step S12 may refer to the above disclosed embodiments. In the case that step S12 is implemented through steps S121 to S124, in a possible implementation manner, step S12 may further include, before step S122:
and performing enhancement processing on the detection frame information of at least one human face organ, wherein the enhancement processing comprises the following steps: a scaling transform process and/or a translation transform process.
The detection frame information may be detection frame information of the face organ output by the first network branch, and is not described herein again. In a possible implementation manner, in order to enhance the richness of data in the training process, enhancement processing may be performed on the detection frame information, such as the scaling transformation processing and/or the translation transformation processing mentioned in the above disclosed embodiments.
The scaling transformation processing may be to expand or compress the detection frame in the obtained detection frame information. In a possible implementation manner, the detection frame may be randomly scaled within a preset scaling range, and the value of the preset scaling range may be flexibly set according to the actual situation and is not limited to the following example. In one example, the preset scaling range may be between 0.9 and 1.1 times the size of the detection frame.
The translation transformation processing may be to move the entire position of the detection frame in the obtained detection frame information, and in a possible implementation manner, the detection frame may be randomly translated within a preset translation range, and the preset translation range may also be flexibly set according to an actual situation, in an example, the preset translation range may be ± 0.05 times of the length of the detection frame in the translation direction, where "+" and "-" in ± represent the translation direction and the opposite direction of the translation direction, respectively.
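For illustration only, a minimal sketch of such enhancement processing is given below, using the example ranges above (random scaling in [0.9, 1.1] and random translation within ±0.05 of the corresponding side length).

```python
# Illustrative sketch only; scale_range and shift_ratio follow the example values above.
import random

def augment_box(box, scale_range=(0.9, 1.1), shift_ratio=0.05):
    x1, y1, x2, y2 = box
    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    s = random.uniform(*scale_range)                     # scaling transformation
    w, h = w * s, h * s
    cx += random.uniform(-shift_ratio, shift_ratio) * w  # translation transformation
    cy += random.uniform(-shift_ratio, shift_ratio) * h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```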
After the enhanced detection frame information is obtained, in step S122, the at least one face organ may be detected by using the first detection result including the enhanced detection frame information and the at least one second network branch, so as to obtain a second detection result; the specific detection manner may refer to the above-mentioned embodiments and is not described herein again.
By enhancing the information of the detection frame of at least one face organ, the richness of training data for training the target neural network can be increased, so that the trained target neural network can obtain a better key point detection effect under different input data, the processing precision and robustness of the target neural network are improved, and the accuracy of key point detection is improved.
After the face key point information set is obtained, in step S13, the error between the predicted key points and the labeled key points is determined according to the actual positions of the key point information of the face and the key point information of the face organ labeled in the face image, so as to determine the error loss of the target neural network. Then, in step S14, the parameters in the first network branch and the second network branch are jointly updated according to the error loss.
In step S13, the specific process of determining the error loss can be flexibly determined according to the actual situation, and is described in detail in the following disclosed embodiments, so it is not expanded here. After the error loss is determined, each parameter in the target neural network can be updated by back-propagation according to the error loss. As can be seen from the above-mentioned disclosed embodiments, the target neural network in the embodiments of the present disclosure can include the first network branch and at least one second network branch. Therefore, in a possible implementation manner, in the process of updating the parameters of the target neural network, the parameter updating of the first network branch and the parameter updating of the second network branches can be performed simultaneously; that is, the parameters in the first network branch and the second network branches can be jointly optimized according to the outputs of both kinds of branches, so that the target neural network obtained after training can achieve a globally optimal effect in detecting the key point information of the face and detecting the key point information of the face organs.
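For illustration only, the following sketch shows one such joint update step; target_net and keypoint_loss are assumed placeholder names for the whole target neural network and for the error-loss function, not names used in this disclosure.

```python
# Illustrative sketch only: one joint training step over all branches.
import torch

def train_step(target_net, keypoint_loss, optimizer, face_image, kpt_labels):
    pred = target_net(face_image)            # face + organ key point predictions
    loss = keypoint_loss(pred, kpt_labels)   # error loss of the target neural network
    optimizer.zero_grad()
    loss.backward()                          # gradients reach every branch jointly
    optimizer.step()                         # joint parameter update
    return loss.item()

# usage sketch: an optimizer over ALL branch parameters gives joint optimization
# optimizer = torch.optim.Adam(target_net.parameters(), lr=1e-4)
```

Because the optimizer is built over the parameters of every branch, a single backward pass updates the first network branch and all second network branches together.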
In some possible implementations, in a case where the target neural network further includes at least one third network branch, the at least one third network branch may be trained together with the first network branch and the second network branch, i.e., the parameters of the third network branch may be updated together with the first network branch and the second network branch; in some possible implementations, the third network branch may also be trained separately, that is, the parameters of the first network branch and the second network branch may be fixed in the process of updating the parameters of the third network branch.
By determining the error loss of the target neural network according to the face image and jointly updating the parameters in the first network branch and the at least one second network branch according to the error loss, the first network branch and the at least one second network branch can be jointly trained through the process, so that the key point information detection result of the face obtained by the trained target neural network and the key point information detection result of the face organ have consistency and higher detection accuracy.
As described in the above-mentioned disclosed embodiments, the implementation manner of step S13 can be flexibly determined according to practical situations. In one possible implementation, step S13 may include at least one of the following processes:
determining the error loss of a target neural network according to a first error between the key point information of the face and the key point information label of the face;
determining the error loss of the target neural network according to a second error between the key point information of the face organ and the key point information label of the face organ;
and determining the detection frame position label of at least one face organ in the face image according to the key point information label of the face and/or the key point information labels of the face organs, and determining the error loss of the target neural network according to a third error between the detection frame information of the at least one face organ and the detection frame position label of the at least one face organ.
As described in the foregoing disclosure embodiments, the face image may include a key point information label of the face, and the label may indicate an actual position of the key point information of the face in the training image, and therefore, in a possible implementation manner, an error loss of the target neural network may be determined according to a first error formed between the key point information label of the face and the key point information of the face predicted by the target neural network. The specific error loss calculation mode can be flexibly set according to the actual situation, and is not limited in the embodiment of the disclosure.
Similarly, the error loss of the target neural network can be determined according to a second error formed between the key point information label of the human face organ and the key point information of the human face organ predicted by the target neural network, and the calculation mode can also be flexibly selected according to the actual situation.
In a possible implementation manner, as described in the above disclosed embodiments, the first network branch in the target neural network may further determine the detection frame information of at least one face organ. Based on the key point information labels of the face and the key point information labels of the face organs, each organ in the face may also be located, so that the detection frame position of each organ in the training image can be calculated and used as the detection frame position label in the face image. Therefore, in a possible implementation manner, the error loss of the target neural network may also be determined based on a third error formed between the detection frame information of each organ predicted by the target neural network and the detection frame position label of the corresponding organ. The manner of calculating the detection frame position labels and the manner of determining the error loss of the target neural network based on the third error may both be flexibly selected according to the actual situation, and are not limited in the embodiments of the present disclosure.
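As a non-limiting illustration, one simple way to obtain such a detection frame position label is to take the bounding box of the labeled organ key points; the padding ratio below is an assumed choice.

```python
# Illustrative sketch only: derive a detection frame position label for one organ
# from the key point labels of that organ.
def box_label_from_kpts(organ_kpt_labels, pad=0.1):
    xs = [x for x, _ in organ_kpt_labels]
    ys = [y for _, y in organ_kpt_labels]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - pad * w, min(ys) - pad * h,
            max(xs) + pad * w, max(ys) + pad * h)
```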
In a possible implementation manner, the above manners for determining the error loss of the target neural network may be combined with each other, specifically which manner or manners are selected to determine the error loss of the target neural network together, and may also be flexibly selected according to the actual situation, which is not limited in the embodiment of the present disclosure.
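For illustration only, the following sketch combines the first, second, and third errors into one error loss; the smooth L1 criterion and the loss weights are assumed choices rather than the claimed implementation.

```python
# Illustrative sketch only: weighted combination of the three errors.
import torch.nn.functional as F

def total_loss(pred_face_kpts, face_kpt_labels,
               pred_organ_kpts, organ_kpt_labels,
               pred_organ_boxes, organ_box_labels,
               w1=1.0, w2=1.0, w3=0.5):
    first = F.smooth_l1_loss(pred_face_kpts, face_kpt_labels)      # first error
    second = F.smooth_l1_loss(pred_organ_kpts, organ_kpt_labels)   # second error
    third = F.smooth_l1_loss(pred_organ_boxes, organ_box_labels)   # third error
    return w1 * first + w2 * second + w3 * third
```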
The error loss of the target neural network is determined through the various processes, so that the training process of the target neural network is more flexible and richer, the target neural network obtained through training has a better key point detection effect, and the obtained key point information of the face and the key point information of the face organs have higher consistency.
Fig. 6 shows a block diagram of a keypoint detection apparatus according to an embodiment of the present disclosure. As shown, the key point detecting device 20 may include:
and the image acquisition module 21 is used for acquiring a face image.
The key point detection module 22 is configured to process a face and at least one face organ in a face image by using at least two neural network branches included in a target neural network, so as to obtain a face key point information set, where the face key point information set includes key point information of the face and key point information of the face organ; the at least two neural network branches comprise a first network branch for detecting a human face and at least one second network branch for detecting human face organs, a first detection result output by the first network branch comprises global information of the human face, the key point information of the human face and/or the feature information of a human face image included in the global information of the human face are transmitted to the second network branches, and the second network branches are used for determining the corresponding key point information of the human face organs by combining the global information of the human face.
In one possible implementation, the second network branch includes a region of interest calibration layer and a feature extraction layer; the key point detection module is used for: detecting the face through a first network branch to obtain a first detection result; extracting a detection area of the human face organ through the interested area calibration layer of each second network branch; extracting image features of a detection area corresponding to at least one face organ through the feature extraction layer of each second network branch; and obtaining key point information of at least one face organ by using the image characteristics of the detection area and the global information of the face.
In one possible implementation, the key point detection module is further configured to: performing at least one fusion processing on the image characteristics of the detection area and corresponding key point information of the human face and/or characteristic information of the human face image to obtain fusion characteristic information; and obtaining key point information of the face organ according to the fusion characteristic information.
In a possible implementation manner, the first detection result further includes detection frame information of at least one face organ; the key point detection module is further configured to: determine the position coordinates of the at least one face organ in the face image according to the detection frame information of the at least one face organ; and extract, through the region-of-interest calibration layer of each second network branch, the detection areas of the face organs respectively matched with the detection frame information of the at least one face organ under the precision of the position coordinates.
In one possible implementation, the detection region of the face organ includes: and extracting the characteristic information of the face organ region from the face image and/or extracting the characteristic information of the face organ region from the characteristic information of the face image output by the first network branch.
In a possible implementation manner, the first detection result further includes detection frame information of at least one face organ; the key point detection module is further configured to: perform enhancement processing on the detection frame information of the at least one face organ, wherein the enhancement processing comprises: a scaling transform process and/or a translation transform process.
In one possible implementation, the number of the key points of the face in the key point information of the face includes 68 to 128; and/or the number of the key points of the mouth in the key point information of the human face organ comprises 40-80; and/or the number of the key points of the left eye in the key point information of the face organ comprises 16 to 32; and/or the number of key points of the right eye in the key point information of the face organ comprises 16 to 32; and/or the number of key points of the left eyebrow in the key point information of the face organ comprises 10-20; and/or the number of the key points of the right eyebrow in the key point information of the human face organ comprises 10-20.
In one possible implementation, the face image includes a key point annotation, and the apparatus is further configured to: determining the error loss of the target neural network according to the key point labels and the face key point information set; and jointly updating the parameters of at least two neural network branches in the target neural network according to the error loss.
Application scenario example
The application example of the present disclosure provides a key point detection method, which can perform key point detection on a face image in which the face is occluded.
Fig. 3 and fig. 7 are schematic diagrams illustrating a keypoint detection method according to an application example of the present disclosure, where fig. 3 is a schematic diagram illustrating an application process of keypoint detection, and fig. 7 is a schematic diagram illustrating a training process of keypoint detection, as shown in fig. 3, in an application example of the present disclosure, the keypoint detection method may include the following processes:
as shown in the figure, in the application example of the present disclosure, after the acquired face image with the occlusion condition is input into the target neural network, the face image is processed through the first network branch and the five second network branches in the target neural network, respectively.
As shown in the figure, the first network branch includes a shallow feature extraction module and a main module which are connected in sequence; the shallow feature extraction module, i.e., the shallow feature extraction network structure block 0 described in the above disclosed embodiments, performs preliminary extraction on the feature information of the face image, and the main module, i.e., the deep feature extraction network structure block main described in the above disclosed embodiments, performs further extraction and regression on the feature information of the face image. As can be seen from the figure, after the first network branch processes the face image, the key point information of the 106 face key points (i.e., the 106 whole-face key points in the figure) and the detection frame information of each face organ in the face image can be output respectively.
Further, as shown in the figure, the five second network branches are independent of each other, and perform key point detection on the mouth, the left eye, the right eye, the left eyebrow, and the right eyebrow in the face, respectively. The second network branch of the mouth includes a region-of-interest calibration layer (ROI Align) and a mouth feature extraction module which are connected in sequence; the implementation form of the region-of-interest calibration layer may refer to the above-mentioned embodiments and is not described herein again. As can be seen from the figure, the region-of-interest calibration layer may crop the face image according to the detection frame information of the mouth output by the first network branch, so as to obtain a face organ region of the mouth that matches the preset image size. The mouth feature extraction module may include one or more network layers for feature extraction, and may perform feature extraction on the face organ region of the mouth to obtain the feature information of the face organ region of the mouth. As can be seen from the figure, in an example, the feature information of the face organ region of the mouth may be fused with the key point information of the 106 face key points output by the first network branch to obtain fused feature information, and regression is performed on the fused feature information through the second network branch of the mouth to output the key point information of the face organ of the mouth. As shown in the figure, in the application example of the present disclosure, the second network branch of the mouth extracts the face organ region of the mouth from the face image according to the input detection frame information of the mouth, fuses the feature information of the face organ region with the key point information of the face to obtain the fused feature information, and can output the key point information of 64 mouth key points based on the fused feature information. In one example, the 64 mouth key points and the 106 face key points can be unified into the same position coordinate system through the position conversion manner mentioned in the above-mentioned disclosed embodiments.
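For illustration only, a minimal PyTorch-style sketch of such a second network branch for the mouth is given below, assuming the fusion is a simple concatenation of the cropped-region features with the 106 face key points; the crop size, layer widths, and single-image input are assumptions made for this example.

```python
# Illustrative sketch only; fusion by concatenation is an assumed choice.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class MouthBranch(nn.Module):
    def __init__(self, crop=32, num_face_kpts=106, num_mouth_kpts=64):
        super().__init__()
        self.crop, self.num_mouth_kpts = crop, num_mouth_kpts
        self.features = nn.Sequential(                    # mouth feature extraction module
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.regress = nn.Linear(64 + num_face_kpts * 2, num_mouth_kpts * 2)

    def forward(self, face_image, mouth_box, face_kpts):
        # region-of-interest calibration layer: crop the mouth region at
        # floating-point precision (single image assumed for simplicity)
        region = roi_align(face_image, [mouth_box],
                           output_size=(self.crop, self.crop), aligned=True)
        feat = self.features(region)                           # mouth-region feature info
        fused = torch.cat([feat, face_kpts.flatten(1)], dim=1) # fuse with face key points
        return self.regress(fused).view(-1, self.num_mouth_kpts, 2)
```

The branches for the eyes and eyebrows described below could follow the same pattern, with the eyebrow branches cropping from the block 0 feature map instead of the face image.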
The implementation form of the second network branches for the left eye and the right eye may refer to that of the second network branch of the mouth, which is not described herein again. As shown in the figure, the second network branch for the left eye may output the key point information of 24 left-eye key points, and the second network branch for the right eye may output the key point information of 24 right-eye key points.
The implementation form of the second network branch of the left eyebrow is similar to that of the second network branch of the mouth. As can be seen from the figure, in the application example of the present disclosure, the second network branch of the left eyebrow may use the region-of-interest calibration layer to extract, based on the detection frame information of the left eyebrow, the initial feature information of the face organ region of the left eyebrow from the feature information of the face image output by block 0 in the first network branch, and then perform deep feature extraction based on the initial feature information to obtain the feature information of the face organ region; the remaining process is the same as that of the second network branch of the mouth. The second network branch of the right eyebrow can be implemented with reference to that of the left eyebrow, and is not described herein. As can be seen from the figure, in the application example of the present disclosure, the second network branch of the left eyebrow can output the key point information of 13 left-eyebrow key points, and the second network branch of the right eyebrow can output the key point information of 13 right-eyebrow key points.
After the key points of the face and the key points of the face organs are obtained, some post-processing, such as point sequence adjustment, can be performed on the obtained key points to meet the actual use requirements of key point detection.
Through the above process, the detection results of the key point information of the face and the key point information of each face organ can be obtained simultaneously by using a single target neural network, and the region-of-interest calibration layer (ROI Align) is used to extract the face organ regions, which saves the key point detection time of the whole process, reduces the total time consumption of key point detection, improves the precision of the obtained face organ regions, and thereby improves the precision of the detected key point information of the face organs. Meanwhile, in each second network branch, the feature information of the face organ region can be fused with the key point information of the face output by the first network branch to obtain the fused feature information, so that the output key point information of the face organ is obtained according to the fused feature information.
Further, since the key point detection method proposed in the application example of the present disclosure may be implemented through the target neural network, the method proposed in the application example of the present disclosure may also be used in a training process for the target neural network. As shown in fig. 7, the process of training the target neural network is substantially the same as the application process, except that, in the training process, the face image includes the true-value labels of the key points, and the detection frame information output by the first network branch is enhanced before being input into the second network branches; the enhancement processing manner may refer to the above-described embodiments and is not described herein again. In one example, in the training process, in order to determine the positions of the left and right eyebrows, the positions of the detection frames of the left and right eyebrows may be calculated according to the true-value labels of the key points in the training image, so as to obtain the true-value detection frames of the left and right eyebrows (i.e., the detection frames (truth values) in the figure), and the true-value detection frames are input into the corresponding second network branches.
In the training process, the first network branch and each second network branch can be trained simultaneously and carry out parameter optimization together, so that global optimization is achieved. Through the training process, end-to-end global optimization of the whole target neural network can be realized, and therefore the key point detection precision of the target neural network is improved.
The key point detection method provided in the application example of the present disclosure can be applied not only to key point detection on face images, but can also be extended to the processing of other images, such as human body images, bone images, and the like.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic; due to space limitations, details are not described again in the present disclosure.
It will be understood by those skilled in the art that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile computer readable storage medium or a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
In practical applications, the memory may be a volatile memory (RAM); or a non-volatile memory (non-volatile memory) such as a ROM, a flash memory (flash memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor.
The processor may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It is understood that the electronic devices for implementing the above-described processor functions may be other devices, and the embodiments of the present disclosure are not particularly limited.
The electronic device may be provided as a terminal, server, or other form of device.
Based on the same technical concept of the foregoing embodiments, the embodiments of the present disclosure also provide a computer program, which when executed by a processor implements the above method.
Fig. 8 is a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or another such terminal.
Referring to fig. 8, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 9 is a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 9, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute computer-readable program instructions to implement various aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method for detecting a keypoint, comprising:
acquiring a face image;
processing a face and at least one face organ in the face image by using at least two neural network branches included by a target neural network to obtain a face key point information set, wherein the face key point information set comprises key point information of the face and key point information of the face organ;
the at least two neural network branches comprise a first network branch for detecting a human face and at least one second network branch for detecting human face organs, a first detection result output by the first network branch comprises global information of the human face, the key point information of the human face and/or the feature information of a human face image included in the global information of the human face are transmitted to each second network branch, and each second network branch is used for determining the corresponding key point information of the human face organs by combining the global information of the human face.
2. The method of claim 1, wherein the second network branch comprises a region of interest alignment layer and a feature extraction layer;
the detecting the face and at least one face organ in the face image by using at least two neural network branches included by the target neural network to obtain a face key point information set comprises the following steps:
detecting the face through the first network branch to obtain a first detection result;
extracting a detection area of a human face organ through the interested area calibration layer of each second network branch;
extracting image features of a detection area corresponding to the at least one face organ through a feature extraction layer of each second network branch;
and obtaining the key point information of the at least one face organ by using the image characteristics of the detection area and the global information of the face.
3. The method according to claim 2, wherein the obtaining the key point information of the at least one face organ by using the image features of the detection region and the key point information of the face comprises:
performing at least one fusion processing on the image characteristics of the detection area and the corresponding key point information of the face and/or the characteristic information of the face image to obtain fusion characteristic information;
and obtaining key point information of the face organ according to the fusion characteristic information.
4. The method according to claim 2 or 3, wherein the first detection result further comprises detection frame information of at least one human face organ;
the extracting of the detection area of the human face organ through the interested area calibration layer of each second network branch comprises:
determining the position coordinates of the at least one human face organ in the human face image according to the detection frame information of the at least one human face organ;
and extracting the detection areas of the human face organs respectively matched with the detection frame information of the at least one human face organ through the interested area calibration layer of each second network branch under the precision of the position coordinates.
5. The method according to any one of claims 2 to 4, wherein the detection area of the human face organ comprises: and extracting the characteristic information of the face organ region from the face image and/or extracting the characteristic information of the face organ region from the characteristic information of the face image output by the first network branch.
6. The method according to any one of claims 2 to 5, wherein the first detection result further comprises detection frame information of at least one human face organ;
before the extracting, by the region of interest calibration layer of each second network branch, a detection region of a human face organ, the method further includes:
performing enhancement processing on the detection frame information of the at least one human face organ, wherein the enhancement processing comprises the following steps: a scaling transform process and/or a translation transform process.
7. The method according to any one of claims 1 to 6, wherein the number of the key points of the face in the key point information of the face comprises 68 to 128; and/or,
the number of the key points of the mouth in the key point information of the human face organ comprises 40-80; and/or,
the number of the key points of the left eye in the key point information of the human face organ comprises 16 to 32; and/or,
the number of key points of the right eye in the key point information of the face organ comprises 16 to 32; and/or,
the number of key points of the left eyebrow in the key point information of the face organ comprises 10-20; and/or,
the number of the key points of the right eyebrow in the key point information of the face organ comprises 10-20.
8. The method of any one of claims 1 to 7, wherein the face image includes keypoint annotations, the method further comprising:
determining the error loss of the target neural network according to the key point labels and the face key point information set;
and jointly updating the parameters of at least two neural network branches in the target neural network according to the error loss.
9. A keypoint detection device, comprising:
the image acquisition module is used for acquiring a face image;
the key point detection module is used for processing a face and at least one face organ in the face image by using at least two neural network branches included in a target neural network to obtain a face key point information set, wherein the face key point information set comprises key point information of the face and key point information of the face organ;
wherein the at least two neural network branches comprise a first network branch for detecting the face and at least one second network branch for detecting face organs; a first detection result output by the first network branch comprises global information of the face; the key point information of the face and/or the feature information of the face image included in the global information of the face is transmitted to each second network branch; and each second network branch determines the key point information of the corresponding face organ by combining the global information of the face.
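As a structural sketch only of the branch arrangement in claim 9 (a shared first branch plus per-organ second branches), the following module shows one possible organisation; the backbone, head sizes, organ names and coordinate handling are assumptions rather than the claimed device.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class FaceKeypointNet(nn.Module):
    """First branch: global face key points; second branches: per-organ key points.
    All sizes and names are illustrative assumptions."""

    def __init__(self, organs=("mouth", "left_eye", "right_eye")):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.face_head = nn.Linear(64, 106 * 2)  # first branch: global face key points
        self.organ_heads = nn.ModuleDict(
            {o: nn.Linear(64 * 7 * 7, 24 * 2) for o in organs})  # second branches

    def forward(self, image, organ_boxes):
        feats = self.backbone(image)                        # shared face feature information
        face_kpts = self.face_head(feats.mean(dim=(2, 3)))  # global information of the face
        out = {"face": face_kpts}
        scale = feats.shape[-1] / image.shape[-1]           # image -> feature-map coordinates
        for organ, boxes in organ_boxes.items():
            region = roi_align(feats, boxes, output_size=(7, 7),
                               spatial_scale=scale, aligned=True)
            out[organ] = self.organ_heads[organ](region.flatten(1))
        return out


# Example usage with a random image and one (batch_index, x1, y1, x2, y2) box per organ
net = FaceKeypointNet()
image = torch.randn(1, 3, 224, 224)
boxes = {o: torch.tensor([[0, 60.0, 60.0, 120.0, 100.0]])
         for o in ("mouth", "left_eye", "right_eye")}
preds = net(image, boxes)  # {"face": (1, 212), "mouth": (1, 48), ...}
```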
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 8.
11. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 8.
CN202011596254.6A 2020-12-29 2020-12-29 Key point detection method and device, electronic equipment and storage medium Pending CN112613447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011596254.6A CN112613447A (en) 2020-12-29 2020-12-29 Key point detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011596254.6A CN112613447A (en) 2020-12-29 2020-12-29 Key point detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112613447A true CN112613447A (en) 2021-04-06

Family

ID=75249087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011596254.6A Pending CN112613447A (en) 2020-12-29 2020-12-29 Key point detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112613447A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622240A (en) * 2017-09-21 2018-01-23 百度在线网络技术(北京)有限公司 Method for detecting human face and device
CN108121952A (en) * 2017-12-12 2018-06-05 北京小米移动软件有限公司 Face key independent positioning method, device, equipment and storage medium
CN108229293A (en) * 2017-08-09 2018-06-29 北京市商汤科技开发有限公司 Face image processing process, device and electronic equipment
CN108460343A (en) * 2018-02-06 2018-08-28 北京达佳互联信息技术有限公司 Image processing method, system and server
CN109359537A (en) * 2018-09-14 2019-02-19 杭州宇泛智能科技有限公司 Human face posture angle detecting method neural network based and system
CN110222566A (en) * 2019-04-30 2019-09-10 北京迈格威科技有限公司 A kind of acquisition methods of face characteristic, device, terminal and storage medium
CN111274919A (en) * 2020-01-17 2020-06-12 桂林理工大学 Method, system, server and medium for detecting five sense organs based on convolutional neural network
WO2020134858A1 (en) * 2018-12-29 2020-07-02 北京市商汤科技开发有限公司 Facial attribute recognition method and apparatus, electronic device, and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229293A (en) * 2017-08-09 2018-06-29 北京市商汤科技开发有限公司 Face image processing process, device and electronic equipment
CN107622240A (en) * 2017-09-21 2018-01-23 百度在线网络技术(北京)有限公司 Method for detecting human face and device
CN108121952A (en) * 2017-12-12 2018-06-05 北京小米移动软件有限公司 Face key independent positioning method, device, equipment and storage medium
CN108460343A (en) * 2018-02-06 2018-08-28 北京达佳互联信息技术有限公司 Image processing method, system and server
CN109359537A (en) * 2018-09-14 2019-02-19 杭州宇泛智能科技有限公司 Human face posture angle detecting method neural network based and system
WO2020134858A1 (en) * 2018-12-29 2020-07-02 北京市商汤科技开发有限公司 Facial attribute recognition method and apparatus, electronic device, and storage medium
JP2022515719A (en) * 2018-12-29 2022-02-22 ベイジン センスタイム テクノロジー ディベロップメント カンパニー リミテッド Face attribute recognition method and device, electronic device and storage medium
CN110222566A (en) * 2019-04-30 2019-09-10 北京迈格威科技有限公司 A kind of acquisition methods of face characteristic, device, terminal and storage medium
CN111274919A (en) * 2020-01-17 2020-06-12 桂林理工大学 Method, system, server and medium for detecting five sense organs based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN RUI; LIN DA: "Facial Key Point Localization Based on Cascaded Convolutional Neural Networks", Journal of Sichuan University of Science & Engineering (Natural Science Edition), no. 01, 20 February 2017 (2017-02-20) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255576A (en) * 2021-06-18 2021-08-13 第六镜科技(北京)有限公司 Face recognition method and device
CN113781464A (en) * 2021-09-17 2021-12-10 平安科技(深圳)有限公司 Lip dryness detecting method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
TWI777162B (en) Image processing method and apparatus, electronic device and computer-readable storage medium
US20210326587A1 (en) Human face and hand association detecting method and a device, and storage medium
JP7262659B2 (en) Target object matching method and device, electronic device and storage medium
US20210383154A1 (en) Image processing method and apparatus, electronic device and storage medium
CN109816764B (en) Image generation method and device, electronic equipment and storage medium
CN107692997B (en) Heart rate detection method and device
CN111241887B (en) Target object key point identification method and device, electronic equipment and storage medium
CN111553864B (en) Image restoration method and device, electronic equipment and storage medium
CN110569777B (en) Image processing method and device, electronic device and storage medium
TWI718631B (en) Method, device and electronic apparatus for face image processing and storage medium thereof
CN109840917B (en) Image processing method and device and network training method and device
CN111091610B (en) Image processing method and device, electronic equipment and storage medium
CN111243011A (en) Key point detection method and device, electronic equipment and storage medium
CN109377446B (en) Face image processing method and device, electronic equipment and storage medium
CN112991553A (en) Information display method and device, electronic equipment and storage medium
CN111242303A (en) Network training method and device, and image processing method and device
CN112613447A (en) Key point detection method and device, electronic equipment and storage medium
CN110929616B (en) Human hand identification method and device, electronic equipment and storage medium
WO2022142298A1 (en) Key point detection method and apparatus, and electronic device and storage medium
CN113822798B (en) Method and device for training generation countermeasure network, electronic equipment and storage medium
CN111860388A (en) Image processing method and device, electronic equipment and storage medium
CN111652107A (en) Object counting method and device, electronic equipment and storage medium
US20210158579A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2023051356A1 (en) Virtual object display method and apparatus, and electronic device and storage medium
WO2023173659A1 (en) Face matching method and apparatus, electronic device, storage medium, computer program product, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination