WO2022142298A1 - Key point detection method and device, electronic device and storage medium - Google Patents

Key point detection method and device, electronic device and storage medium

Info

Publication number
WO2022142298A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
key point
information
organ
point information
Prior art date
Application number
PCT/CN2021/108612
Other languages
English (en)
French (fr)
Inventor
李思颖
陈祖凯
王权
钱晨
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Publication of WO2022142298A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Definitions

  • the present disclosure relates to the field of computer vision, and in particular, to a key point detection method and device, an electronic device and a storage medium.
  • Face key point detection is the basis of many face-related applications. It can provide position correction for technologies such as face recognition, and provide semantic information of faces for scenes such as augmented reality and beauty special effects. Therefore, how to detect the key points of the face has become an urgent problem to be solved.
  • the present disclosure proposes a key point detection scheme.
  • a key point detection method comprising:
  • acquiring a face image; and using at least two neural network branches included in the target neural network to detect the face and at least one face organ in the face image, so as to obtain a face key point information set, where the face key point information set includes first key point information of the face and second key point information of the at least one face organ.
  • a key point detection device including:
  • the image acquisition module is used to acquire a face image;
  • the key point detection module is used to detect the face and at least one face organ in the face image by using at least two neural network branches included in the target neural network, and to obtain a face key point information set, where the face key point information set includes first key point information of the face and second key point information of the at least one face organ.
  • an electronic device comprising:
  • a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to execute the above key point detection method.
  • a computer-readable storage medium having computer program instructions stored thereon, the computer program instructions implementing the above key point detection method when executed by a processor.
  • FIG. 1 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure.
  • FIG. 2 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure.
  • FIG. 3 shows a schematic diagram of a keypoint detection method according to an application example of the present disclosure.
  • FIG. 4 shows a schematic diagram of a key point detection method according to an application example of the present disclosure.
  • FIG. 5 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure.
  • FIG. 6 shows a block diagram of a keypoint detection apparatus according to an embodiment of the present disclosure.
  • FIG. 7 shows a schematic diagram of a keypoint detection method according to an application example of the present disclosure.
  • FIG. 8 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 9 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 1 shows a flowchart of a key point detection method according to an embodiment of the present disclosure.
  • the method may be applied to a key point detection apparatus, and the key point detection apparatus may be a terminal device, a server, or other processing devices.
  • the terminal device may be User Equipment (UE), mobile device, user terminal, terminal, cellular phone, cordless phone, Personal Digital Assistant (PDA), handheld device, computing device, vehicle-mounted device, wearable devices, etc.
  • the key point detection method can be applied to a cloud server or a local server
  • the cloud server can be a public cloud server or a private cloud server, which can be flexibly selected according to the actual situation.
  • the keypoint detection method can also be implemented by the processor calling computer-readable instructions stored in the memory.
  • the key point detection method may include:
  • Step S11: acquiring a face image.
  • Step S12: using at least two neural network branches included in the target neural network to detect the face and at least one face organ in the face image, and obtaining a face key point information set, where the face key point information set includes the first key point information of the face and the second key point information of the at least one face organ.
  • the face image may be an image frame containing a human face, and the implementation form may be flexibly determined according to the actual situation, which is not limited in the embodiments of the present disclosure.
  • the number of faces included in the face image is also not limited in the embodiments of the present disclosure; the face image may include the face of a single object, or the faces of multiple objects at the same time.
  • where the face image contains multiple faces, the first key point information of the face and/or the second key point information of at least one face organ corresponding to each object can be obtained through the key point detection method proposed in the embodiments of the present disclosure.
  • Subsequent disclosed embodiments are described by taking the human face image including the human face of a single object as an example.
  • the case where the face image includes the faces of multiple objects can be handled by flexibly extending the subsequent disclosed embodiments, and the variations will not be listed one by one.
  • the acquisition method of the face image is not limited in the embodiments of the present disclosure.
  • the face image can be obtained by reading a database storing face images, by capturing an image of a face in certain scenarios, or by clipping or sampling frames from a video containing a human face, etc.
  • the specific acquisition method can be flexibly determined according to the actual situation.
  • the face image may be an image obtained from a sequence of face image frames.
  • the face image frame sequence may be a frame sequence including multiple frames of face images, and an implementation manner thereof is not limited in this embodiment of the present disclosure.
  • the manner of obtaining a face image from a face image frame sequence is not limited in the embodiments of the present disclosure; the face image may be randomly sampled from the sequence, or may be selected from the sequence according to preset requirements.
  • the number of face images acquired in step S11 is not limited in this embodiment of the present disclosure, and can be flexibly selected according to actual conditions. Subsequent disclosed embodiments are described by taking obtaining a single face image as an example, and the processing method after obtaining multiple face images can be flexibly expanded with reference to subsequent disclosed embodiments, which will not be repeated in the embodiments of the present disclosure.
  • the target neural network can be a neural network for processing face images, and its implementation form can be flexibly determined according to the actual situation.
  • the target neural network may include at least two neural network branches.
  • the number of neural network branches included in the target neural network, the implementation form and connection relationship of each neural network branch, etc., can be flexibly determined according to the actual situation.
  • the face and at least one face part in the face image can be processed to obtain the face key point information set.
  • the set of face key point information may be the relevant information of the face key points contained in the face image, and what information content is included can be flexibly determined according to the actual situation.
  • the set of face key point information may include first key point information of the face and second key point information of at least one face part.
  • the first key point information of the human face may include key points that locate various parts or organs of the face so as to determine the overall situation of the face, such as key points of organs like the eyes, mouth, nose and eyebrows, and key points on parts such as the cheeks, forehead or chin, etc.
  • the number of the first key points in the first key point information of the face and which key points in the face are specifically included can be flexibly set according to the actual situation, and are not limited in the embodiments of the present disclosure.
  • the number of the first key points of the human face may be in the range of 68 to 128, and the specific number to be set may be flexibly selected according to the actual situation.
  • the number of the first key points of the human face can be set to 106, which is denoted as Face 106.
  • the neural network branch in the target neural network can detect the first key points of the face in the input face image, thereby outputting the information of the 106 first key points of the face.
  • the second key point information of the face organs may include key points contained in various parts or organs of the human face that can be used to determine the condition of those parts or organs, such as key points contained in organs like the eyes, mouth, nose or eyebrows, etc.
  • the second key point information may describe the same part or organ of the human face as the first key point information of the face.
  • compared with the first key point information describing the same part or organ, the second key points of the face organ may be larger in number and more densely distributed on the corresponding part or organ.
  • the number of the second key points of the face organs, and which organs or parts of the face are specifically included, can be flexibly set according to the actual situation, and are not limited in the embodiments of the present disclosure.
  • the number of key points for the mouth may be in the range of 40 to 80;
  • the number of key points for the left eye may be in the range of 16 to 32;
  • the number of key points for the right eye may also be in the range of 16 to 32;
  • the number of key points for the left eyebrow may be in the range of 10 to 20;
  • the number of key points for the right eyebrow may also be in the range of 10 to 20.
  • the specific number of key points of the organ can be flexibly selected according to the actual situation.
  • the number of key points of the mouth can be set to 64, denoted as mouth 64; the numbers of key points of the left eye and the right eye can both be set to 24, denoted as eye 24; and the numbers of key points of the left eyebrow and the right eyebrow can both be set to 13, denoted as eyebrow 13.
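  • The counts above can be summarized in a small configuration. A minimal sketch follows; the dictionary structure and key names are illustrative assumptions, not identifiers from the present disclosure.

```python
# Illustrative keypoint layout matching the counts discussed above.
# Keys and structure are assumptions made for this sketch only.
FACE_KEYPOINT_LAYOUT = {
    "face": 106,         # Face 106: first key point information of the face
    "mouth": 64,         # mouth 64
    "left_eye": 24,      # eye 24
    "right_eye": 24,     # eye 24
    "left_eyebrow": 13,  # eyebrow 13
    "right_eyebrow": 13, # eyebrow 13
}

# Total organ-level (second) key points in this layout: 64+24+24+13+13 = 138.
total_second = sum(n for k, n in FACE_KEYPOINT_LAYOUT.items() if k != "face")
print(total_second)  # 138
```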
  • in the embodiments of the present disclosure, the face image including the face is obtained, and a face key point information set including the first key point information of the face and the second key point information of the at least one face organ is obtained through the target neural network.
  • the neural network branches included in the target neural network can be flexibly determined according to the actual situation.
  • the at least two neural network branches may include a first network branch for detecting a human face, and at least one second network branch for detecting at least one face part.
  • FIG. 2 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure.
  • step S12 may include:
  • Step S121: detecting the human face through the first network branch to obtain a first detection result, where the first detection result includes first key point information of the human face and detection frame information of at least one face organ;
  • Step S122: based on the first detection result, detecting at least one face organ through the at least one second network branch to obtain a second detection result, where the second detection result includes second key point information of the at least one face organ.
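  • The two-stage flow of steps S121 and S122 can be pictured with a short PyTorch-style sketch. This is a hypothetical scaffold under assumptions, not the disclosed implementation: the branch modules are placeholders, and the first branch is assumed to return key points, per-organ boxes, and features.

```python
import torch.nn as nn

class TargetNetworkSketch(nn.Module):
    """Hypothetical sketch of steps S121/S122: a first branch detects the
    whole face; per-organ second branches refine key points inside the
    organ detection boxes it produces. Module internals are placeholders."""

    def __init__(self, first_branch, second_branches):
        super().__init__()
        self.first_branch = first_branch                        # face -> (kpts, boxes, feats)
        self.second_branches = nn.ModuleDict(second_branches)   # e.g. {"mouth": ..., "left_eye": ...}

    def forward(self, face_image):
        # Step S121: whole-face detection (the first detection result).
        face_kpts, organ_boxes, face_feats = self.first_branch(face_image)

        # Step S122: per-organ detection conditioned on the first result.
        organ_kpts = {
            organ: branch(face_image, organ_boxes[organ], face_feats, face_kpts)
            for organ, branch in self.second_branches.items()
        }

        # The face key point information set: first + second key point information.
        return {"face": face_kpts, **organ_kpts}
```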
  • the first network branch may be a network structure that detects the face in the face image, thereby obtaining the first key point information of the face, and its implementation form can be flexibly determined according to the actual situation; for details, refer to the following disclosed embodiments, which will not be expanded here.
  • the first detection result may be the detection result obtained by the first network branch detecting the face in the face image.
  • the first detection result may include first key point information of the human face and detection frame information of at least one facial organ.
  • the first detection result may further include other information; for example, it may also include first feature information of the face image obtained by performing feature extraction on the face in the face image, where the first feature information can reflect the overall features of the face in the face image.
  • the specific implementation form of the first feature information of the face image can be found in the following disclosed embodiments, which will not be expanded here.
  • the second network branch can be a network structure that detects each organ of the face in the face image, thereby obtaining the second key point information of the face organ; its implementation form can also be flexibly determined according to the actual situation, and is not expanded here.
  • the second detection result may be a detection result obtained by at least one second network branch detecting at least one facial organ of the human face in the human face image based on the first detection result.
  • the second detection result may include second key point information of at least one face organ. Since the first detection result reflects the overall situation of the face obtained by detecting the face in the face image, determining the second key point information of the face organ based on the first detection result makes it more consistent with the first key point information of the face, so that the face key point information set is more accurate. How the at least one second network branch determines the second key point information of the at least one face organ can be referred to in the following disclosed embodiments, and will not be described here.
  • the target neural network can be used to obtain the first key point information of the face and the second key point information of at least one face organ at the same time, realizing end-to-end face key point detection.
  • the second key point information of each face organ can be located by using the first detection result reflecting the overall situation of the face, so that in the obtained face key point information set the positions of the whole face and the local organs are relatively consistent and have high accuracy, which effectively improves the precision of the identified key points.
  • step S121 may include:
  • detecting the human face in the face image through the first network branch to obtain the first key point information of the face and the detection frame information of at least one face organ.
  • step S121 may further include: obtaining first feature information of the face image according to an intermediate feature output by at least one network layer in the first network branch.
  • the first network branch may be a network structure for detecting the first key point information of the face in the face image; therefore, in a possible implementation manner, when the face image is used as the input, the first key point information of the face can be obtained according to the output of the first network branch.
  • the detection frame information of at least one face organ may be a detection frame representing the position of each organ in the face image.
  • the positions of which organs in the face image are specifically expressed can be flexibly set according to the actual situation, which is not limited in the embodiments of the present disclosure.
  • the shape and implementation form of the detection frame can be flexibly determined according to the actual situation, and the shape and implementation form of the detection frame of different organs can be the same or different.
  • the detection frame of each organ may be a rectangular frame, and the position of each detection frame may be represented by the coordinates of the vertices in the rectangular frame.
  • the first network branch of the target neural network can output the detection frame information of at least one face organ while outputting the first key point information of the face.
  • the first feature information of the human face image may be determined by the first network branch according to the entirety of the human face in the human face image, and reflects the feature information of the overall situation of the human face. How to obtain the first feature information of the face image and which feature data is included can be flexibly determined according to the actual situation, and is not limited to the following disclosed embodiments.
  • the implementation form of the first network branch may be flexibly determined according to actual conditions.
  • the first network branch may be formed by a plurality of connected network layers, and the specific implementation form and combination form of these network layers may be flexibly determined according to the actual situation, and are not limited to the following disclosed embodiments .
  • the first network branch may include a shallow feature extraction network structure block0 and a deep feature extraction network structure blockmain connected in sequence, and block0 and blockmain may each include multiple network layers in the form of convolutional layers, pooling layers, or classification layers.
  • through block0, the first network branch can preliminarily extract the feature information of the face image, and the obtained preliminary features can be input into blockmain for further feature extraction; regression is then performed based on the further extracted features, so as to obtain the output of the first network branch and realize the detection of the first key point information of the face.
  • in an example, the preliminary features extracted by block0 can be used as the first feature information of the face image; in an example, the deep features extracted by blockmain can be used as the first feature information; in an example, both can be used together. Which features to choose as the first feature information of the face image can be flexibly determined according to the actual situation, and is not limited in the embodiments of the present disclosure.
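  • A compact sketch of such a branch follows, with invented layer sizes; the present disclosure only specifies the block0 to blockmain ordering and the regressed outputs (106 key points plus one detection box per organ).

```python
import torch
import torch.nn as nn

class FirstBranchSketch(nn.Module):
    """Illustrative first network branch: shallow block0 feeding deep
    blockmain, then regression heads. All layer sizes are assumptions."""

    def __init__(self, num_kpts=106, num_organs=5):
        super().__init__()
        self.num_kpts, self.num_organs = num_kpts, num_organs
        self.block0 = nn.Sequential(                     # shallow feature extraction
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.block_main = nn.Sequential(                 # deep feature extraction
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.kpt_head = nn.Linear(128, num_kpts * 2)     # (x, y) per key point
        self.box_head = nn.Linear(128, num_organs * 4)   # one box per organ

    def forward(self, x):
        shallow = self.block0(x)          # candidate first feature information
        deep = self.block_main(shallow)   # candidate first feature information
        kpts = self.kpt_head(deep).view(-1, self.num_kpts, 2)
        boxes = self.box_head(deep).view(-1, self.num_organs, 4)
        return kpts, boxes, shallow

out = FirstBranchSketch()(torch.randn(1, 3, 112, 112))
print([t.shape for t in out])  # shapes: (1, 106, 2), (1, 5, 4), (1, 64, 28, 28)
```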
  • the implementation order of the above steps is not limited, and can be flexibly determined according to actual conditions.
  • the first key point information of the face, the detection frame of at least one face organ, and the overall features of the face can be obtained at the same time, or can be obtained in a certain order.
  • the implementation of step S122 can be flexibly determined according to the actual situation.
  • step S122 may include: for each second network branch in the at least one second network branch, using the second network branch to detect a face organ corresponding to the second network branch, A second detection result of the face organ corresponding to the second network branch is obtained.
  • the second key point information of different organs in the face can be obtained respectively through the multiple second network branches included in the target neural network.
  • the second key point information of the mouth can be detected through the second network branch of the mouth
  • the second key point information of the eyes can be detected through the second network branch of the eyes. Therefore, in a possible implementation manner, the number of second network branches included in the target neural network and the organs used for detection by the second network branch can be flexibly determined according to the actual situation.
  • the second network branches are independent of each other; that is, the order and process of the detection performed by each second network branch are not affected or interfered with by the detection processes or data of the other second network branches.
  • the structures of the second network branches may be the same or different, which are not limited in the embodiments of the present disclosure.
  • the detection process in each second network branch may be the same or different, which is also not limited in this embodiment of the present disclosure.
  • by using each second network branch in the at least one second network branch to detect the face organ corresponding to that branch, a second detection result of the corresponding face organ is obtained.
  • multiple second network branches can be used to independently perform face organ detection on multiple face organs in the face image, thereby improving the efficiency and flexibility of key point detection.
  • detecting the face organ corresponding to each second network branch, and obtaining the second detection result of the face organ corresponding to the second network branch may include:
  • extracting, from the face image, a face organ region matching the detection frame information of the face organ; extracting second feature information of the face organ region; and determining the second key point information of the face organ based on the second feature information.
  • the face organ region may be a region in the face image including the face organ, as described in the above disclosed embodiments.
  • the detection frame information of the face organs may be detection frames representing the positions of each organ in the face image. In a possible implementation manner, based on the detection frame information of the face organs, the positions of the face organs in the face image can be determined; therefore, the face organ region matching the detection frame information can be extracted from the face image.
  • the face image may be cut according to the position determined by the detection frame information of the face organ to obtain the face organ region.
  • the detection frame of the mouth organ may be used to cut the face image in the second network branch for detecting the key points of the mouth to obtain the face organ region of the mouth.
  • the cutting method can be flexibly determined according to the actual situation. For details, please refer to the following disclosed embodiments, which will not be expanded here.
  • the second network branch can further extract the second feature information of the face organ region.
  • the second feature information of the face organ region may be feature information reflecting local conditions of the face organ, and the extraction method can be flexibly determined according to the actual situation, and is not limited to the following disclosed embodiments.
  • a network layer with a feature extraction function in the second network branch can perform shallow and/or deep feature extraction on the face organ region, and the extracted shallow features and/or deep features can be used as the second feature information of the face organ region, etc.
  • the second feature information of the face organ region can be processed to obtain second key point information of the face organ region.
  • the manner of processing the second feature information of the facial organ region is not limited in the embodiments of the present disclosure.
  • a network layer with a key point calculation or identification function in the second network branch can perform regression calculation on the second feature information of the face organ region to obtain the second key point information of the face organ.
  • the second feature information of the face organ region can also be combined with part or all of the information in the first detection result through a network layer with a key point calculation or identification function in the second network branch, and regression calculation is performed to determine the second key point information of the face organ. Specifically, how to determine the second key point information according to the second feature information of the face organ region and the first detection result can be found in the following disclosed embodiments, and will not be expanded here.
  • in this way, the second key point information of the face organ can be detected based on the extracted face organ region, which effectively improves the detection accuracy of the second key points of the face organ and thereby the overall detection accuracy of the key point detection method.
  • the extracted face organ region matches the detection frame information of the face organ included in the first detection result, so the obtained second key point information of the face organ is correlated with the first key point information of the face output by the first network branch, which further improves the overall accuracy of the target neural network in detecting face key points.
  • the second key point information of the face organ is determined based on the second feature information of the face organ region and the first detection result. Since the first detection result is obtained by detecting the face in the face image, it can provide a positioning reference for the positions of each organ in the face image, so that the determined second key point information of the face organ can have higher accuracy and be more positionally consistent with the first key point information of the face, thereby further improving the detection accuracy of the key points.
  • the first detection result may include the first feature information of the face image. Therefore, in a possible implementation manner, the face organs corresponding to each second network branch are detected, Obtaining the second detection result of the face organ corresponding to the second network branch may also include:
  • extracting, from the first feature information of the face image, initial feature information of the face organ region matching the detection frame information of the face organ; performing deep feature extraction on the initial feature information to obtain second feature information of the face organ region; and determining the second key point information of the face organ based on the second feature information and the first detection result.
  • the initial feature information of the face organ region may be feature information related to the face organ region in the first feature information of the face image, as described in the above disclosed embodiments.
  • the detection frame information of the face organs may be detection frames representing the positions of each organ in the face image. In a possible implementation manner, based on the detection frame information of the face organs, the position of the face organ region in the first feature information of the face image can be determined; therefore, according to the detection frame information, the initial feature information of the matching face organ region can be extracted from the first feature information of the face image.
  • the manner of extracting the initial feature information of the facial organ region is not limited in the embodiments of the present disclosure.
  • the feature map of the face image can be cut according to the position determined by the detection frame information of the face organ to obtain the initial feature information of the face organ region.
  • the detection frame of the left eye can be used to cut the feature map of the face image in the second network branch for left-eye key point detection to obtain the initial feature information of the left-eye face organ region; for the cutting forms of other organs, such as the mouth or the eyebrows, reference may be made to the above disclosed embodiments, and details are not described here again.
  • the cutting method can be flexibly determined according to the actual situation. For details, please refer to the following disclosed embodiments, which will not be expanded here.
  • the second network branch can further perform deep feature extraction on the initial feature information to obtain the second feature information of the face organ region.
  • the implementation form of the second feature information of the face organ region can be found in the above disclosed embodiments, and details are not described herein again.
  • the manner of performing deep feature extraction on the initial feature information is also not limited in the embodiments of the present disclosure.
  • a network layer with a feature extraction function in the second network branch can further perform deep feature extraction on the initial feature information of the facial organ region, and extract the deep features of the extracted facial organ region. The feature is used as the second feature information of the face organ region, and so on.
  • the second key point information of the face organ region can be determined based on the second feature information of the face organ region and the first detection result.
  • for this process, please refer to the above disclosed embodiments; details are not repeated here.
  • in this way, the second feature information of the face organ region is obtained on the basis of the first feature information of the face image, and the second key point information of the face organ is then determined based on the second feature information and the first detection result, so that the second network branch and the first network branch can share part of the feature extraction network layer structure.
  • the manner of extracting the face organ region can be flexibly determined according to the actual situation.
  • extracting a face organ region matching the detection frame information of the face organ from the face image may include:
  • the detection frame information of the face organs may be detection frames representing the positions of each organ in the face image. Therefore, based on the detection frame information, the position coordinates of the facial organs in the facial image can be determined.
  • the face organ region can be extracted at the precision of the position coordinates through the region-of-interest alignment layer of the second network branch.
  • the precision of the position coordinates may be the numerical precision of the position coordinates.
  • the vertex coordinates of the detection frame in the detection frame information may be floating-point numbers, and the precision of the position coordinates determined based on the detection frame information may be consistent with the floating-point digits of the vertex coordinates of the detection frame.
  • the vertex coordinates of the detection frame may be integers, and the precision of the position coordinates may be integers.
  • the region-of-interest alignment layer may be a network layer with an image cropping function, and its implementation form is not limited in the embodiments of the present disclosure. As can be seen from the above disclosed embodiments, in a possible implementation manner, in the process of cropping the face image by the region-of-interest alignment layer, the cropping precision can be the same as the precision of the position coordinates of the face organs in the face image. Therefore, when the precision of the position coordinates is a floating-point number, the region-of-interest alignment layer can be a network layer capable of performing image cropping at floating-point precision, and any network layer with this function can serve as an implementation form of the region-of-interest alignment layer.
  • extracting the face organ region in this way can effectively improve the accuracy of the extracted region, and in turn the accuracy of the second key point information of the face organ determined based on it, thereby improving the accuracy of key point detection.
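  • One off-the-shelf layer with this behaviour is RoI Align, which interpolates bilinearly so fractional box coordinates are honoured instead of rounded. A minimal sketch using torchvision follows; the box values are invented for illustration, and this is one possible realization rather than the disclosed one.

```python
import torch
from torchvision.ops import roi_align  # floating-point-precision region cropping

feature_map = torch.randn(1, 64, 56, 56)               # (N, C, H, W)
# One box per row: (batch_index, x1, y1, x2, y2), floating-point coordinates.
mouth_box = torch.tensor([[0, 20.3, 31.7, 41.9, 47.2]])

# aligned=True keeps sub-pixel coordinates instead of snapping to integer pixels.
mouth_region = roi_align(feature_map, mouth_box, output_size=(24, 24), aligned=True)
print(mouth_region.shape)  # torch.Size([1, 64, 24, 24])
```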
  • the method of extracting the initial feature information of the face organ region from the first feature information of the face image can be realized with reference to the above method of extracting the face organ region from the face image, and is not repeated here.
  • determining the second key point information of the face organ may include:
  • fusing the second feature information of the face organ region with the first feature information of the face image and/or the first key point information of the face to obtain fusion feature information, and obtaining the second key point information of the face organ according to the fusion feature information.
  • the second feature information of the face organ region may be obtained in the second network branch by feature extraction on the face organ region and/or on the initial feature information of the face organ region; for its implementation form, reference may be made to the above disclosed embodiments, which will not be repeated here.
  • the first feature information of the face image and/or the first key point information of the face in the first detection result obtained by the first network branch may be fused with the second feature information of the face organ region to obtain fusion feature information. Since the first feature information of the face image and/or the first key point information of the face are obtained based on the complete face image and can reflect the whole face, the second key point information of the face organ further regressed from the fusion feature information can be positionally consistent with the obtained first key point information of the face, and therefore has higher precision.
  • FIG. 3 shows a schematic diagram of an application example according to the present disclosure.
  • as shown in FIG. 3, the second feature information of each face organ region can be obtained, and the second feature information of these face organ regions can each be fused with the 106 overall key points of the face (that is, the first key point information of the face) output by the first branch in the figure; the form of fusion can be connection or other forms, etc., to obtain fusion feature information (not marked in the figure).
  • the fusion feature information can be further calculated to obtain the second key points of the face organs output by each second branch (eg, 64 key points for the mouth, 24 key points for the left eye, and 13 key points for the right eyebrow, etc.).
  • FIG. 4 shows a schematic diagram of an application example according to the present disclosure.
  • as shown in FIG. 4, the second feature information of each face organ region can be obtained, and the second feature information of these face organ regions can each be fused with the first feature information of the face image extracted by the shallow feature extraction module of the first branch in the figure; the form of fusion can be addition or other forms, to obtain fusion feature information (not marked in the figure).
  • the fusion feature information can be further calculated to obtain the second key points of the face organs output by each second branch (eg, 64 key points for the mouth, 24 key points for the left eye, and 13 key points for the right eyebrow, etc.).
  • the 106 overall key points of the face in FIG. 3 (that is, the first key point information of the face) and the first feature information of the face image in FIG. 4 may both be fused with the second feature information of each face organ region.
  • the second feature information of the face organ region may also be fused with both the first key point information of the face and the first feature information of the face image.
  • the first feature information of the face image may include features extracted by a certain network layer or certain network layers in the first network branch. Therefore, in the process of fusing the second feature information of the face organ region with the first feature information of the face image, the fusion may be performed with features extracted by any network layer in the first network branch. Which network layer's features are specifically selected, and whether deep or shallow features are used, can be flexibly determined according to the actual situation, which is not limited in the embodiments of the present disclosure. How many fusions are performed in the process of acquiring the fusion feature information can be flexibly determined according to the objects to be fused, and is likewise not limited in the embodiments of the present disclosure.
  • the corresponding fusion feature information can be processed in the second network branch, so as to obtain the output second key point information of the face organ.
  • the processing manner of the fusion feature information by the second network branch is not limited in the embodiments of the present disclosure, and may be flexibly selected according to the actual situation.
  • the fusion feature information may be processed by a network layer such as a regression layer or a classification layer to obtain the second key point information of the output face organs.
  • the fusion feature information may also be processed through a network structure composed of multiple network layers, so as to obtain the output second key point information and the like of the face organ.
  • in this way, the fusion feature information is obtained, and the second key point information of the face organ is obtained according to the fusion feature information. Because the fusion draws on the first feature information of the face image and/or the first key point information of the face, which reflect the overall situation of the face obtained by the first network branch, the obtained second key point information of the face organ is consistent with the first key point information of the face and has higher precision.
  • the manner of fusion is not limited in the disclosed embodiments.
  • the fusion process may include at least one of the following operations: connection, addition, weighted fusion, and attention feature fusion.
  • connection can be directly splicing the fused objects to achieve fusion; addition can be adding the fused objects at corresponding pixels to obtain the fused features; weighted fusion can be assigning preset weights to the fused objects and adding them according to those weights; and attention feature fusion can be based on the attention mechanism, fusing the objects through operations such as connection and skip connection.
  • the second feature information of the face organ region can be combined with the first feature information of the face image and/or the first key point information of the face in various forms of fusion, thereby further increasing the comprehensiveness and accuracy of the fusion feature information, and in turn improving the accuracy of the second key point information of the face organs obtained based on it.
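  • The simpler fusion forms can be sketched directly; attention feature fusion is omitted for brevity. The shapes, weights, and function name below are assumptions for illustration.

```python
import torch

def fuse(organ_feats, face_info, mode="concat"):
    """Sketch of the fusion operations named above, on flattened vectors
    of shape (N, D_organ) and (N, D_face)."""
    if mode == "concat":    # connection: direct splicing
        return torch.cat([organ_feats, face_info], dim=1)
    if mode == "add":       # addition: element-wise sum (dims must match)
        return organ_feats + face_info
    if mode == "weighted":  # weighted fusion; the 0.7/0.3 weights are invented
        return 0.7 * organ_feats + 0.3 * face_info
    raise ValueError(mode)

# e.g. fuse mouth-region features with the flattened 106 face key points.
mouth_feats = torch.randn(2, 256)
face_kpts = torch.randn(2, 106 * 2)
fused = fuse(mouth_feats, face_kpts, mode="concat")
print(fused.shape)  # torch.Size([2, 468])
```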
  • step S12 may further include:
  • Enhancement processing is performed on the detection frame of at least one face part, wherein the enhancement processing includes: scaling transformation processing and/or translation transformation processing.
  • the detection frame information may be the detection frame information of the face organs output by the first network branch, which will not be repeated here.
  • enhancement processing may be performed on the detection frame information, such as the scaling transformation processing and/or translation transformation processing mentioned in the above disclosed embodiments.
  • the scaling transformation process may be to expand or compress the detection frame in the obtained detection frame information. In a possible implementation manner, it may be to perform random scaling transformation on the detection frame within a preset scaling range.
  • the value of the preset scaling range can be flexibly set according to the actual situation, and is not limited to the following disclosed embodiments.
  • the preset scaling range may be between 0.9 times and 1.1 times the size of the detection frame.
  • the translation transformation process may be to move the overall position of the detection frame in the obtained detection frame information.
  • the detection frame may be randomly translated within a preset translation range.
  • the preset translation range can also be flexibly set according to the actual situation.
  • the preset translation range may be between ⁇ 0.05 times the length of the detection frame in the translation direction, wherein "+" and "-" in ⁇ represent the translation direction and the opposite direction of the translation direction, respectively.
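  • Using the ranges quoted above (0.9 to 1.1 times scaling, plus or minus 0.05 box-length translation), the enhancement processing can be sketched as a small helper; the function name and box format are assumptions.

```python
import random

def augment_box(x1, y1, x2, y2, scale_range=(0.9, 1.1), shift_ratio=0.05):
    """Random scaling and translation of an organ detection box."""
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1

    # Scaling transformation: expand or compress around the box centre.
    s = random.uniform(*scale_range)
    w, h = w * s, h * s

    # Translation transformation: shift by up to +/- shift_ratio of the
    # box length along each axis.
    cx += random.uniform(-shift_ratio, shift_ratio) * w
    cy += random.uniform(-shift_ratio, shift_ratio) * h

    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(augment_box(20.0, 30.0, 44.0, 54.0))  # a jittered variant of the box
```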
  • in step S122, the second detection result may be obtained by detecting at least one face organ through the at least one second network branch using the first detection result that includes the enhanced detection frame information.
  • for the specific manner of detection, reference may be made to the above disclosed embodiments, which will not be repeated here.
  • in this way, the richness of the training data for training the target neural network can be increased, so that the trained target neural network achieves a better key point detection effect under different input data, improving the processing accuracy and robustness of the target neural network and thereby the accuracy of key point detection.
  • the key point detection method proposed in the embodiment of the present disclosure may further include:
  • acquiring, from the second key point information of the face organ, the second key points that meet a preset accuracy, and replacing the corresponding first key points of the face in the first key point information with them, to obtain updated first key point information of the face.
  • the preset accuracy can be flexibly set according to the actual situation, which is not limited in the embodiments of the present disclosure; it can be a manually set precision, or the precision of the first key point information of the face, etc.
  • the key points of the face organ may have high precision, and the determined first key point information of the face and second key point information of the face organ may contain key points at the same position. Therefore, in some possible implementations, where a second key point of the face organ exists at a position corresponding to a first key point of the face and meets the preset accuracy, that second key point can be used as the first key point of the face at the corresponding position, so as to replace the first key point information of the face and obtain the updated first key point information.
  • by acquiring the second key points of the face organ that meet the preset accuracy and replacing the corresponding first key points of the face accordingly, the updated first key point information of the face is obtained, which can further improve the accuracy of key point detection and yield key points that meet the accuracy requirements.
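  • A minimal sketch of this replacement step follows, assuming an index_map from organ key point indices to the face key point indices they coincide with, and a per-organ-key-point precision flag; both names are invented for the sketch.

```python
def replace_with_organ_keypoints(face_kpts, organ_kpts, index_map, precision_ok):
    """Wherever an organ-level (second) key point coincides positionally
    with a face-level (first) key point and meets the preset accuracy,
    it takes the first key point's place."""
    updated = list(face_kpts)
    for organ_idx, face_idx in index_map.items():
        if precision_ok[organ_idx]:
            updated[face_idx] = organ_kpts[organ_idx]
    return updated
```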
  • the key point detection method proposed in the embodiment of the present disclosure may further include:
  • the position of the second key point information of each face organ in the face organ region is converted to obtain the position of the second key point information of the face organ in the face image.
  • since the second key point information of the face organ may be obtained by extraction based on the face organ region, the obtained position of a second key point of the face organ may be a location within the face organ region.
  • the position of the first key point of the human face may be the position in the human face image.
  • the position of the second key point information of the face organ in the face organ region can be converted to obtain the position of the second key point information of the face organ in the face image .
  • the conversion manner is not limited in the embodiments of the present disclosure.
  • the position transformation relationship between the face image and the face organ region can be determined according to the coordinates of the vertices or center points of the face image and the face organ region, and based on this transformation relationship, the position of the second key point of the face organ in the face organ region is transformed to obtain its position in the face image.
  • by converting the position of the second key point information of the face organ in the face organ region, its position in the face image can be obtained, and the positions of the first key point information of the face and the second key point information of the face organ are thereby unified, which facilitates subsequent analysis and operation processing of each key point in the face.
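  • For an axis-aligned crop, this conversion is a pure scale-and-offset mapping. A minimal sketch under that assumption follows; the function name and the box/crop conventions are invented for illustration.

```python
def organ_to_image_coords(kpts, box, crop_size):
    """Map key points from organ-region coordinates back to face-image
    coordinates, assuming the region was cropped from box (x1, y1, x2, y2)
    and resized to crop_size (w, h)."""
    x1, y1, x2, y2 = box
    sx = (x2 - x1) / crop_size[0]
    sy = (y2 - y1) / crop_size[1]
    return [(x1 + px * sx, y1 + py * sy) for (px, py) in kpts]

# A mouth key point at (12.0, 8.0) inside a 24x24 crop of box (20, 30, 44, 54):
print(organ_to_image_coords([(12.0, 8.0)], (20, 30, 44, 54), (24, 24)))
# [(32.0, 38.0)]
```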
  • in a possible implementation manner, the position of the first key point information of the face in the face image can also be converted into the face organ region; the position of the second key point information of one face organ in its corresponding face organ region can be converted into another face organ region, etc.; and the first key point information of the face and the second key point information of each face organ can all be converted into a preset image coordinate system, and so on.
  • the specific conversion method can be flexibly selected according to the actual situation, and is not limited to the above disclosed embodiments.
  • the target neural network may further include at least one third network branch, and the at least one third network branch may be used to detect the state of the face according to the first key point information of the face.
  • that is, the at least one third network branch is used to obtain the detection result of the face image in one or more states.
  • the state may be the situation of the face reflected in the face image, and the specific state may be flexibly set according to the actual situation.
  • the state may be the situation of the face itself, such as whether the eyes in the face are open or closed, which object the face corresponds to, and so on.
  • the state may also be the situation of the human face in the image, such as whether the human face is occluded in the image, and the like.
  • the number of states that can be detected by at least one third network branch is also not limited in this embodiment of the present disclosure.
  • a third network branch may output the detection result of the face image in one state.
  • a third network branch can also output the detection results of the face image in multiple states.
  • the number of third network branches included in the target neural network is also not limited in this embodiment of the present disclosure.
  • the target neural network may include multiple third network branches, and through the multiple third network branches, the detection of multiple states of the face image may be implemented respectively.
  • the target neural network may also include only one third network branch, and through the third network branch, a certain state can be detected for the face image, and multiple face images can also be detected. Status detection, etc.
  • the position of the third network branch in the target neural network is not limited in the embodiments of the present disclosure, and can be flexibly set according to actual conditions, and is not limited to the following disclosed embodiments.
  • the third network branch may be connected to the output of the first network branch.
  • the third network branch may also be connected to the feature extraction layer of the first network branch.
  • the third network branch may also be connected to some or some of the second network branches, and so on.
  • with the target neural network including at least one third network branch, on the one hand, state detection can further be performed on the face image to assist in judging the accuracy of the obtained face key point information set; on the other hand, end-to-end key point detection is further realized, while also facilitating the introduction of new detection models to realize end-to-end face state detection, etc.
  • the method proposed in the embodiment of the present disclosure may further include:
  • Step S13: tracking, according to the face key point information set, the face in the face image frame sequence where the face image is located.
  • the implementation manner of how to use the detected key points to track the face image can be flexibly determined according to the actual situation, and is not limited to the following disclosed embodiments.
  • step S13 may include:
  • determining at least one target key point according to the first key point information of the face and/or the second key point information of the face organ; correcting the target face image according to the at least one target key point; and inputting the corrected target face image into the target neural network, and tracking the same object in the face image and the target face image according to the output of the target neural network.
  • the objects in the face image frame sequence may be objects corresponding to the faces contained in the face image.
  • a face image may contain multiple faces, so the key point detection method proposed in the embodiments of the present disclosure can track a single object or multiple objects.
  • the target key point may be a key point used to determine the position of the object in the tracking process. Which points are specifically selected as target key points are not limited in the embodiments of the present disclosure, and are not limited to the following disclosed embodiments.
  • the first key point of the face can be used as the target key point.
  • the second key point of the face organ can also be used as the target key point.
  • the first key point of the face and the second key point of the face organ may also be used as target key points.
  • according to actual tracking requirements, part of the key points in the first key point information of the face can also be replaced with second key points of the face organs to obtain the target key points.
  • the next frame after the face image in the face image frame sequence can be used as the target face image, and based on the target key points, the target face image is corrected to obtain the corrected target face image.
  • the correction method can be flexibly selected according to the actual situation. For details, please refer to the following disclosed embodiments, which will not be expanded here.
  • since the target face image is the next frame after the face image in the face image frame sequence, the face in the target face image may have moved (by translation or rotation, etc.). If the movement relative to the face image is large, directly inputting the target face image into the target neural network may fail to detect the first key point information of the face or the second key point information of the face organs in the target face image. Therefore, in a possible implementation manner, at least one target key point corresponding to the face image can be used to correct the target face image, so that the corrected target face image can achieve more accurate key point detection results in the target neural network, allowing tracking to continue and improving the continuity and accuracy of tracking.
  • the corrected target face image can be used as a new face image and input into the target neural network, and processed by the key point detection methods proposed in the above disclosed embodiments to obtain the corresponding first key point information of the face and/or second key point information of the face organs, and then the target key points of the new face image are determined. In this way, the position change process of the object can be determined, and the tracking of the object is realized.
  • by determining the target key points according to the first key point information of the face and/or the second key point information of the face organs, correcting the target face image according to the target key points, and tracking the same object in the face image and the target face image based on the corrected target face image, the next frame image in the face image frame sequence where the face image is located is pre-corrected based on the key point detection results of the current frame, which improves the feasibility and accuracy of key point detection for each frame image in the sequence, and thereby the continuity and accuracy of tracking.
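  • The loop can be pictured as follows. This is a hypothetical sketch: target_network is assumed to return the key point dictionary from the earlier sketch, and correct_to_mean_pose is the correction helper sketched after the next passage.

```python
def track_faces(frames, target_network, mean_pose):
    """Sketch of step S13: detect key points on the current frame, then use
    them to pre-correct the next frame before it is fed to the network."""
    prev_kpts = None
    for frame in frames:
        if prev_kpts is not None:
            # Pre-correct the incoming frame with the previous frame's
            # target key points so large motions do not break detection.
            frame, _ = correct_to_mean_pose(frame, prev_kpts, mean_pose)
        result = target_network(frame)   # face key point information set
        prev_kpts = result["face"]       # target key points (here: first key points)
        yield result
```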
  • the correction process can be flexibly determined according to the actual situation.
  • the target face image is corrected according to at least one target key point to obtain a corrected target face image, which may include:
  • determining an affine transformation matrix according to the at least one target key point and a preset template, and correcting the target face image toward the preset template according to the affine transformation matrix to obtain the corrected target face image.
  • the preset template may be a preset mean pose, and the specific pose of the face in the preset template may be flexibly set according to the actual situation, which is not limited in the embodiments of the present disclosure.
  • the target key point can reflect the posture of the face in the face image
  • the movement of the face in the face image relative to the face in the preset template can be determined by calculating according to at least one target key point and the preset template.
• the movement can be represented in the form of an affine transformation matrix. How to calculate the affine transformation matrix can be determined according to the actual situation of the preset template and the target key points, which is not limited in the embodiments of the present disclosure.
• by moving the face in the target face image according to the determined movement, the face can be brought closer to the face pose in the preset template; that is, the face in the obtained corrected target face image is closer to the face orientation in the preset template.
• when the corrected target face image is input into the target neural network, on the one hand, it is easier to obtain the key point detection result, especially when a face image in the face image frame sequence has a large angular offset, where the correction can improve the success rate of key point detection; on the other hand, the accuracy of the obtained key point detection results can also be improved, thereby improving the tracking accuracy.
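As a non-limiting sketch of this correction step (assuming NumPy and OpenCV; the function name, the `mean_pose` template and the key point layout are illustrative assumptions rather than the implementation of the disclosure), an affine transformation matrix can be estimated from the target key points to the preset template and applied to the target face image:

```python
# A minimal sketch of the correction step. Assumptions (not fixed by the
# disclosure): OpenCV/NumPy are used, and `mean_pose` is the preset template
# given as (N, 2) key point coordinates matching the detected layout.
import cv2
import numpy as np

def correct_to_template(target_image: np.ndarray,
                        target_keypoints: np.ndarray,
                        mean_pose: np.ndarray) -> np.ndarray:
    # Estimate an affine (similarity) transformation that maps the target
    # key points detected on the previous frame onto the preset template.
    matrix, _ = cv2.estimateAffinePartial2D(
        target_keypoints.astype(np.float32),
        mean_pose.astype(np.float32))
    # Warp the target face image so its face pose approaches the template
    # before it is fed back into the target neural network.
    h, w = target_image.shape[:2]
    return cv2.warpAffine(target_image, matrix, (w, h))
```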
  • the face image in the embodiment of the present disclosure may also be a face image including key point annotations.
• the key point detection method proposed in the embodiments of the present disclosure can be implemented based on a target neural network. Therefore, in a possible implementation manner, the target neural network can be trained based on face images annotated with key points, using the methods proposed in the embodiments of the present disclosure.
  • the key point annotations may include the first key point information annotation of the face and/or the second key point information annotation of the face organs.
  • the first key point information annotation of the face may be an annotation on the actual position of the first key point information of the face in the face image
• the second key point information annotation of the face organ may be an annotation on the actual position of the second key point information of the face organ in the face image.
  • the labeling manner is not limited in the embodiments of the present disclosure.
  • the first key point information of the human face and the second key point information of the face parts in the human face image may be manually annotated.
  • a machine may also be used to automatically label the first key point information of the human face and the second key point information of the face organs in the human face image.
  • the face image may further include an annotation of the face state corresponding to the third network branch.
• when the target neural network also includes a third network branch for detecting the open and closed state of eyes in the face, the face image can be annotated with the eye open/closed state according to the actual open and closed condition of the eyes in the face image.
  • FIG. 5 shows a flowchart of a keypoint detection method according to an embodiment of the present disclosure.
  • the key point detection method proposed in the embodiment of the present disclosure may include the following steps.
• Step S11: acquire a face image.
• Step S12: use at least two neural network branches included in the target neural network to detect the face and at least one face organ in the face image, and obtain a face key point information set, where the face key point information set includes the first key point information of the face and the second key point information of the at least one face organ.
• Step S14: determine the error loss of the target neural network according to the key point annotations and the face key point information set.
• Step S15: jointly update the parameters of the at least two neural network branches in the target neural network according to the error loss.
  • steps S11 to S12 may refer to the above disclosed embodiments, which will not be repeated here.
• in step S14, the error loss can be determined according to the difference between the actual positions of the first key point information of the face and the second key point information of the face organs annotated in the face image and the corresponding predicted key points; in step S15, the parameters in the first network branch and the second network branch are jointly updated by using the error loss.
• in step S14, the specific process of determining the error loss can be flexibly determined according to the actual situation.
• each parameter in the target neural network can be updated through back-propagation according to the error loss.
• the target neural network in the embodiments of the present disclosure may include a first network branch and at least one second network branch. Therefore, in a possible implementation manner, the parameter updates of the first network branch and the at least one second network branch in the target neural network can be performed simultaneously; that is, the parameters in the first network branch and the at least one second network branch can be jointly optimized according to the outputs of both, so that the target neural network obtained after training can achieve a globally optimal effect when detecting the first key point information of the face and the second key point information of the face organs.
• the at least one third network branch may be trained together with the first network branch and the at least one second network branch; that is, the parameters of the at least one third network branch can be updated together with those of the first network branch and the at least one second network branch.
  • the at least one third network branch may also be trained independently, that is, in the process of updating the parameters of the at least one third network branch, the parameters of the first network branch and the second network branch may be fixed.
• by jointly training the first network branch and the at least one second network branch, the detection results of the first key point information of the face and of the second key point information of the face organs obtained by the trained target neural network are consistent, and both have high detection accuracy.
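To illustrate the joint update of step S15, the following PyTorch-style sketch (a toy model under assumed shapes; `TinyKeypointNet`, the feature dimension 128 and the loss choices are illustrative assumptions, not the disclosed network) places the parameters of the first network branch and all second network branches under a single optimizer, so one backward pass updates them jointly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyKeypointNet(nn.Module):
    """Stand-in target network: one first (face) branch, two second (organ) branches."""
    def __init__(self):
        super().__init__()
        self.first_branch = nn.Linear(128, 106 * 2)        # 106 face key points
        self.second_branches = nn.ModuleList([
            nn.Linear(128, 64 * 2),                        # e.g. mouth branch
            nn.Linear(128, 24 * 2)])                       # e.g. left eye branch

    def forward(self, feats):
        return self.first_branch(feats), [b(feats) for b in self.second_branches]

net = TinyKeypointNet()
# A single optimizer over *all* parameters realizes the joint update of step S15.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

feats = torch.randn(8, 128)                                # toy batch of features
face_gt = torch.randn(8, 106 * 2)                          # toy key point annotations
organ_gt = [torch.randn(8, 64 * 2), torch.randn(8, 24 * 2)]

face_pred, organ_preds = net(feats)
loss = F.mse_loss(face_pred, face_gt) + sum(
    F.mse_loss(p, g) for p, g in zip(organ_preds, organ_gt))
optimizer.zero_grad()
loss.backward()    # the error propagates into the first and second branches together
optimizer.step()   # parameters of all branches are updated jointly
```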
• the implementation of step S14 may be flexibly determined according to the actual situation.
• in a possible implementation manner, step S14 may include at least one of the following processes:
• determining the error loss of the target neural network according to the first error between the first key point information annotation of the face and the first key point information of the face predicted by the target neural network; determining the error loss of the target neural network according to the second error between the second key point information annotation of the face organs and the second key point information predicted by the target neural network; and determining the position annotation of the detection frame of at least one face organ in the face image, and determining the third error loss of the target neural network according to the third error between the detection frame information of the at least one face organ and the detection frame position annotation of the at least one face organ.
• the face image may include an annotation of the first key point information of the face, and the annotation may indicate the actual position of the first key point information of the face in the training image. Therefore, in a possible implementation manner, the error loss of the target neural network can be determined according to the first error formed between the annotated first key point information of the face and the first key point information of the face predicted by the target neural network. The specific error loss calculation method can be flexibly set according to the actual situation, which is not limited in the embodiments of the present disclosure.
• the error loss of the target neural network can also be determined according to the second error formed between the annotated second key point information of the face organs and the second key point information of the face organs predicted by the target neural network; the calculation method of this loss can also be flexibly selected according to the actual situation.
  • the first network branch in the target neural network may further determine detection frame information of at least one facial organ.
• each organ in the face can also be located, so as to calculate the position of the detection frame of each organ in the training image as the annotation of the detection frame position in the face image. Therefore, in a possible implementation manner, the third error formed between the detection frame information of each organ predicted by the target neural network and the position annotation of the detection frame of the corresponding organ can also be used to determine the error loss of the target neural network.
  • the calculation method for the position labeling of the detection frame and the calculation method for determining the error loss of the target neural network according to the third error can be flexibly selected according to the actual situation, and are not limited in the embodiments of the present disclosure.
  • the above-mentioned methods for determining the error loss of the target neural network can be combined with each other. Which one or several methods are specifically selected to jointly determine the error loss of the target neural network can also be selected flexibly according to the actual situation, which is not limited in the embodiments of the present disclosure.
• when the face image also includes face state annotations, the error loss of the target neural network can also be determined according to the error between the state detection results of the target neural network on the face image and the face state annotations.
• determining the error loss of the target neural network through the above various processes can make the training process of the target neural network more flexible and rich, so that the trained target neural network has a better key point detection effect, and the obtained first key point information of the face and second key point information of the face organs have higher consistency.
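A hedged sketch of how the above error terms might be combined into a single training loss (the dictionary keys, the L1/cross-entropy choices and the unit weights are illustrative assumptions; the disclosure does not fix a particular formula):

```python
import torch
import torch.nn.functional as F

def total_error_loss(pred: dict, gt: dict,
                     w1: float = 1.0, w2: float = 1.0,
                     w3: float = 1.0, w_state: float = 1.0) -> torch.Tensor:
    # First error: annotated vs. predicted first key point information of the face.
    loss = w1 * F.l1_loss(pred["face_kpts"], gt["face_kpts"])
    # Second error: annotated vs. predicted second key point information of the organs.
    loss = loss + w2 * F.l1_loss(pred["organ_kpts"], gt["organ_kpts"])
    # Third error: predicted organ detection frames vs. their position annotations.
    loss = loss + w3 * F.l1_loss(pred["organ_boxes"], gt["organ_boxes"])
    # Optional face state term (e.g. eye open/closed) when state annotations exist;
    # gt["face_state"] is assumed to hold class indices.
    if "face_state" in gt:
        loss = loss + w_state * F.cross_entropy(pred["face_state"], gt["face_state"])
    return loss
```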
  • FIG. 6 shows a block diagram of a keypoint detection apparatus according to an embodiment of the present disclosure.
  • the key point detection device 20 may include:
  • the image acquisition module 21 is used for acquiring a face image.
• the key point detection module 22 is used to detect the face and at least one face organ in the face image by using at least two neural network branches included in the target neural network, to obtain a face key point information set, where the face key point information set includes the first key point information of the face and the second key point information of the at least one face organ.
  • the at least two neural network branches include a first network branch for detecting human faces, and at least one second network branch for detecting at least one face part;
• the key point detection module is used to: detect the face through the first network branch to obtain a first detection result, where the first detection result includes the first key point information of the face and the detection frame information of at least one face organ; and detect at least one face organ based on the first detection result and the at least one second network branch to obtain a second detection result, where the second detection result includes the second key point information of the at least one face organ.
• the key point detection module is further configured to: for each second network branch in the at least one second network branch, use the second network branch to detect the face organ corresponding to the second network branch, to obtain a second detection result of the face organ corresponding to the second network branch.
• when detecting the face organ corresponding to the second network branch, the key point detection module is used to: extract, from the face image, the face organ region matching the detection frame information of the face organ; extract second feature information of the face organ region; and determine the second key point information of the face organ based on the second feature information and the first detection result.
  • the first detection result further includes first feature information of the face image.
• the key point detection module is used to: extract, from the first feature information, the initial feature information of the face organ region matching the detection frame information of the face organ; perform deep feature extraction on the initial feature information to obtain second feature information of the face organ region; and determine the second key point information of the face organ based on the second feature information and the first detection result.
• when determining the second key point information of the face organ based on the second feature information and the first detection result, the key point detection module is further configured to: fuse the second feature information with the first feature information and/or the first key point information at least once to obtain fused feature information; and obtain the second key point information according to the fused feature information.
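The following sketch shows one possible shape of such a second network branch (assuming PyTorch and torchvision's `roi_align`; the layer sizes, the 32×32 crop and the concatenation-based fusion are illustrative assumptions): it crops the organ region at floating-point precision, extracts second feature information, fuses it with the face's first key point information, and regresses the organ's second key points.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class OrganBranch(nn.Module):
    """Illustrative second network branch: ROI Align crop, feature extraction,
    fusion with the 106 face key points, then organ key point regression."""
    def __init__(self, num_kpts: int, face_kpt_dim: int = 106 * 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())       # -> (B, 16 * 4 * 4)
        self.head = nn.Linear(16 * 4 * 4 + face_kpt_dim, num_kpts * 2)

    def forward(self, image, boxes, face_kpts):
        # boxes: (B, 5) rows of [batch_index, x1, y1, x2, y2] in float coords,
        # so the crop keeps sub-pixel (floating point) precision.
        region = roi_align(image, boxes, output_size=(32, 32))
        second_feat = self.features(region)               # second feature information
        fused = torch.cat([second_feat, face_kpts], dim=1)  # fusion by concatenation
        return self.head(fused)                           # second key point information

branch = OrganBranch(num_kpts=64)                         # e.g. the mouth branch
img = torch.randn(1, 3, 128, 128)
box = torch.tensor([[0.0, 30.0, 70.0, 98.0, 110.0]])
out = branch(img, box, torch.randn(1, 106 * 2))           # -> shape (1, 128)
```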
• the key point detection module is further configured to: before detecting at least one face organ based on the first detection result and the at least one second network branch to obtain the second detection result, perform enhancement processing on the detection frame information of the at least one face organ, where the enhancement processing includes scaling transformation processing and/or translation transformation processing.
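A minimal sketch of such enhancement processing (the scale range 0.9–1.1 and the translation of ±0.05 of the box side follow illustrative values mentioned in the disclosed embodiments; the function itself is an assumption):

```python
import numpy as np

def enhance_box(box, scale_range=(0.9, 1.1), shift_frac=0.05, rng=None):
    """Randomly rescale and translate a detection frame [x1, y1, x2, y2]."""
    rng = rng or np.random.default_rng()
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1), (y2 - y1)
    s = rng.uniform(*scale_range)                    # scaling transformation
    dx = rng.uniform(-shift_frac, shift_frac) * w    # translation transformation
    dy = rng.uniform(-shift_frac, shift_frac) * h
    w, h = w * s, h * s
    cx, cy = cx + dx, cy + dy
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```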
• the apparatus is further configured to: acquire, from the second key point information of the at least one face organ, the second key points of the face organs that meet a preset precision; and replace the first key points of the face at positions corresponding to the face organs meeting the preset precision with those second key points, to obtain updated first key point information of the face.
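As an illustration of this replacement (the `index_map` correspondence between organ key point indices and face key point indices, and the per-point precision flags, are hypothetical inputs):

```python
def replace_with_organ_kpts(face_kpts, organ_kpts, index_map, precision_ok):
    """Replace face key points with organ key points that meet the preset precision.

    index_map:    maps an organ key point index to the face key point index at
                  the same semantic position (an illustrative assumption).
    precision_ok: boolean per organ key point, True if it meets the preset precision.
    """
    updated = list(face_kpts)
    for organ_idx, face_idx in index_map.items():
        if precision_ok[organ_idx]:
            updated[face_idx] = organ_kpts[organ_idx]  # substitute the higher-precision point
    return updated
```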
  • the apparatus is further configured to perform position transformation on the first key point information of the human face and/or the second key point information of each human face organ.
  • the target neural network further includes at least one third network branch, and the at least one third network branch is used to detect the face state according to the first key point information.
  • the apparatus is further configured to: track the face in the face image frame sequence where the face image is located according to the face key point information set.
  • the apparatus is further configured to: use the face image to correct the next frame of the face image in the face image frame sequence.
  • the first key point information includes 68 to 128 first key points.
• the second key point information includes 40 to 80 key points of the mouth, 16 to 32 key points of the left eye, 16 to 32 key points of the right eye, 10 to 20 key points of the left eyebrow, and/or 10 to 20 key points of the right eyebrow.
• the face image includes key point annotations; the apparatus is further configured to: determine the error loss of the target neural network according to the key point annotations and the face key point information set, and jointly update the parameters of the at least two neural network branches in the target neural network according to the error loss.
  • the application example of the present disclosure proposes a key point detection method, which can perform key point detection on a face image.
• FIGS. 3 and 7 respectively show schematic diagrams of a key point detection method according to an application example of the present disclosure, where FIG. 3 is a schematic diagram of the application process of the key point detection method, and FIG. 7 is a schematic diagram of the training process of the key point detection method.
  • the key point detection method may include the following processes.
• after the acquired face image is input into the target neural network, it is processed through the first network branch and five second network branches in the target neural network respectively.
  • the first network branch includes a shallow feature extraction module and a main module connected in sequence.
  • the shallow feature extraction module is based on the shallow feature extraction network structure block 0 described in the above disclosed embodiments, and performs preliminary extraction of the first feature information of the face image.
• the main module, like the deep feature extraction network structure block main described in the above disclosed embodiments, further extracts and regresses the first feature information of the face image.
• after the first network branch processes the face image, it can output the first key point information of 106 face key points (that is, the 106 overall face key points in the figure) and the detection frame information of each face organ.
  • the five second network branches are independent of each other, and respectively perform second key point detection on the mouth, left eye, right eye, left eyebrow and right eyebrow in the human face.
• the second network branch of the mouth includes a region of interest calibration layer (ROI Align) and a mouth feature extraction module connected in sequence.
  • the region of interest calibration layer can cut the face image according to the mouth detection frame information output by the first network branch, so as to obtain a mouth region conforming to the preset image size.
• the mouth feature extraction module may include one or more network layers for feature extraction, and may perform feature extraction on the mouth region to obtain second feature information of the mouth region. It can be seen from FIG. 3 that, in an example, the second feature information of the mouth region can be fused with the first key point information of the 106 face key points output by the first network branch to obtain fused feature information. The fused feature information is regressed through the second network branch of the mouth, and the second key point information of the mouth can be output.
• that is, the second network branch of the mouth extracts the mouth region from the face image according to the input mouth detection frame information, fuses the second feature information of the mouth region with the first key point information of the face to obtain fused feature information, and can output the second key point information of 64 mouth key points based on the fused feature information.
• the second key point information of the 64 mouth key points and the first key point information of the 106 face key points can be unified into the same position coordinate system by means of the position conversion mentioned in the above disclosed embodiments.
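A minimal sketch of such a position conversion, assuming the organ key points were predicted on a crop resized to a fixed size (the 32×32 default is an assumption) and the detection frame is axis-aligned:

```python
import numpy as np

def organ_to_image_coords(organ_kpts, box, crop_size=(32, 32)):
    """Map organ key points from crop (ROI) coordinates back to face image
    coordinates, so all key points share one coordinate system.

    organ_kpts: (N, 2) points in the cropped organ region.
    box:        [x1, y1, x2, y2] detection frame in the face image.
    crop_size:  (width, height) the region was resized to (assumed value).
    """
    x1, y1, x2, y2 = box
    sx = (x2 - x1) / crop_size[0]
    sy = (y2 - y1) / crop_size[1]
    pts = np.asarray(organ_kpts, dtype=np.float32)
    return pts * np.array([sx, sy]) + np.array([x1, y1])
```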
• the second network branch of the left eye can output the second key point information of 24 left eye key points, and the second network branch of the right eye can output the second key point information of 24 right eye key points.
  • the implementation form of the second network branch of the left eyebrow is similar to the second network branch of the mouth described above.
• the second network branch of the left eyebrow can use the region of interest calibration layer to extract, based on the detection frame information of the left eyebrow, the initial feature information of the left eyebrow region from the first feature information of the face output by block 0 of the first network branch, and then perform deep feature extraction on the initial feature information to obtain the second feature information of the left eyebrow region.
  • the rest of the process is the same as the second network branch of the mouth.
• for the second network branch of the right eyebrow, reference may be made to the left eyebrow, which will not be repeated here.
• the second network branch of the left eyebrow can output the second key point information of 13 left eyebrow key points, and the second network branch of the right eyebrow can output the second key point information of 13 right eyebrow key points.
• in order to align the accuracy of the second key point information of the face organs with that of the first key point information, in addition to outputting the 64 second key points of the mouth, the 24 second key points of the left eye, the 24 second key points of the right eye, the 13 second key points of the left eyebrow, and the 13 second key points of the right eyebrow, the second network branches can also output second key points corresponding in position to first key points of the face; these second key points of the face organs can be substituted into the 106 first key points of the face output by the first network branch, to obtain the final 106 key point information.
  • the target neural network in the application example of the present disclosure may also include at least one third network branch to detect the state of the human face in the face image.
• a single target neural network can be used to obtain the detection results of the first key point information of the face and the second key point information of each face organ at the same time, and the region of interest calibration layer ROI Align can be used to extract the face organ regions. This not only saves processing time in the whole process and reduces the total time consumption of key point detection, but also improves the accuracy of the obtained face organ regions and thus the accuracy of the detected second key point information of the face organs. At the same time, in each second network branch, the second feature information of the face organ region can be fused with the first key point information of the face output by the first network branch to obtain fused feature information, and the second key point information of the face organ is output according to the fused feature information. Through the above process, the second key point information of the face organs can be made more unified with the first key point information of the face, thereby improving the accuracy of key point detection.
• the method proposed by the application example of the present disclosure can also be used in the training process of the target neural network.
• the process of training the target neural network is basically the same as the above-mentioned application process; the difference is that, in the training process, the face image contains the ground-truth annotation of each key point, and the detection frame information output by the first network branch is input to each second network branch after undergoing enhancement processing.
  • the enhancement processing method reference may be made to the above disclosed embodiments, and details are not described herein again.
• the positions of the detection frames of the left eyebrow and the right eyebrow can be calculated according to the ground-truth annotation of each key point in the training image, to obtain the ground-truth detection frames of the left eyebrow and the right eyebrow (i.e., the detection frame (true value) in the figure), which are input into the corresponding second network branches.
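A sketch of how a ground-truth detection frame could be derived from the key point annotations (the bounding-box-plus-margin construction is an illustrative assumption; the disclosure does not specify the exact computation):

```python
import numpy as np

def box_from_kpt_annotation(kpts, margin=0.1):
    """Compute a ground-truth detection frame from the annotated key points of
    one organ (e.g. an eyebrow); the `margin` padding is an illustrative choice."""
    pts = np.asarray(kpts, dtype=np.float32)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    pad_x, pad_y = margin * (x2 - x1), margin * (y2 - y1)
    return [x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y]
```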
  • the training image may also include ground truth annotations of the face states detected by the third network branch.
  • the first network branch and each second network branch can be trained at the same time, and jointly perform parameter optimization, so as to achieve the global optimum.
  • end-to-end global optimization of the entire target neural network can be achieved, thereby improving the keypoint detection accuracy of the target neural network.
  • the key point detection method proposed in the application example of the present disclosure can not only be applied to the key point detection of face images, but also can be extended to the processing of other images, such as human images, bone images, and the like.
• the writing order of each step does not imply a strict execution order, nor does it constitute any limitation on the implementation process; the specific execution order of each step should be determined based on its function and possible internal logic.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the foregoing method is implemented.
  • the computer-readable storage medium may be a volatile computer-readable storage medium or a non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method.
• the above-mentioned memory can be a volatile memory, such as a RAM; or a non-volatile memory, such as a ROM, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or a combination of the above types of memory, and it provides instructions and data to the processor.
  • the above-mentioned processor may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It can be understood that, for different devices, other electronic devices may also be used to implement the functions of the foregoing processors, which are not specifically limited in the embodiments of the present disclosure.
  • the electronic device may be provided as a terminal, server or other form of device.
  • an embodiment of the present disclosure further provides a computer program, which implements the above method when the computer program is executed by a processor.
  • FIG. 8 is a block diagram of an electronic device 800 according to an embodiment of the present disclosure.
• the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
  • an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814 , and the communication component 816 .
  • the processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 802 can include one or more processors 820 to execute instructions to perform all or some of the steps of the methods described above.
  • processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components.
  • processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
  • Memory 804 is configured to store various types of data to support operation at electronic device 800 . Examples of such data include instructions for any application or method operating on electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. Memory 804 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • Power supply assembly 806 provides power to various components of electronic device 800 .
  • Power supply components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 800 .
  • Multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action.
  • multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 810 is configured to output and/or input audio signals.
  • audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when electronic device 800 is in operating modes, such as calling mode, recording mode, and voice recognition mode.
  • the received audio signal may be further stored in memory 804 or transmitted via communication component 816 .
  • audio component 810 also includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.
  • Sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of electronic device 800 .
• the sensor assembly 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 can also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and changes in the temperature of the electronic device 800.
  • Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 816 is configured to facilitate wired or wireless communication between electronic device 800 and other devices.
  • Electronic device 800 may access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
• the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
• the electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
  • a non-volatile computer-readable storage medium such as a memory 804 comprising computer program instructions executable by the processor 820 of the electronic device 800 to perform the above method is also provided.
  • FIG. 9 is a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
  • the electronic device 1900 may be provided as a server.
  • electronic device 1900 includes processing component 1922, which further includes one or more processors, and a memory resource represented by memory 1932 for storing instructions executable by processing component 1922, such as applications.
  • An application program stored in memory 1932 may include one or more modules, each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above-described methods.
  • the electronic device 1900 may also include a power supply assembly 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input output (I/O) interface 1958 .
• the electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
  • a non-volatile computer-readable storage medium such as memory 1932 comprising computer program instructions executable by processing component 1922 of electronic device 1900 to perform the above-described method.
  • the present disclosure may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
• a non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards with instructions stored thereon or raised structures in grooves, and any suitable combination of the above.
• computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
• computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages.
• the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
• electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing the state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present disclosure.
• these computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
• these computer-readable program instructions can also be stored in a computer-readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer-readable medium on which the instructions are stored includes an article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
• computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
• each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
• each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

本公开涉及一种关键点检测方法及装置、电子设备和存储介质。所述方法包括:获取人脸图像;利用目标神经网络包括的至少两个神经网络分支,对人脸图像中人脸以及至少一个人脸器官进行检测,得到人脸关键点信息集合,人脸关键点信息集合包括该人脸的第一关键点信息以及该至少一个人脸器官的第二关键点信息。

Description

关键点检测方法及装置、电子设备和存储介质
交叉引用声明
本发明要求于2020年12月29日提交中国专利局的申请号为202011596380.1的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及计算机视觉领域,尤其涉及一种关键点检测方法及装置、电子设备和存储介质。
背景技术
人脸关键点检测是诸多人脸相关应用的基础,可以为人脸识别等技术提供位置矫正,也为增强现实、美妆特效等场景提供人脸的语义信息。因此如何检测人脸关键点,成为目前一个亟待解决的问题。
相关方法中,会在获取整脸关键点以后,再基于整脸关键点,通过单独的模型来获取具有更高精度的人脸器官关键点,从而提升关键点检测的精度。然而,这种分别进行关键点获取的过程不仅繁琐,也容易使得得到的关键点精度较低。
发明内容
本公开提出了一种关键点检测方案。
根据本公开的一方面,提供了一种关键点检测方法,包括:
获取人脸图像;利用目标神经网络包括的至少两个神经网络分支,对所述人脸图像中人脸以及至少一个人脸器官进行检测,得到人脸关键点信息集合,所述人脸关键点信息集合包括所述人脸的第一关键点信息以及所述至少一个人脸器官的第二关键点信息。
根据本公开的一方面,提供了一种关键点检测装置,包括:
图像获取模块,用于获取人脸图像;关键点检测模块,用于利用目标神经网络包括的至少两个神经网络分支,对所述人脸图像中人脸以及至少一个人脸器官进行检测,得到人脸关键点信息集合,所述人脸关键点信息集合包括所述人脸的第一关键点信息以及所述至少一个人脸器官的第二关键点信息。
根据本公开的一方面,提供了一种电子设备,包括:
处理器;用于存储处理器可执行指令的存储器;其中,所述处理器被配置为:执行上述关键点检测方法。
根据本公开的一方面,提供了一种计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述关键点检测方法。
在本公开实施例中,通过获取人脸图像,并利用目标神经网络包括的至少两个神经网络分支,对人脸图像中人脸以及至少一个人脸器官进行检测,来得到包括人脸的第一关键点信息以及至少一个人脸器官的第二关键点信息的人脸关键点信息集合。通过上述过程,可以通过目标神经网络对人脸图像进行端到端的处理,从而同时得到具有较高精度的人脸的第一关键点信息以及人脸器官的第二关键点信息,提升了关键点检测的便捷性与精度。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1示出根据本公开一实施例的关键点检测方法的流程图。
图2示出根据本公开一实施例的关键点检测方法的流程图。
图3示出根据本公开一应用示例的关键点检测方法示意图。
图4示出根据本公开一应用示例的关键点检测方法示意图。
图5示出根据本公开一实施例的关键点检测方法的流程图。
图6示出根据本公开一实施例的关键点检测装置的框图。
图7示出根据本公开一应用示例的关键点检测方法示意图。
图8示出根据本公开实施例的一种电子设备的框图。
图9示出根据本公开实施例的一种电子设备的框图。
具体实施方式
以下将参考附图详细说明本公开的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。
另外,为了更好地说明本公开,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本公开同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本公开的主旨。
图1示出根据本公开一实施例的关键点检测方法的流程图,该方法可以应用于关键点检测装置,关键点检测装置可以为终端设备、服务器或者其他处理设备等。其中,终端设备可以为用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等。在一个示例中,该关键点检测方法可以应用于云端服务器或本地服务器,云端服务器可以为公有云服务器,也可以为私有云服务器,根据实际情况灵活选择即可。
在一些可能的实现方式中,该关键点检测方法也可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。
如图1所示,在一种可能的实现方式中,所述关键点检测方法可以包括:
步骤S11,获取人脸图像。
步骤S12,利用目标神经网络包括的至少两个神经网络分支,对人脸图像中人脸以及至少一个人脸器官进行检测,得到人脸关键点信息集合,人脸关键点信息集合包括人脸的第一关键点信息以及至少一个人脸器官的第二关键点信息。
人脸图像可以是包含有人脸的图像帧,实现形式可以根据实际情况灵活决定,在本公开实施例中不做限制。人脸图像中包含的人脸数量在本公开实施例中也不做限制,可以包含某个对象的人脸,也可以同时包含多个对象的人脸。在人脸图像包含多个人脸的情况下,通过本公开实施例中提出的关键点检测方法,可以分别得到每个对象的人脸对应的人脸的第一关键点信息和/或至少一个人脸器官的第二关键点信息等。后续各公开实 施例均以人脸图像包括单个对象的人脸为例进行说明。人脸图像中包含多个对象的人脸情况,可以参考后续各公开实施例进行灵活扩展,不再一一列举赘述。
人脸图像的获取方式在本公开实施例中也不做限定,比如可以对存储有人脸图像的数据库进行读取来得到人脸图像,或是在某些场景下对人脸图像进行采集来得到人脸图像,亦或是从包含有人脸的视频中进行片段截取或采样等方式来得到人脸图像等,具体如何获取可以根据实际情况灵活决定。
在一种可能的实现方式中,人脸图像可以是从人脸图像帧序列中获取的图像。人脸图像帧序列可以是包含多帧人脸图像的帧序列,其实现方式在本公开实施例中不做限制。从人脸图像帧序列中获取人脸图像的方式在本公开实施例中也不做限定,可以是从人脸图像帧序列中随机进行采样来获取人脸图像,也可以是根据预设要求从人脸图像帧序列中选定人脸图像。
步骤S11中获取的人脸图像的数量在本公开实施例中不做限定,可以根据实际情况灵活选择。后续各公开实施例均以获取单个人脸图像为例进行说明,获取多个人脸图像后的处理方式,可以参考后续各公开实施例进行灵活扩展,在本公开实施例中不再赘述。
目标神经网络可以是对人脸图像进行处理的神经网络,其实现形式可以根据实际情况灵活决定。如上述公开实施例所述,在一种可能的实现方式中,目标神经网络可以包括至少两个神经网络分支。其中,目标神经网络包含的神经网络分支的数量、各神经网络分支的实现形式以及连接关系等,均可以根据实际情况灵活决定,详见下述各公开实施例,在此先不做展开。
通过上述公开实施例还可以看出,利用目标神经网络包括的至少两个神经网络分支,可以对人脸图像中人脸以及至少一个人脸器官进行处理,来得到人脸关键点信息集合。其中,人脸关键点信息集合可以是人脸图像中包含的人脸关键点的相关信息,包含哪些信息内容可以根据实际情况灵活决定。在一种可能的实现方式中,人脸关键点信息集合可以包括人脸的第一关键点信息以及至少一个人脸器官的第二关键点信息。
其中,人脸的第一关键点信息可以包括:对人脸中各部位或器官进行定位,从而确定人脸整体情况的关键点,比如人脸中眼睛、嘴巴、鼻子或是眉毛等器官上的关键点,以及人脸面颊、额头或下巴等部位上的关键点等。
人脸的第一关键点信息中第一关键点的数量以及具体包含人脸中的哪些关键点,这些情况均可以根据实际情况进行灵活设定,在本公开实施例中不做限制。在一种可能的实现方式中,人脸的第一关键点的数量可以在68至128个这一区间内,设定的具体数量可以根据实际情况灵活选择。在一个示例中,可以设定人脸的第一关键点的数量为106个,记为Face 106。在这种情况下,目标神经网络中的神经网络分支可以对输入的人脸图像进行人脸的第一关键点信息检测,从而输出106个人脸的第一关键点的信息。
人脸器官的第二关键点信息可以包括:人脸中各部位或器官中所包含的,可以用于确定人脸局部部位或器官情况的关键点,比如人脸中眼睛、嘴巴、鼻子或是眉毛等器官中所包含的关键点等。相比于上述公开实施例中提到的人脸的第一关键点信息来说,人脸器官的第二关键点信息可能与人脸的第一关键点信息描述同一人脸上的部位或器官,但是由于人脸的第一关键点信息用于确定人脸的整体情况,而人脸器官的第二关键点信息用于确定该部位或器官的局部情况,因此人脸器官的第二关键点信息相对于描述同一部位或器官的人脸的第一关键点信息来说,关键点的数量可能更多,在相应部位或器官上的分布也可能更加密集。
在一种可能的实现方式中,人脸器官的第二关键点的数量以及具体包含人脸中哪些器官或部位的关键点,这些情况均可以根据实际情况进行灵活设定,在本公开实施例中不做限制。在一种可能的实现方式中,嘴部的关键点的数量可以在40至80个这一区间内,左眼的关键点数量可以在16至32个这一区间内,右眼的关键点数量也可以在16 至32个这一区间内,左眉毛的关键点的数量可以在10至20个这一区间内,右眉毛的关键点数量也可以在10至20个这一区间内,上述各器官的关键点设定的具体数量可以根据实际情况灵活选择。在一个示例中,可以设定人脸中嘴部的关键点数量为64个,记为mouth 64,左眼和右眼的关键点数量各为24个,均记为eye 24,左眉毛和右眉毛的关键点数量各为13个,均记为eyebrow 13。
在本公开实施例中,通过获取人脸图像,并利用目标神经网络包括的至少两个神经网络分支,对人脸图像中人脸以及至少一个人脸器官进行处理,来得到包括人脸的第一关键点信息以及人脸器官的第二关键点信息的人脸关键点信息集合。通过上述过程,可以通过目标神经网络对人脸图像进行端到端的处理,从而同时得到具有较高精度的人脸的第一关键点信息以及人脸器官的第二关键点信息,提升了关键点检测的便捷性与精度。
如上述各公开实施例所述,目标神经网络中包含的神经网络分支可以根据实际情况灵活决定。在一种可能的实现方式中,至少两个神经网络分支可以包括用于检测人脸的第一网络分支,以及,用于检测至少一个人脸器官的至少一个第二网络分支。
图2示出根据本公开一实施例的关键点检测方法的流程图。如图2所示,在一种可能的实现方式中,步骤S12可以包括:
步骤S121,通过第一网络分支对人脸进行检测,得到第一检测结果,第一检测结果包括人脸的第一关键点信息以及至少一个人脸器官的检测框信息;
步骤S122,基于第一检测结果,通过至少一个第二网络分支对至少一个人脸器官进行检测,得到第二检测结果,第二检测结果包括至少一个人脸器官的第二关键点信息。
其中,第一网络分支可以是对人脸图像中的人脸进行检测,从而得到人脸的第一关键点信息的网络结构,其实现形式可以根据实际情况灵活决定,详见下述各公开实施例,在此先不做展开。
第一检测结果可以是第一网络分支对人脸图像中的人脸进行检测,所得到的检测结果。在一种可能的实现方式中,第一检测结果可以包括人脸的第一关键点信息以及至少一个人脸器官的检测框信息。在一些可能的实现方式中,第一检测结果还可以包括其他的信息,比如还可以包括对人脸图像中的人脸进行特征提取所得到的人脸图像的第一特征信息,该人脸图像的第一特征信息可以反映人脸图像中人脸的整体特征,人脸图像的第一特征信息的具体实现形式可以详见下述各公开实施例,在此先不做展开。
第二网络分支可以是对人脸图像中人脸的各器官进行检测,从而得到人脸器官的第二关键点信息的网络结构,其实现形式同样可以根据实际情况灵活决定,详见下述各公开实施例,在此先不做展开。
第二检测结果可以是至少一个第二网络分支基于第一检测结果,对人脸图像中人脸的至少一个人脸器官进行检测,所得到的检测结果。在一种可能的实现方式中,第二检测结果可以包括至少一个人脸器官的第二关键点信息。由于第一检测结果是基于人脸图像中的人脸进行检测所得到的反映人脸整体情况的检测结果,基于第一检测结果,来确定第二检测结果中的人脸器官的第二关键点信息,可以使人脸器官的第二关键点信息与第一检测结果中人脸的第一关键点信息更为统一,从而使得人脸关键点信息集合更加精确。至少一个第二网络分支如何确定至少一个人脸器官的第二关键点信息,其检测的过程可以详见下述各公开实施例,在此先不做展开。
通过步骤S121和步骤S122所述的过程,可以利用目标神经网络同时得到人脸的第一关键点信息和至少一个人脸器官的第二关键点信息,一方面可以实现端到端的人脸关键点识别,通过目标神经网络提升了关键点识别的速度和效率,也可以提升后续利用关键点进行跟踪等处理的效率;另一方面由于各人脸器官的第二关键点信息可以基于第一检测结果来确定,因此可以利用反映人脸整体情况的第一检测结果对各人脸器官的第二关键点信息进行定位,从而使得得到的人脸关键点信息集合在人脸整体和局部器官中的 位置较为统一,且均具有较高的精度,有效提升了识别出的关键点的准确度。
在一种可能的实现方式中,步骤S121可以包括:
根据第一网络分支的输出,得到人脸的第一关键点信息以及至少一个人脸器官的检测框信息。
在一些实施例中,步骤S121还可以包括:根据第一网络分支中至少一个网络层输出的中间特征,得到人脸图像的第一特征信息。
其中,由于第一网络分支可以是对人脸图像进行人脸的第一关键点信息检测的网络结构,因此,在将人脸图像作为输入的情况下,在一种可能的实现方式中,可以根据第一网络分支的输出,来得到人脸的第一关键点信息。
至少一个人脸器官的检测框信息,可以是表示人脸图像中各器官位置的检测框。具体表达人脸图像中哪些器官的位置,可以根据实际情况灵活设定,在本公开实施例中不做限制。检测框的形状以及实现形式均可以根据实际情况灵活决定,不同器官的检测框的形状与实现形式可以相同,也可以不同。在一种可能的实现方式中,各器官的检测框可以均为矩形框,并可以通过矩形框中顶点的坐标来体现各检测框的位置。在一种可能的实现方式中,目标神经网络的第一网络分支在输出人脸的第一关键点信息的同时,也可以同时输出至少一个人脸器官的检测框信息。
人脸图像的第一特征信息可以是第一网络分支根据人脸图像中的人脸的整体所确定的,反映人脸整体情况的特征信息。该人脸图像的第一特征信息具体如何获取,以及包含哪些特征数据,其实现方式均可以根据实际情况灵活决定,不局限于下述各公开实施例。
如上述各公开实施例所述,第一网络分支的实现形式可以根据实际情况灵活决定。在一种可能的实现方式中,第一网络分支可以通过多个连接的网络层所构成,这些网络层的具体实现形式以及组合形式可以根据实际情况灵活决定,不局限于下述各公开实施例。
在一种可能的实现方式中,第一网络分支可以包括依次连接的浅层特征提取网络结构block 0和深层特征提取网络结构block main,block 0和block main中可以分别包含多个如卷积层、池化层或是分类层等形式的网络层。通过block 0,第一网络分支可以对人脸图像的特征信息进行初步提取,得到的初步提取特征可以输入至block main中进行人脸图像的特征信息的进一步提取,并基于进一步提取的特征信息进行回归,从而得到第一网络分支的输出,来实现人脸的第一关键点信息的检测。相应的,在一个示例中,block 0提取的初步提取特征可以作为人脸图像的第一特征信息;在一个示例中,block main提取的深层特征可以作为人脸图像的第一特征信息;在一个示例中,block 0提取的初步提取特征以及block main进一步提取的深层特征均可以作为人脸图像的第一特征信息,具体选择哪些特征作为人脸图像的第一特征信息,可以根据实际情况灵活决定,在本公开实施例中不做限制。
需要注意的是,在本公开实施例中,对以上步骤的实现顺序不做限定,可以根据实际情况灵活决定。在一种可能的实现方式中,可以同时得到人脸的第一关键点信息、至少一个人脸器官的检测框以及人脸的整体特征,也可以按照某种顺序分别得到人脸的第一关键点信息、至少一个人脸器官的检测框以及人脸的整体特征等。
步骤S122的实现形式可以根据实际情况灵活决定。在一种可能的实现方式中,步骤S122可以包括:对于至少一个第二网络分支中的每一第二网络分支,利用该第二网络分支对该第二网络分支对应的人脸器官进行检测,得到该第二网络分支对应的人脸器官的第二检测结果。
通过上述公开实施例可以看出,在一种可能的实现方式中,人脸中不同器官所包含的人脸器官的第二关键点信息,可以通过目标神经网络包括的多个第二网络分支分别来 进行检测,比如嘴部的第二关键点信息可以通过嘴部的第二网络分支进行检测,眼睛的第二关键点信息可以通过眼睛的第二网络分支进行检测等。因此在一种可能的实现方式中,目标神经网络包含的第二网络分支的数量以及第二网络分支用于检测的器官,其实现形式均可以根据实际情况灵活决定。在目标神经网络包括多个第二网络分支的情况下,各第二网络分支相互独立,即各第二网络分支执行检测的顺序以及过程等均不受其他第二网络分支的检测过程或数据的干扰。在一些可能的实现方式中,各第二网络分支的结构可以相同,也可以不同,在本公开实施例中不做限制。在一些可能的实现方式中,各第二网络分支中的检测过程可以相同,也可以不同,在本公开实施例中同样不做限制。
通过利用至少一个第二网络分支中的每一第二网络分支,对该第二网络分支对应的人脸器官进行检测,得到该第二网络分支对应的人脸器官的第二检测结果。通过上述过程,可以利用多个第二网络分支,对人脸图像中人脸的多个器官分别独立地进行人脸器官检测,提高关键点检测的效率和灵活性。
在一种可能的实现方式中,对每一第二网络分支对应的人脸器官进行检测,得到该第二网络分支对应的人脸器官的第二检测结果可以包括:
从人脸图像中提取出与该第二网络分支对应的人脸器官的检测框信息匹配的人脸器官区域;
提取人脸器官区域的第二特征信息;
基于第二特征信息以及第一检测结果,确定人脸器官的第二关键点信息。
人脸器官区域可以是人脸图像中包括人脸器官的区域,如上述各公开实施例所述。人脸器官的检测框信息,可以是表示人脸图像中各器官位置的检测框。在一种可能的实现方式中,基于人脸器官的检测框信息,可以确定人脸器官在人脸图像中的位置,因此,可以根据人脸器官的检测框信息,从人脸图像中提取出匹配的人脸器官区域。
提取人脸器官区域的方式在本公开实施例中不做限制,在一种可能的实现方式中,可以根据人脸器官的检测框信息所确定的位置,对人脸图像进行剪切,来得到人脸器官区域。举例来说,在一个示例中,可以利用嘴部器官的检测框,在进行嘴部关键点检测的第二网络分支中,对人脸图像进行剪切,得到嘴部的人脸器官区域。其余器官如眼睛或是眉毛等的剪切形式可以参考上述公开实施例,在此不再赘述。剪切的方式可以根据实际情况灵活决定,详见下述各公开实施例,在此先不做展开。
在得到人脸器官区域以后,第二网络分支可以进一步提取人脸器官区域的第二特征信息。其中,人脸器官区域的第二特征信息可以是反映人脸器官局部情况的特征信息,其提取方式可以根据实际情况灵活决定,不局限于下述各公开实施例。在一种可能的实现方式中,可以通过第二网络分支中具有特征提取功能的网络层,对人脸器官区域进行浅层特征提取和/或深层特征提取,并将提取得到的人脸器官区域的浅层特征和/或深层特征作为人脸器官区域的第二特征信息等。
在得到人脸器官区域的第二特征信息以后,可以对人脸器官区域的第二特征信息进行处理,来得到人脸器官的第二关键点信息。对人脸器官区域的第二特征信息进行处理的方式在本公开实施例中不做限定。在一种可能的实现方式中,可以通过第二网络分支中具有关键点计算或识别功能的网络层,对人脸器官区域的第二特征信息进行回归计算,得到人脸器官的第二关键点信息。在一种可能的实现方式中,也可以通过第二网络分支中具有关键点计算或识别功能的网络层,对人脸器官区域的第二特征信息,结合第一检测结果中包含的部分或全部信息来进行回归计算,从而确定人脸器官的第二关键点信息。具体如何根据人脸器官区域的第二特征信息以及第一检测结果来确定人脸器官的第二关键点信息,其实现形式可以详见下述各公开实施例,在此先不做展开。
通过从人脸图像中提取出与每个人脸器官的检测框信息匹配的人脸器官区域,并提取人脸器官区域的第二特征信息,从而基于人脸器官区域的第二特征信息以及第一检测 结果,确定人脸器官的第二关键点信息,可以基于提取的人脸器官区域来进行人脸器官的第二关键点信息的检测,有效提升人脸器官的第二关键点的检测精度,从而提升关键点检测方法的整体检测精度。而且,提取的人脸器官区域与第一检测结果包括的人脸器官的检测框信息匹配,因此得到的人脸器官的第二关键点信息与第一网络分支输出的人脸的第一关键点信息具有关联性,进一步提升了目标神经网络对人脸关键点检测的整体准确性。同时,人脸器官的第二关键点信息基于人脸器官区域的第二特征信息以及第一检测结果所确定,由于第一检测结果是对人脸图像中的人脸进行检测所得到的结果,因此基于第一检测结果可以对人脸图像中各器官的位置进行相应的定位参考,使得确定的人脸器官的第二关键点信息可以具有更高的精度,且与人脸的第一关键点信息的位置可以更加统一,因此更加提升关键点的检测精度。
如上述各公开实施例所述,第一检测结果可以包括人脸图像的第一特征信息,因此,在一种可能的实现方式中,对每一第二网络分支对应的人脸器官进行检测,得到该第二网络分支对应的人脸器官的第二检测结果也可以包括:
从人脸图像的第一特征信息中提取出与该第二网络分支对应的人脸器官的检测框信息匹配的人脸器官区域的初始特征信息;
对初始特征信息进行深层特征提取,得到人脸器官区域的第二特征信息;
基于人脸器官区域的第二特征信息以及第一检测结果,确定人脸器官的第二关键点信息。
人脸器官区域的初始特征信息,可以是人脸图像的第一特征信息中与人脸器官的区域相关的特征信息,如上述各公开实施例所述。人脸器官的检测框信息,可以是表示人脸图像中各器官位置的检测框。在一种可能的实现方式中,基于人脸器官的检测框信息,可以确定人脸器官区域在人脸图像的第一特征信息中的位置,因此,可以根据人脸器官的检测框信息,从人脸图像的第一特征信息中提取出匹配的人脸器官区域的初始特征信息。
提取人脸器官区域的初始特征信息的方式在本公开实施例中不做限制。在一种可能的实现方式中,在人脸图像的第一特征信息包括人脸图像的特征图的情况下,可以根据人脸器官的检测框信息所确定的位置,对人脸图像的特征图进行剪切,来得到人脸器官区域的初始特征信息。举例来说,在一个示例中,可以利用左眼器官的检测框,在进行左眼关键点检测的第二网络分支中,对人脸图像的特征图进行剪切,得到左眼的人脸器官区域的初始特征信息。其余器官如嘴巴或是眉毛等的剪切形式可以参考上述公开实施例,在此不再赘述。剪切的方式可以根据实际情况灵活决定,详见下述各公开实施例,在此先不做展开。
在得到人脸器官区域的初始特征信息以后,第二网络分支可以进一步对初始特征信息进行深层特征提取,得到人脸器官区域的第二特征信息。其中,人脸器官区域的第二特征信息的实现形式可以详见上述各公开实施例,在此不再赘述。对初始特征信息进行深层特征提取的方式在本公开实施例中也不做限制。在一些可能的实现方式中,可以通过第二网络分支中具有特征提取功能的网络层,对人脸器官区域的初始特征信息进行进一步地深层特征提取,并将提取得到的人脸器官区域的深层特征作为人脸器官区域的第二特征信息等。
在得到人脸器官区域的第二特征信息以后,可以基于人脸器官区域的第二特征信息以及第一检测结果,确定人脸器官的第二关键点信息,这一过程的具体实现方式可以参考上述公开实施例,在此不再赘述。
通过从人脸图像的第一特征信息中提取出与每个人脸器官的检测框信息匹配的人脸器官区域的初始特征信息,并对初始特征信息进行深层特征提取得到人脸器官区域的第二特征信息,从而基于人脸器官区域的第二特征信息以及第一检测结果,确定人脸器官 的第二关键点信息,可以使得第二网络分支和第一网络分支共享部分的特征提取的网络层结构。一方面可以减少特征提取的耗时,提高关键点检测的效率,另一方面还可以利用到人脸图像的第一特征信息中反映的人脸全脸的信息,提高确定的人脸器官的第二关键点信息的稳定性。
如上述公开实施例所述,提取人脸器官区域的方式可以根据实际情况灵活决定。在一种可能的实现方式中,从人脸图像中提取出与人脸器官的检测框信息匹配的人脸器官区域可以包括:
根据检测框信息,确定人脸器官在人脸图像中的位置坐标;
通过第二网络分支的感兴趣区域校准层,在位置坐标的精度下,提取与人脸器官的检测框信息匹配的人脸器官区域。
如上述各公开实施例所述,人脸器官的检测框信息,可以是表示人脸图像中各器官位置的检测框。因此,基于检测框信息,可以确定人脸器官在人脸图像中的位置坐标。
在确定人脸器官在人脸图像中的位置坐标以后,可以通过第二网络分支的感兴趣区域校准层,在位置坐标的精度下,提取人脸器官区域。
其中,位置坐标的精度可以是位置坐标的数值精度。在一个示例中,检测框信息中检测框的顶点坐标可以为浮点数,则基于检测框信息所确定的位置坐标的精度可以与检测框的顶点坐标的浮点位数一致。在一个示例中,检测框的顶点坐标可以为整数,则位置坐标的精度可以为整数。
感兴趣区域校准层(Region Of Interest Align,ROI Align)可以是具有图像剪切功能的网络层,其实现形式在本公开实施例中不做限制。通过上述公开实施例可以看出,在一种可能的实现方式中,感兴趣区域校准层在对人脸图像进行剪切的过程中,剪切的精度可以与人脸器官在人脸图像中的位置坐标的精度一致。因此,在一种可能的实现方式中,在位置坐标的精度为浮点数的情况下,感兴趣区域校准层可以是能在浮点精度下进行图像剪切的网络层,因此,任意具有该功能的网络层均可以作为感兴趣区域校准层的实现形式。
通过根据检测框信息,确定人脸器官在人脸图像中的位置坐标,并通过第二网络分支的感兴趣区域校准层,在位置坐标的精度下,提取与人脸器官的检测框信息匹配的人脸器官区域,可以有效提升提取的人脸器官区域的精度,继而提升基于该人脸器官区域所确定的人脸器官的第二关键点信息的精度,从而提升关键点检测的精度。
在一些可能的实现方式中,从人脸图像的第一特征信息中提取人脸器官区域的初始特征信息的方式,可以参考上述从人脸图像中提取人脸器官区域的方式来实现,在此不再赘述。
在一种可能的实现方式中,基于人脸器官区域的第二特征信息以及第一检测结果,确定人脸器官的第二关键点信息,可以包括:
将人脸器官区域的第二特征信息与人脸图像的第一特征信息和/或人脸的第一关键点信息进行至少一次融合处理,得到融合特征信息;
根据融合特征信息,得到人脸器官的关键点信息。
如上述各公开实施例所述,人脸器官区域的第二特征信息,可以是第二网络分支中,对人脸器官区域和/或人脸器官区域的初始特征信息进行提取所得到的第二特征信息,其实现形式可以参考上述各公开实施例,在此不再赘述。
由于人脸器官区域是从人脸图像中提取的区域,提取的精度可能会影响人脸器官区域的第二特征信息的精度,从而进一步影响确定的人脸器官的第二关键点信息的精度。因此,在一种可能的实现方式中,为了提高确定的人脸器官的第二关键点信息的精度,可以将第一网络分支中所得到的第一检测结果中的人脸图像的第一特征信息和/或人脸 的第一关键点信息,与人脸器官区域的第二特征信息进行融合,来得到融合特征信息。由于人脸图像的第一特征信息和/或人脸的第一关键点信息是基于完整的人脸图像所得到的,可以体现出人脸的整体,因此根据融合特征信息来进一步回归得到的人脸器官的第二关键点信息,可以与得到的人脸的第一关键点信息具有较为统一的位置信息,因此具有较高的精度。
具体地,在进行特征融合的过程中,是与哪些对象进行融合,在本公开实施例中不做限制。
图3示出根据本公开一应用示例的示意图。如图所示,在一个示例中,经过各第二分支的特征提取模块(如嘴部特征提取模块、左眼特征提取模块以及右眉毛特征提取模块等)进行特征提取后,可以得到各人脸器官区域的第二特征信息,这些人脸器官区域的第二特征信息可以与图中第一分支输出的106人脸整体关键点(即人脸的第一关键点信息)分别进行一次融合,融合的形式可以为连接或其他形式等,来得到融合特征信息(图中未标明)。融合特征信息可以进一步通过计算得到各第二分支输出的人脸器官的第二关键点(如64个嘴部关键点、24个左眼关键点以及13个右眉毛关键点等)。
图4示出根据本公开一应用示例的示意图。如图所示,在一个示例中,经过各第二分支的特征提取模块(如嘴部特征提取模块、左眼特征提取模块以及右眉毛特征提取模块等)进行特征提取后,可以得到各人脸器官区域的第二特征信息,这些人脸器官区域的第二特征信息可以与图中第一分支的浅层特征提取模块所提取的人脸图像的第一特征信息分别进行一次融合,融合的形式可以为相加或其他形式等,来得到融合特征信息(图中未标明)。融合特征信息可以进一步通过计算得到各第二分支输出的人脸器官的第二关键点(如64个嘴部关键点、24个左眼关键点以及13个右眉毛关键点等)。
在一些可能的实现方式中,图3中的106人脸整体关键点(即人脸的第一关键点信息)与图4中的人脸图像的第一特征信息可以均与各人脸器官区域的第二特征信息进行融合。
在一些可能的实现方式中,人脸器官区域的第二特征信息还可以与人脸的第一关键点信息以及人脸图像的第一特征信息均进行融合。
上述各公开实施例还提出,人脸图像的第一特征信息可以包含第一网络分支中某个网络层或某些网络层所提取的特征等。因此,在将人脸器官区域的第二特征信息与人脸图像的第一特征信息进行融合的过程中,可以是与第一网络分支中任意网络层提取的特征进行融合。具体选择哪些网络层提取的特征,是深层特征还是浅层特征等,均可以根据实际情况灵活决定,在本公开实施例中不做限制。在获取融合特征信息的过程中,具体进行几次融合,可以根据融合的对象灵活决定,因此在本公开实施例中同样不做限制。
融合的方式也可以随着融合的对象不同而灵活发生变化,详见下述各公开实施例,在此先不做展开。
在得到融合特征信息以后,可以在第二网络分支中处理对应的融合特征信息,从而得到输出的人脸器官的第二关键点信息。其中,第二网络分支对融合特征信息的处理方式在本公开实施例中不做限制,可以根据实际情况灵活选择。在一种可能的实现方式中,可以通过回归层或是分类层等网络层对融合特征信息进行处理,来得到输出的人脸器官的第二关键点信息。在一种可能的实现方式中,也可以通过多个网络层所组成的网络结构对融合特征信息进行处理,来得到输出的人脸器官的第二关键点信息等。
通过将人脸器官区域的第二特征信息与人脸图像的第一特征信息和/或人脸的第一关键点信息进行至少一次融合处理,得到融合特征信息,并根据融合特征信息,得到人脸器官的第二关键点信息,可以将第一网络分支所得到的反映人脸整体情况的人脸图像的第一特征信息和/或人脸的第一关键点信息,应用到人脸器官的关键点的检测过程中, 从而使得得到的人脸器官的第二关键点信息与人脸的第一关键点信息结果相统一,具有更高的精度。
如上述公开实施例所述,融合的方式在本公开实施例中不做限制。在一种可能的实现方式中,融合处理可以包括以下操作中的至少一种:连接、相加、加权融合以及注意力特征融合。
其中,连接可以是将融合的对象直接进行拼接来实现融合;相加则可以是将融合的对象在对应像素上进行相加,来得到融合后的特征;加权融合可以是对融合的对象赋予一定的预设权重,从而根据预设权重进行相加来实现融合;注意力特征融合可以是根据注意力机制,对融合的对象通过连接以及跳跃连接等操作来实现融合。
通过上述过程,可以将人脸器官区域的第二特征信息与人脸图像的第一特征信息和/或人脸的第一关键点信息,实现多种形式的融合,从而进一步增加融合特征信息的全面性和准确性,继而提升基于融合特征信息所得到的人脸器官的第二关键点信息的准确度。
In a possible implementation, in the case where step S12 is implemented by steps S121 and S122, before step S122, step S12 may further include:
performing enhancement processing on the detection box of at least one face organ, where the enhancement processing includes scaling transformation processing and/or translation transformation processing.
Here, the detection box information may be the detection box information of the face organs output by the first network branch, which is not repeated here. In a possible implementation, in order to enrich the data during training, enhancement processing can be performed on the detection box information, such as the scaling transformation processing and/or translation transformation processing mentioned in the disclosed embodiments above.
The scaling transformation processing may expand or compress the detection box in the obtained detection box information. In a possible implementation, the detection box may be randomly scaled within a preset scaling range. The value of the preset scaling range can be flexibly set according to the actual situation and is not limited to the disclosed embodiments below. In one example, the preset scaling range may be between 0.9 and 1.1 times the size of the detection box.
The translation transformation processing may move the overall position of the detection box in the obtained detection box information. In a possible implementation, the detection box may be randomly translated within a preset translation range, which can likewise be flexibly set according to the actual situation. In one example, the preset translation range may be within ±0.05 times the length of the detection box along the translation direction, where the "+" and "−" in "±" represent the translation direction and its opposite direction, respectively.
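A minimal Python sketch of this enhancement processing follows, using the example ranges given above (scaling in [0.9, 1.1], translation within ±0.05 of the box side length); the box values are hypothetical.

    import random

    def augment_box(x1, y1, x2, y2):
        w, h = x2 - x1, y2 - y1
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        s = random.uniform(0.9, 1.1)                 # random scaling transformation
        w, h = w * s, h * s
        cx += random.uniform(-0.05, 0.05) * w        # random translation transformation
        cy += random.uniform(-0.05, 0.05) * h
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    print(augment_box(33.7, 70.2, 78.9, 95.4))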
After the enhanced detection box information is obtained, in step S122, the first detection result containing the enhanced detection box information can be used to detect the at least one face organ through the at least one second network branch to obtain the second detection result. For the specific detection manner, reference may be made to the disclosed embodiments above, and details are not repeated here.
By performing enhancement processing on the detection box information of the at least one face organ, the richness of the training data used for training the target neural network can be increased, so that the trained target neural network achieves good key point detection results under different input data, improving the processing precision and robustness of the target neural network and thus the accuracy of key point detection.
In a possible implementation, the key point detection method proposed in the embodiments of the present disclosure may further include:
obtaining, from the second key point information of the at least one face organ, second key points of face organs that meet a preset precision; and
replacing, according to the second key points of the face organs meeting the preset precision, the first key points of the face at positions in the first key point information of the face corresponding to those face organs, to obtain updated first key point information of the face.
Here, the preset precision can be flexibly set according to the actual situation and is not limited in the embodiments of the present disclosure. It may be a manually set precision, or the precision of the first key point information of the face, etc.
Since the second key point information of a face organ can be determined on the basis of the face organ region in the face image, in some possible implementations the key points of the face organ may have relatively high precision. Moreover, the determined first key point information of the face and the second key point information of the face organs may contain identical key points at the same positions. Therefore, in some possible implementations, in the case where a first key point of the face has a corresponding second key point of a face organ at the same position, and that second key point meets the preset precision, the second key point of the face organ can be taken as the first key point of the face at the corresponding position, thereby replacing part of the first key point information of the face and obtaining updated first key point information of the face.
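A minimal Python sketch of this replacement step follows; the index map from organ key points to face key points and the precision test are hypothetical stand-ins for whatever correspondence and criterion an implementation adopts.

    import numpy as np

    def replace_keypoints(face_kpts, organ_kpts, organ_to_face, meets_precision):
        face_kpts = face_kpts.copy()                 # (106, 2) first key points of the face
        for organ_idx, face_idx in organ_to_face.items():
            if meets_precision(organ_kpts[organ_idx]):
                face_kpts[face_idx] = organ_kpts[organ_idx]
        return face_kpts                             # updated first key point information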
By obtaining the second key points of face organs meeting the preset precision from the second key point information of the face organs, and replacing the first key points of the face at the corresponding positions in the first key point information of the face accordingly to obtain updated first key point information of the face, the precision of key point detection can be further improved and key points meeting the precision requirement can be obtained.
In a possible implementation, the key point detection method proposed in the embodiments of the present disclosure may further include:
converting the position of the second key point information of each face organ in the face organ region, to obtain the position of the second key point information of the face organ in the face image.
In a possible implementation, since the second key point information of a face organ can be extracted on the basis of the face organ region, the obtained positions of the second key points of the face organ may be positions within the face organ region; whereas the first key points of the face, being obtained by processing the face image, may be positions within the face image.
Therefore, in a possible implementation, the positions of the second key point information of the face organ in the face organ region can be converted to obtain the positions of the second key point information of the face organ in the face image. The conversion manner is not limited in the embodiments of the present disclosure. In a possible implementation, the positional transformation relationship between the face image and the face organ region can be determined according to the vertex or center-point coordinates of the face image and the face organ region, and, on the basis of this relationship, the positions of the second key points of the face organ in the face organ region can be transformed to obtain their positions in the face image.
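A minimal Python sketch of such a conversion follows, assuming the organ region was cropped at box (x1, y1, x2, y2) and resized to region_size; all names and values are illustrative.

    import numpy as np

    def region_to_image(kpts, box, region_size):
        x1, y1, x2, y2 = box
        scale = np.array([(x2 - x1) / region_size[0], (y2 - y1) / region_size[1]])
        return kpts * scale + np.array([x1, y1])     # map each point back into the face image

    mouth_kpts = np.random.rand(64, 2) * 28          # hypothetical key points in a 28x28 region
    mouth_kpts_img = region_to_image(mouth_kpts, (33.7, 70.2, 78.9, 95.4), (28, 28))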
By converting the positions of the second key point information of the face organs in the face organ regions into positions in the face image, the positions of the first key point information of the face and the second key point information of the face organs can be unified, facilitating subsequent analysis and processing of the key points of the face.
In some possible implementations, the positions of the first key point information of the face in the face image may also be converted into a face organ region, or the positions of the second key point information of a face organ in its corresponding face organ region may be converted into another face organ region; the first key point information of the face and the second key point information of each face organ may also all be converted into a certain preset image coordinate system. The specific conversion can be flexibly selected according to the actual situation and is not limited to the disclosed embodiments above.
In a possible implementation, the target neural network may further include at least one third network branch, which can be used to detect a face state according to the first key point information of the face.
Here, the at least one third network branch is used to obtain detection results of the face image in one or more states. A state may be a face condition reflected in the face image, and which states are included can be flexibly set according to the actual situation. In some possible implementations, a state may concern the face itself, such as whether the eyes in the face are open or closed, or which object the face corresponds to. In some possible implementations, a state may also concern the face within the image, such as whether the face is occluded in the image.
The number of states that the at least one third network branch can detect is likewise not limited in the embodiments of the present disclosure. In a possible implementation, one third network branch may output the detection result of the face image in one state. In a possible implementation, one third network branch may also output the detection results of the face image in multiple states.
The number of third network branches contained in the target neural network is likewise not limited in the embodiments of the present disclosure. In a possible implementation, the target neural network may contain multiple third network branches, through which multiple kinds of state detection can be performed on the face image respectively. In a possible implementation, the target neural network may also contain only one third network branch, through which one or multiple kinds of state detection can be performed on the face image.
The position of the third network branch in the target neural network is likewise not limited in the embodiments of the present disclosure and can be flexibly set according to the actual situation, without being limited to the disclosed embodiments below. In a possible implementation, the third network branch may be connected to the output of the first network branch. In a possible implementation, the third network branch may also be connected to a feature extraction layer of the first network branch. In some possible implementations, the third network branch may also be connected to one or more second network branches.
With a target neural network including at least one third network branch, on the one hand, state detection can additionally be performed on the face image, assisting in judging the precision of the obtained face key point information set; on the other hand, end-to-end key point detection can be further realized while making it easy to introduce new detection models, achieving end-to-end face state detection.
In a possible implementation, the method proposed in the embodiments of the present disclosure may further include:
step S13: tracking, according to the face key point information set, the face in the face image frame sequence to which the face image belongs. The manner of using the detected key points to track the face image can be flexibly decided according to the actual situation and is not limited to the disclosed embodiments below. In a possible implementation, step S13 may include:
determining at least one target key point according to the first key point information of the face and/or the second key point information of the face organs;
taking the image following the face image in the face image frame sequence as a target face image, and correcting the target face image according to the at least one target key point to obtain a corrected target face image; and
inputting the corrected target face image into the target neural network, and tracking the same object in the face image and the target face image according to the output of the target neural network.
An object in the face image frame sequence may be the object corresponding to a face contained in the face image. As described in the disclosed embodiments above, a face image may contain multiple faces; therefore, the key point detection method proposed in the embodiments of the present disclosure can track a single object or multiple objects.
A target key point may be a key point used to judge the position of an object during tracking. Which points are selected as target key points is not limited in the embodiments of the present disclosure and is not limited to the disclosed embodiments below. In a possible implementation, the first key points of the face may be taken as the target key points. In a possible implementation, the second key points of the face organs may also be taken as the target key points. In a possible implementation, both the first key points of the face and the second key points of the face organs may be taken as the target key points. In a possible implementation, according to the actual tracking requirements, some key points in the second key point information of the face organs may also be substituted into the first key point information of the face to obtain the target key points.
After the at least one target key point is determined, the image following the face image in the face image frame sequence can be taken as the target face image, and the target face image can be corrected on the basis of the target key points to obtain the corrected target face image.
The correction manner can be flexibly selected according to the actual situation; see the disclosed embodiments below for details, which are not expanded here. Since the target face image is the image following the face image in the face image frame sequence, the face in the target face image may have moved (translated, rotated, etc.) relative to the face in the face image. If the movement of the target face image relative to the face image is large, feeding the target face image directly into the target neural network for processing may fail to detect the first key point information of the face or the second key point information of the face organs in the target face image. Therefore, in a possible implementation, the at least one target key point corresponding to the face image can be used to correct the target face image, so that the corrected target face image yields a relatively accurate key point detection result in the target neural network, allowing tracking to continue and improving the continuity and accuracy of tracking.
After the corrected target face image is obtained, it can serve as a new face image, be input into the target neural network, and be processed by the key point detection method proposed in the disclosed embodiments above to obtain the corresponding first key point information of the face and/or second key point information of the face organs, from which the target key points of this new face image are determined. By comparing the changes of the corresponding target key points between the face image and the new face image (i.e., the corrected target face image), the position change of the object can be determined, realizing tracking of the object. In some possible implementations, the image following the new face image in the face image frame sequence can further be obtained as a new target face image, and the process in the disclosed embodiments above can be repeated to continuously track the object in the face image frame sequence.
By determining target key points according to the first key point information of the face and/or the second key point information of the face organs, correcting the target face image according to the target key points, and then tracking the same object in the face image and the target face image on the basis of the output of the corrected target face image in the target neural network, the next frame can be pre-corrected on the basis of the key point detection result of the current frame in the face image frame sequence, improving the feasibility and accuracy of key point detection for each frame in the sequence, and in turn the continuity and accuracy of tracking.
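A minimal Python sketch of this tracking loop follows; detect stands in for the target neural network's key point output and correct_to_template for the correction step detailed below, both hypothetical callables.

    def track(frames, detect, correct_to_template):
        target_kpts = detect(frames[0])              # target key points of the first frame
        tracks = [target_kpts]
        for frame in frames[1:]:
            corrected = correct_to_template(frame, target_kpts)
            target_kpts = detect(corrected)          # key points of the corrected next frame
            tracks.append(target_kpts)               # consecutive entries trace the object
        return tracks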
As described in the disclosed embodiments above, the correction process can be flexibly decided according to the actual situation. In a possible implementation, correcting the target face image according to the at least one target key point to obtain the corrected target face image may include:
obtaining an affine transformation matrix according to the at least one target key point in combination with a preset template; and
correcting, by means of the affine transformation matrix, the target face image toward the preset template to obtain the corrected target face image.
Here, the preset template may be a pre-defined average face pose (mean pose), and the specific face pose in the preset template can be flexibly set according to the actual situation and is not limited in the embodiments of the present disclosure. Since the target key points can reflect the pose of the face in the face image, computing with the at least one target key point and the preset template can determine the movement of the face in the face image relative to the face in the preset template, and this movement can be expressed in the form of an affine transformation matrix. How the affine transformation matrix is specifically computed can be determined according to the actual situation of the preset template and the target key points, and is not limited in the embodiments of the present disclosure.
After the affine transformation matrix is obtained — and since the target face image is the frame following the face image, the face in the target face image may have moved further relative to the face in the face image — applying the affine transformation that reflects the movement of the face in the face image to the target face image can move the face in the target face image closer to the face pose in the preset template; that is, the face in the corrected target face image is closer to the face orientation of the preset template. For the process of correcting the target face image by means of the affine transformation matrix, reference may be made to the standard affine transformation process, which is not limited in the embodiments of the present disclosure.
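A minimal Python sketch of this correction follows, using OpenCV to estimate a partial affine transform from the target key points to a mean-pose template and then warp the next frame toward it; the template key points are a hypothetical input.

    import cv2
    import numpy as np

    def correct_to_template(frame, target_kpts, template_kpts):
        M, _ = cv2.estimateAffinePartial2D(
            target_kpts.astype(np.float32), template_kpts.astype(np.float32))
        h, w = frame.shape[:2]
        return cv2.warpAffine(frame, M, (w, h))      # corrected target face image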
Through the above correction process, feeding the corrected target face image into the target neural network, on the one hand, makes it easier to obtain a key point detection result — in particular, when a face image in the face image frame sequence has a large angular offset, correction can improve the success rate of key point detection; on the other hand, it can also improve the precision of the obtained key point detection result, and in turn the precision of tracking.
In a possible implementation, the face image in the embodiments of the present disclosure may also be a face image containing key point annotations. As described in the disclosed embodiments above, the key point detection method proposed in the embodiments of the present disclosure can be implemented on the basis of the target neural network; therefore, in a possible implementation, the target neural network can also be trained on the basis of face images containing key point annotations. In this case, the method proposed in the embodiments of the present disclosure can be used to train the target neural network.
In the case where the face image includes key point annotations, to enable training, the key point annotations may contain first key point information annotations of the face and/or second key point information annotations of the face organs. Here, a first key point information annotation of the face may annotate the actual positions of the first key point information of the face in the face image, and a second key point information annotation of a face organ may annotate the actual positions of the second key point information of that face organ in the face image. The annotation manner is not limited in the embodiments of the present disclosure. In one example, the first key point information of the face and the second key point information of the face organs in the face image can be annotated manually. In another example, they can also be annotated automatically by machine.
In some possible implementations, in the case where the target neural network further includes at least one third network branch, the face image may also contain an annotation of the face state corresponding to the third network branch. For example, in the case where the target neural network further contains a third network branch for detecting the open/closed state of the eyes in the face, the face image can be annotated with the eye open/closed state according to the actual state of the eyes in the face image.
Fig. 5 shows a flowchart of a key point detection method according to an embodiment of the present disclosure. As shown in Fig. 5, in a possible implementation, in the case where the face image includes key point annotations, the key point detection method proposed in the embodiments of the present disclosure may include the following steps.
Step S11: obtain a face image.
Step S12: use at least two neural network branches included in a target neural network to detect the face and at least one face organ in the face image, to obtain a face key point information set, the face key point information set including first key point information of the face and second key point information of the at least one face organ.
Step S14: determine an error loss of the target neural network according to the key point annotations and the face key point information set.
Step S15: jointly update, according to the error loss, the parameters of the at least two neural network branches in the target neural network.
For the implementation of steps S11 and S12, reference may be made to the disclosed embodiments above, and details are not repeated here.
After the above face key point information set is obtained, in step S14, the error between the predicted key points and the annotated key points can be determined according to the annotated actual positions of the first key point information of the face and the second key point information of the face organs in the face image, thereby determining the error loss of the target neural network. Then, in step S15, the parameters in the first network branch and the second network branches are jointly updated using the error loss.
In step S14, the specific process of determining the error loss can be flexibly decided according to the actual situation; see the disclosed embodiments below for details, which are not expanded here. After the error loss is determined, the parameters in the target neural network can be updated backward according to the error loss. As can be seen from the disclosed embodiments above, the target neural network in the embodiments of the present disclosure may include the first network branch and at least one second network branch. Therefore, in a possible implementation, when the parameters of the target neural network are updated, the parameter updates of the first network branch and the at least one second network branch may proceed simultaneously; that is, the parameters in the first network branch and the at least one second network branch can be jointly optimized according to the outputs of both kinds of network, so that the trained target neural network achieves a globally optimal effect in detecting both the first key point information of the face and the second key point information of the face organs.
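A minimal, self-contained Python sketch of such a joint update follows; the two linear layers are toy stand-ins for the first network branch and one second network branch, and the loss is a placeholder.

    import torch

    first_branch = torch.nn.Linear(8, 106 * 2)       # stand-in for the first network branch
    second_branch = torch.nn.Linear(8, 64 * 2)       # stand-in for a mouth branch

    optimizer = torch.optim.Adam(
        list(first_branch.parameters()) + list(second_branch.parameters()), lr=1e-4)

    x = torch.randn(4, 8)                            # hypothetical shared features
    loss = first_branch(x).pow(2).mean() + second_branch(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # one step moves both branches together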
In some possible implementations, in the case where the target neural network further contains at least one third network branch, the at least one third network branch can be trained together with the first network branch and the at least one second network branch; that is, the parameters of the at least one third network branch can be updated jointly with those of the first network branch and the at least one second network branch. In some possible implementations, the at least one third network branch can also be trained separately; that is, when updating the parameters of the at least one third network branch, the parameters of the first network branch and the second network branches can be fixed.
By determining the error loss of the target neural network according to the face image, and jointly updating the parameters in the first network branch and the at least one second network branch according to the error loss, the first network branch and the at least one second network branch can be trained together, so that the first key point information detection results of the face and the second key point information detection results of the face organs obtained by the trained target neural network are consistent with each other and both have relatively high detection precision.
As described in the disclosed embodiments above, the implementation of step S14 can be flexibly decided according to the actual situation. In a possible implementation, step S14 may include at least one of the following processes:
determining a first error loss of the target neural network according to a first error between the first key point information of the face and the first key point information annotation of the face;
determining a second error loss of the target neural network according to a second error between the second key point information of the face organs and the second key point information annotations of the face organs; and
determining a detection box position annotation of the at least one face organ in the face image according to the first key point information annotation of the face and/or the second key point information annotations of the face organs, and determining a third error loss of the target neural network according to a third error between the detection box information of the at least one face organ and the detection box position annotation of the at least one face organ.
As described in the disclosed embodiments above, the face image may include a first key point information annotation of the face, which can indicate the actual positions of the first key point information of the face in the training image. Therefore, in a possible implementation, the error loss of the target neural network can be determined according to the first error formed between the first key point information annotation of the face and the first key point information of the face predicted by the target neural network. The specific manner of computing the error loss can be flexibly set according to the actual situation and is not limited in the embodiments of the present disclosure.
Similarly, the error loss of the target neural network can also be determined according to the second error formed between the second key point information annotations of the face organs and the second key point information of the face organs predicted by the target neural network, and its computation can likewise be flexibly selected according to the actual situation.
In a possible implementation, as described in the disclosed embodiments above, the first network branch in the target neural network can also determine the detection box information of the at least one face organ. Meanwhile, according to the first key point information annotation of the face and the second key point information annotations of the face organs, each organ in the face can also be located, so that the detection box position of each organ in the training image can be computed and used as the detection box position annotation of the face image. Therefore, in a possible implementation, the error loss of the target neural network can also be determined according to the third error formed between the detection box information of each organ predicted by the target neural network and the corresponding detection box position annotation. The manner of computing the detection box position annotation, and the manner of determining the error loss of the target neural network from the third error, can both be flexibly selected according to the actual situation and are not limited in the embodiments of the present disclosure.
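A minimal Python sketch of the three error losses follows, with smooth-L1 assumed as the distance measure (the present disclosure does not fix a particular loss function) and the box annotation derived from the key point annotations as described above; all tensors are hypothetical.

    import torch
    import torch.nn.functional as F

    pred_face = torch.randn(1, 106, 2)               # predicted first key points
    gt_face = torch.randn(1, 106, 2)                 # first key point annotations
    pred_mouth = torch.randn(1, 64, 2)               # predicted second key points (mouth)
    gt_mouth = torch.randn(1, 64, 2)                 # second key point annotations (mouth)
    pred_box = torch.randn(1, 4)                     # predicted mouth detection box

    # detection box position annotation computed from the key point annotations
    gt_box = torch.cat([gt_mouth.amin(dim=1), gt_mouth.amax(dim=1)], dim=1)

    loss_first = F.smooth_l1_loss(pred_face, gt_face)      # first error loss
    loss_second = F.smooth_l1_loss(pred_mouth, gt_mouth)   # second error loss
    loss_box = F.smooth_l1_loss(pred_box, gt_box)          # third error loss
    total = loss_first + loss_second + loss_box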
In a possible implementation, the above manners of determining the error loss of the target neural network can be combined with each other. Which manner or manners are selected to jointly determine the error loss of the target neural network can also be flexibly selected according to the actual situation and is not limited in the embodiments of the present disclosure.
In some possible implementations, in the case where the face image further contains a face state annotation, the error loss of the target neural network can also be determined according to the error between the state detection result of the face image by the target neural network and the face state annotation.
Determining the error loss of the target neural network through the above various processes makes the training of the target neural network more flexible and richer, so that the trained target neural network achieves a better key point detection effect, and the obtained first key point information of the face and second key point information of the face organs have higher consistency.
Fig. 6 shows a block diagram of a key point detection apparatus according to an embodiment of the present disclosure. As shown in Fig. 6, the key point detection apparatus 20 may include:
an image obtaining module 21, configured to obtain a face image; and
a key point detection module 22, configured to use at least two neural network branches included in a target neural network to detect the face and at least one face organ in the face image, to obtain a face key point information set, the face key point information set including first key point information of the face and second key point information of the at least one face organ.
In a possible implementation, the at least two neural network branches include a first network branch for detecting the face, and at least one second network branch for detecting the at least one face organ; the key point detection module is configured to: detect the face through the first network branch to obtain a first detection result, the first detection result including the first key point information of the face and detection box information of the at least one face organ; and detect the at least one face organ on the basis of the first detection result and the at least one second network branch to obtain a second detection result, the second detection result including the second key point information of the at least one face organ.
In a possible implementation, the key point detection module is further configured to: for each second network branch of the at least one second network branch, use the second network branch to detect the face organ corresponding to that second network branch, to obtain the second detection result of the face organ corresponding to that second network branch.
In a possible implementation, when detecting the face organ corresponding to the second network branch, the key point detection module is configured to: extract, from the face image, a face organ region matching the detection box information of the face organ; extract second feature information of the face organ region; and determine the second key point information of the face organ on the basis of the second feature information and the first detection result.
In a possible implementation, the first detection result further includes first feature information of the face image. When detecting the face organ corresponding to the second network branch, the key point detection module is configured to: extract, from the first feature information, initial feature information of a face organ region matching the detection box information of the face organ; perform deep feature extraction on the initial feature information to obtain second feature information of the face organ region; and determine the second key point information of the face organ on the basis of the second feature information and the first detection result.
In a possible implementation, when determining the second key point information of the face organ on the basis of the second feature information and the first detection result, the key point detection module is further configured to: perform at least one fusion operation on the second feature information and the first feature information and/or the first key point information to obtain fused feature information; and obtain the second key point information according to the fused feature information.
In a possible implementation, the key point detection module is further configured to: before the at least one face organ is detected on the basis of the first detection result and the at least one second network branch to obtain the second detection result, perform enhancement processing on the detection box information of the at least one face organ, the enhancement processing including scaling transformation processing and/or translation transformation processing.
In a possible implementation, the apparatus is further configured to: obtain, from the second key point information of the at least one face organ, second key points of face organs meeting a preset precision; and replace, according to the second key points of the face organs meeting the preset precision, the first key points of the face at positions in the first key point information corresponding to those face organs, to obtain updated first key point information of the face.
In a possible implementation, the apparatus is further configured to: perform position conversion on the first key point information of the face and/or the second key point information of each face organ.
In a possible implementation, the target neural network further includes at least one third network branch, the at least one third network branch being used to detect a face state according to the first key point information.
In a possible implementation, the apparatus is further configured to: track, according to the face key point information set, the face in the face image frame sequence to which the face image belongs.
In a possible implementation, the apparatus is further configured to: use the face image to correct the frame following the face image in the face image frame sequence.
In a possible implementation, the first key point information includes 68 to 128 first key points.
In a possible implementation, the second key point information includes 40 to 80 key points of the mouth, 16 to 32 key points of the left eye, 16 to 32 key points of the right eye, 10 to 20 key points of the left eyebrow, and/or 10 to 20 key points of the right eyebrow.
In a possible implementation, the face image includes key point annotations; the apparatus is further configured to: determine an error loss of the target neural network according to the key point annotations and the face key point information set; and jointly update, according to the error loss, the parameters of the at least two neural network branches in the target neural network.
Application scenario example
An application example of the present disclosure proposes a key point detection method that can perform key point detection on face images.
Fig. 3 and Fig. 7 respectively show schematic diagrams of a key point detection method according to an application example of the present disclosure, where Fig. 3 is a schematic diagram of the application process of the key point detection method and Fig. 7 is a schematic diagram of its training process. As shown in Fig. 3, in this application example of the present disclosure, the key point detection method may include the following process.
As shown in Fig. 3, in this application example of the present disclosure, after the obtained face image is input into the target neural network, it is processed separately by the first network branch and five second network branches in the target neural network.
As shown in Fig. 3, the first network branch includes a shallow feature extraction module and a main module connected in sequence. The shallow feature extraction module, like the shallow feature extraction network structure block 0 described in the disclosed embodiments above, performs preliminary extraction of the first feature information of the face image. The main module, like the deep feature extraction network structure block main described in the disclosed embodiments above, performs further extraction and regression on the first feature information of the face image. As can be seen from the figure, after the first network branch processes the face image, it can output the first key point information of 106 face key points (i.e., the 106 whole-face key points in the figure) and the detection box information of each face organ in the face image.
Further, as shown in Fig. 3, the five second network branches are mutually independent and perform second key point detection on the mouth, left eye, right eye, left eyebrow, and right eyebrow of the face, respectively. The second network branch for the mouth includes a region of interest alignment layer (ROI Align) and a mouth feature extraction module connected in sequence. For the implementation of the region of interest alignment layer, reference may be made to the disclosed embodiments above, and details are not repeated here. As can be seen from Fig. 3, the region of interest alignment layer can crop the face image according to the mouth detection box information output by the first network branch, to obtain a mouth region conforming to a preset image size. The mouth feature extraction module may include one or more network layers for feature extraction and can perform feature extraction on the mouth region to obtain the second feature information of the mouth region. As can be seen from Fig. 3, in one example, the second feature information of the mouth region can be fused with the first key point information of the 106 face key points output by the first network branch to obtain fused feature information. Regressing the fused feature information through the second network branch for the mouth can output the second key point information of the mouth. As shown in the figure, in this application example of the present disclosure, the second network branch for the mouth extracts the mouth region from the face image according to the input mouth detection box information, fuses the second feature information of the mouth region with the first key point information of the face to obtain fused feature information, and, on the basis of the fused feature information, can output the second key point information of 64 mouth key points. In one example, the second key point information of these 64 mouth key points and the first key point information of the 106 face key points can be unified into the same position coordinate system through the position conversion manner mentioned in the disclosed embodiments above.
For the implementation of the second network branches for the left eye and the right eye, reference may be made to the above second network branch for the mouth, and details are not repeated here. As shown in Fig. 3, the second network branch for the left eye can output the second key point information of 24 left-eye key points, and the second network branch for the right eye can output the second key point information of 24 right-eye key points.
The implementation of the second network branch for the left eyebrow is similar to the above second network branch for the mouth. As can be seen from Fig. 3, in this application example of the present disclosure, the second network branch for the left eyebrow can use the region of interest alignment layer to extract, on the basis of the detection box information of the left eyebrow, the initial feature information of the left eyebrow region from the first feature information of the face image output by block 0 in the first network branch, and perform deep feature extraction on this initial feature information to obtain the second feature information of the left eyebrow region; the remaining process is the same as for the second network branch for the mouth. For the implementation of the second network branch for the right eyebrow, reference may be made to the left eyebrow, and details are not repeated here. As can be seen from Fig. 3, in this application example of the present disclosure, the second network branch for the left eyebrow can output the second key point information of 13 left-eyebrow key points, and the second network branch for the right eyebrow can output the second key point information of 13 right-eyebrow key points.
As can be seen from Fig. 3, in this application example of the present disclosure, in order to align the precision of the second key point information of the face organs, the second network branches can output, in addition to the second key point information of the 64 mouth key points, 24 left-eye key points, 24 right-eye key points, 13 left-eyebrow key points, and 13 right-eyebrow key points, some second key point information of face organs corresponding in position to the first key point information of the face. This second key point information corresponding in position to the first key point information of the face can be substituted into the 106 first face key points output by the first network branch, to obtain the final first key point information of the 106 face key points.
In addition, the target neural network in this application example of the present disclosure may also contain at least one third network branch to detect the state of the face in the face image; for the specific process, see the disclosed embodiments above, and details are not repeated here.
Through the above process, a single target neural network can be used to simultaneously obtain the detection results of the first key point information of the face and the second key point information of each face organ, and the region of interest alignment layer ROI Align is used to extract the face organ regions, which both saves processing time in the overall pipeline, reducing the total time consumed by key point detection, and improves the precision of the obtained face organ regions, in turn improving the precision of the detected second key point information of the face organs. Meanwhile, in each second network branch, the second feature information of the face organ region can be fused with the first key point information of the face output by the first network branch to obtain fused feature information, from which the output second key point information of the face organ is obtained. Through the above process, the second key point information of the face organs can be made more consistent with the first key point information of the face, improving the precision of key point detection.
Further, since the key point detection method proposed in this application example of the present disclosure can operate through the target neural network, the method proposed in this application example can also be used for the training process of the target neural network. As shown in Fig. 7, the process of training the target neural network is basically the same as the above application process, the difference being that, during training, the face image contains ground-truth annotations of the key points, and the detection box information output by the first network branch undergoes enhancement processing before being input into each second network branch; for the enhancement manner, reference may be made to the disclosed embodiments above, and details are not repeated here. In one example, during training, in order to determine the positions of the left and right eyebrows, the detection box positions of the left and right eyebrows can be computed according to the ground-truth annotations of the key points in the training image, to obtain the ground-truth detection boxes of the left and right eyebrows (i.e., the "detection box (ground truth)" in the figure), which are input into the corresponding second network branches. In some possible implementations, in the case where the target neural network further contains a third network branch, the training image may also contain ground-truth annotations of the face states detected by the third network branch.
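A minimal Python sketch of deriving such a ground-truth detection box from key point annotations follows; the margin that enlarges the box so it fully covers the organ is a hypothetical choice.

    import numpy as np

    def box_from_annotation(kpts, margin=0.1):
        x1, y1 = kpts.min(axis=0)
        x2, y2 = kpts.max(axis=0)
        mx, my = margin * (x2 - x1), margin * (y2 - y1)
        return x1 - mx, y1 - my, x2 + mx, y2 + my    # ground-truth box for a second branch

    brow_box = box_from_annotation(np.random.rand(13, 2) * 100)   # 13 eyebrow annotations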
During training, the first network branch and each second network branch can be trained simultaneously, with their parameters optimized jointly so as to reach the global optimum. Through the above training process, end-to-end global optimization of the entire target neural network can be realized, improving the key point detection precision of the target neural network.
The key point detection method proposed in this application example of the present disclosure, besides being applicable to key point detection on face images, can also be extended to the processing of other images, such as human body images and skeleton images.
It can be understood that the above method embodiments mentioned in the present disclosure can be combined with each other to form combined embodiments without violating principle or logic; owing to space limitations, details are not repeated in the present disclosure.
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Embodiments of the present disclosure further propose a computer-readable storage medium having computer program instructions stored thereon, the computer program instructions implementing the above method when executed by a processor. The computer-readable storage medium may be a volatile computer-readable storage medium or a non-volatile computer-readable storage medium.
Embodiments of the present disclosure further propose an electronic device, including: a processor; and a memory for storing processor-executable instructions, where the processor is configured to implement the above method.
In practical applications, the above memory may be a volatile memory, such as RAM; or a non-volatile memory, such as ROM, flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above kinds of memory, and provides instructions and data to the processor.
The above processor may be at least one of an ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, or microprocessor. It can be understood that, for different devices, other electronic components may also be used to implement the above processor function, which is not specifically limited in the embodiments of the present disclosure.
The electronic device may be provided as a terminal, a server, or a device in another form.
Based on the same technical concept as the foregoing embodiments, embodiments of the present disclosure further provide a computer program that implements the above method when executed by a processor.
Fig. 8 is a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to Fig. 8, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or some of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phonebook data, messages, pictures, videos, etc. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, etc. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing state evaluations of various aspects of the electronic device 800. For example, the sensor component 814 may detect the on/off state of the electronic device 800 and the relative positioning of components, for example, the display and keypad of the electronic device 800; the sensor component 814 may also detect a change in position of the electronic device 800 or a component thereof, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In exemplary embodiments, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.
In exemplary embodiments, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method.
Fig. 9 is a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 9, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions, such as applications, executable by the processing component 1922. The applications stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
In exemplary embodiments, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for causing a processor to implement aspects of the present disclosure.
The computer-readable storage medium may be a tangible device capable of retaining and storing instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove having instructions recorded thereon, and any suitable combination of the above. A computer-readable storage medium as used here is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described here can be downloaded from the computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions so as to implement aspects of the present disclosure.
Aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
Embodiments of the present disclosure have been described above; the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The choice of terms used here is intended to best explain the principles of the embodiments, their practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed here.

Claims (19)

  1. A key point detection method, comprising:
    obtaining a face image; and
    using at least two neural network branches included in a target neural network to detect a face and at least one face organ in the face image, to obtain a face key point information set, the face key point information set comprising first key point information of the face and second key point information of the at least one face organ.
  2. The method according to claim 1, wherein the at least two neural network branches comprise a first network branch for detecting the face, and at least one second network branch for detecting the at least one face organ;
    said using the at least two neural network branches included in the target neural network to detect the face and the at least one face organ in the face image to obtain the face key point information set comprises:
    detecting the face through the first network branch to obtain a first detection result, the first detection result comprising the first key point information of the face and detection box information of the at least one face organ; and
    detecting, on the basis of the first detection result, the at least one face organ through the at least one second network branch to obtain a second detection result, the second detection result comprising the second key point information of the at least one face organ.
  3. The method according to claim 2, wherein said detecting, on the basis of the first detection result, the at least one face organ through the at least one second network branch to obtain the second detection result comprises:
    for each second network branch of the at least one second network branch, using the second network branch to detect the face organ corresponding to the second network branch, to obtain the second detection result of the face organ corresponding to the second network branch.
  4. The method according to claim 3, wherein said detecting the face organ corresponding to the second network branch to obtain the second detection result of the face organ corresponding to the second network branch comprises:
    extracting, from the face image, a face organ region matching the detection box information of the face organ;
    extracting second feature information of the face organ region; and
    determining the second key point information of the face organ on the basis of the second feature information and the first detection result.
  5. The method according to claim 3, wherein the first detection result further comprises first feature information of the face image;
    wherein said detecting the face organ corresponding to the second network branch to obtain the second detection result of the face organ corresponding to the second network branch comprises:
    extracting, from the first feature information, initial feature information of a face organ region matching the detection box information of the face organ;
    performing deep feature extraction on the initial feature information to obtain second feature information of the face organ region; and
    determining the second key point information of the face organ on the basis of the second feature information and the first detection result.
  6. The method according to claim 4 or 5, wherein said determining the second key point information of the face organ on the basis of the second feature information and the first detection result comprises:
    performing at least one fusion operation on the second feature information and the first feature information and/or the first key point information, to obtain fused feature information; and
    obtaining the second key point information according to the fused feature information.
  7. The method according to any one of claims 2 to 6, further comprising, before said detecting, on the basis of the first detection result, the at least one face organ through the at least one second network branch to obtain the second detection result:
    performing enhancement processing on the detection box information of the at least one face organ, wherein the enhancement processing comprises at least one of the following: scaling transformation processing or translation transformation processing.
  8. The method according to any one of claims 1 to 7, further comprising:
    obtaining, from the second key point information of the at least one face organ, second key points of face organs meeting a preset precision; and
    replacing, according to the second key points of the face organs meeting the preset precision, first key points of the face at positions in the first key point information corresponding to the face organs meeting the preset precision, to obtain updated first key point information of the face.
  9. The method according to claim 8, further comprising:
    performing position conversion on the first key point information of the face and/or the second key point information of each face organ.
  10. The method according to any one of claims 1 to 9, wherein the target neural network further comprises at least one third network branch, the at least one third network branch being used to detect a face state according to the first key point information.
  11. The method according to any one of claims 1 to 10, further comprising:
    tracking, according to the face key point information set, the face in a face image frame sequence to which the face image belongs.
  12. The method according to claim 11, further comprising:
    using the face image to correct the frame following the face image in the face image frame sequence.
  13. The method according to any one of claims 1 to 12, wherein the first key point information comprises 68 to 128 first key points.
  14. The method according to any one of claims 1 to 13, wherein the second key point information comprises at least one of the following:
    key points of the mouth, numbering 40 to 80;
    key points of the left eye, numbering 16 to 32;
    key points of the right eye, numbering 16 to 32;
    key points of the left eyebrow, numbering 10 to 20; and
    key points of the right eyebrow, numbering 10 to 20.
  15. The method according to any one of claims 1 to 14, wherein the face image comprises key point annotations;
    the method further comprising:
    determining an error loss of the target neural network according to the key point annotations and the face key point information set; and
    jointly updating, according to the error loss, parameters of the at least two neural network branches in the target neural network.
  16. A key point detection apparatus, comprising:
    an image obtaining module, configured to obtain a face image; and
    a key point detection module, configured to use at least two neural network branches included in a target neural network to detect a face and at least one face organ in the face image, to obtain a face key point information set, the face key point information set comprising first key point information of the face and second key point information of the at least one face organ.
  17. An electronic device, comprising:
    a processor; and
    a memory for storing processor-executable instructions,
    wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 15.
  18. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 15.
  19. A computer program, comprising computer-readable code which, when executed by a processor, implements the method according to any one of claims 1 to 15.
PCT/CN2021/108612 2020-12-29 2021-07-27 Key point detection method and apparatus, electronic device and storage medium WO2022142298A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011596380.1A CN112597944B (zh) 2020-12-29 Key point detection method and apparatus, electronic device and storage medium
CN202011596380.1 2020-12-29

Publications (1)

Publication Number Publication Date
WO2022142298A1 true WO2022142298A1 (zh) 2022-07-07

Family

ID=75204108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108612 WO2022142298A1 (zh) Key point detection method and apparatus, electronic device and storage medium

Country Status (3)

Country Link
CN (1) CN112597944B (zh)
TW (1) TW202226049A (zh)
WO (1) WO2022142298A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789040A (zh) * 2024-02-28 2024-03-29 华南农业大学 Method for detecting the posture of tea buds and leaves under disturbance

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597944B (zh) * 2020-12-29 2024-06-11 北京市商汤科技开发有限公司 关键点检测方法及装置、电子设备和存储介质
CN116499445B (zh) * 2023-06-30 2023-09-12 成都市晶蓉微电子有限公司 一种mems陀螺数字输出单片集成系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229293A (zh) * 2017-08-09 2018-06-29 北京市商汤科技开发有限公司 Face image processing method and apparatus, and electronic device
CN109635752A (zh) * 2018-12-12 2019-04-16 腾讯科技(深圳)有限公司 Face key point positioning method, face image processing method, and related apparatus
CN112597944A (zh) * 2020-12-29 2021-04-02 北京市商汤科技开发有限公司 Key point detection method and apparatus, electronic device and storage medium

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832672B (zh) * 2017-10-12 2020-07-07 北京航空航天大学 Pedestrian re-identification method using pose information to design multiple loss functions
CN107967456A (zh) * 2017-11-27 2018-04-27 电子科技大学 Multi-neural-network cascaded face recognition method based on face key points
CN107832741A (zh) * 2017-11-28 2018-03-23 北京小米移动软件有限公司 Face feature point positioning method and apparatus, and computer-readable storage medium
CN107977618B (zh) * 2017-11-28 2021-05-11 上海交通大学 Face alignment method based on a two-level cascaded neural network
CN108304765B (zh) * 2017-12-11 2020-08-11 中国科学院自动化研究所 Multi-task detection apparatus for face key point positioning and semantic segmentation
CN108121952B (zh) * 2017-12-12 2022-03-08 北京小米移动软件有限公司 Face key point positioning method, apparatus, device, and storage medium
CN109960974A (zh) * 2017-12-22 2019-07-02 北京市商汤科技开发有限公司 Face key point detection method and apparatus, electronic device, and storage medium
CN108509894A (zh) * 2018-03-28 2018-09-07 北京市商汤科技开发有限公司 Face detection method and apparatus
CN108319937A (zh) * 2018-03-28 2018-07-24 北京市商汤科技开发有限公司 Face detection method and apparatus
CN109325398B (zh) * 2018-06-30 2020-10-09 东南大学 Face attribute analysis method based on transfer learning
CN109376684B (zh) * 2018-11-13 2021-04-06 广州市百果园信息技术有限公司 Face key point detection method and apparatus, computer device, and storage medium
CN111382642A (zh) * 2018-12-29 2020-07-07 北京市商汤科技开发有限公司 Face attribute recognition method and apparatus, electronic device, and storage medium
CN109886107A (zh) * 2019-01-15 2019-06-14 北京奇艺世纪科技有限公司 Eye image processing method, device, image processing device, and medium
CN109829436B (zh) * 2019-02-02 2022-05-13 福州大学 Multi-face tracking method based on deep appearance features and an adaptive aggregation network
CN109948441B (zh) * 2019-02-14 2021-03-26 北京奇艺世纪科技有限公司 Model training and image processing method and apparatus, electronic device, and computer-readable storage medium
CN110276277A (zh) * 2019-06-03 2019-09-24 罗普特科技集团股份有限公司 Method and apparatus for detecting face images
CN111368685B (zh) * 2020-02-27 2023-09-29 北京字节跳动网络技术有限公司 Key point recognition method and apparatus, readable medium, and electronic device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229293A (zh) * 2017-08-09 2018-06-29 北京市商汤科技开发有限公司 Face image processing method and apparatus, and electronic device
CN109635752A (zh) * 2018-12-12 2019-04-16 腾讯科技(深圳)有限公司 Face key point positioning method, face image processing method, and related apparatus
CN112597944A (zh) * 2020-12-29 2021-04-02 北京市商汤科技开发有限公司 Key point detection method and apparatus, electronic device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Facial landmark detection", BLOG.CSDN.NET/YEZITONG/ARTICLE/DETAILS/86177846, 9 January 2019 (2019-01-09), XP055947445, Retrieved from the Internet <URL:https://blog.csdn.net/YeziTong/article/details/86177846> *
ANONYMOUS: "Facial landmark detection", BLOG.CSDN.NET/YEZITONG/ARTICLE/DETAILS/86177846, 9 January 2019 (2019-01-09), XP055947447, Retrieved from the Internet <URL:https://blog.csdn.net/YeziTong/article/details/86177846> *
SUN YI; WANG XIAOGANG; TANG XIAOOU: "Deep Convolutional Network Cascade for Facial Point Detection", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 23 June 2013 (2013-06-23), US , pages 3476 - 3483, XP032493158, ISSN: 1063-6919, DOI: 10.1109/CVPR.2013.446 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789040A (zh) * 2024-02-28 2024-03-29 华南农业大学 Method for detecting the posture of tea buds and leaves under disturbance
CN117789040B (zh) * 2024-02-28 2024-05-10 华南农业大学 Method for detecting the posture of tea buds and leaves under disturbance

Also Published As

Publication number Publication date
CN112597944A (zh) 2021-04-02
CN112597944B (zh) 2024-06-11
TW202226049A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
TWI777162B (zh) Image processing method and apparatus, electronic device, and computer-readable storage medium
WO2022142298A1 (zh) Key point detection method and apparatus, electronic device and storage medium
WO2022179026A1 (zh) Image processing method and apparatus, electronic device and storage medium
WO2019214201A1 (zh) Liveness detection method and apparatus, system, electronic device, and storage medium
WO2021051857A1 (zh) Target object matching method and apparatus, electronic device and storage medium
US20190347823A1 (en) Method and apparatus for detecting living body, system, electronic device, and storage medium
WO2022179025A1 (zh) Image processing method and apparatus, electronic device and storage medium
TWI718631B (zh) Face image processing method and apparatus, electronic device and storage medium
WO2021036382A9 (zh) Image processing method and apparatus, electronic device and storage medium
CN108304506B (zh) Retrieval method, apparatus and device
CN109840917B (zh) Image processing method and apparatus, and network training method and apparatus
WO2022188305A1 (zh) Information display method and apparatus, electronic device, storage medium, and computer program
CN111241887A (zh) Target object key point recognition method and apparatus, electronic device and storage medium
WO2022022350A1 (zh) Image processing method and apparatus, electronic device, storage medium, and computer program product
TWI752473B (zh) Image processing method and apparatus, electronic device, and computer-readable storage medium
CN111242303A (zh) Network training method and apparatus, and image processing method and apparatus
WO2023155532A1 (zh) Pose detection method and apparatus, electronic device and storage medium
WO2022193507A1 (zh) Image processing method and apparatus, device, storage medium, program, and program product
WO2022134475A1 (zh) Point cloud map construction method and apparatus, electronic device, storage medium, and program
CN114445562A (zh) Three-dimensional reconstruction method and apparatus, electronic device and storage medium
WO2022121577A1 (zh) Image processing method and apparatus
US20220270352A1 (en) Methods, apparatuses, devices, storage media and program products for determining performance parameters
WO2023273498A1 (zh) Depth detection method and apparatus, electronic device and storage medium
CN112613447A (zh) Key point detection method and apparatus, electronic device and storage medium
CN114581525A (zh) Pose determination method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913049

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913049

Country of ref document: EP

Kind code of ref document: A1