CN114882305A - Image key point detection method, computing device and computer-readable storage medium
- Publication number
- CN114882305A (application number CN202210303055.4A)
- Authority
- CN
- China
- Prior art keywords
- training
- learning model
- training sample
- deep learning
- key point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
Abstract
The present disclosure provides an image keypoint detection method, a computing device and a computer-readable storage medium. The method comprises the following steps: determining a training sample set, wherein the training sample set comprises a plurality of training samples corresponding to a plurality of training images, and each training sample comprises a plurality of key points; performing first-stage training on the deep learning model by using one training sample in the training sample set to obtain a target vector of the training sample; determining gradient values of the deep learning model based on the target vectors of the training samples and a loss function of the deep learning model, wherein the loss function is set to have an adjustable loss weight for each key point; updating the deep learning model based on the gradient value and performing second-stage training on the deep learning model by using another training sample; and detecting the candidate image based on the trained deep learning model to output key points in the candidate image.
Description
Technical Field
The present invention relates generally to the field of machine learning, and more particularly, to an image keypoint detection method, a computing device, and a computer-readable storage medium.
Background
Currently, many applications need to detect key points in an image, including face key points, human body key points, local (e.g., hand) key points, and so on. Key point detection refers to an algorithm that, given an image with relevant semantics, locates the positions of predefined key regions; such algorithms have a wide range of application scenarios.
Taking face key point detection as an example, it refers to locating, given a face image, the positions of the key regions of the face, including the eyebrows, eyes, nose, mouth, face contour, and so on. Because a face is affected by factors such as expression, makeup, posture, and occlusion, face key point detection is a challenging task. Face key point detection is an important basic link in face recognition tasks, and accurate face key point detection plays a key role in numerous research and application topics, such as face pose correction, posture recognition, expression recognition, fatigue monitoring, and mouth shape recognition. Therefore, how to obtain high-precision face key points is a hot research problem in fields such as computer vision, pattern recognition, and image processing.
Key point detection methods fall roughly into three types: traditional methods based on the ASM (Active Shape Model) and the AAM (Active Appearance Model), methods based on cascaded shape regression, and methods based on deep learning. The most mainstream scheme at present is to regress the relevant key points with a deep learning CNN network: a suitable network structure is designed to analyze the image information, and the network parameters are adjusted continuously by back propagation according to a large amount of labeled training data and the deviation between the network output and the labels, so that the network output approaches the labeled data. In this way the network learns the semantic information of the relevant key points and can predict them.
In actual use, not all detected points have the same accuracy requirements or usage frequency (for example, among face key points, the points with large variation and high usage rates are concentrated mainly in the eye and mouth regions). However, the traditional training method does not distinguish between key points according to their importance or usage frequency; instead, it gives all key points the same weight when computing the corresponding deviations. This may scatter the attention of the model: even when the prediction accuracy of key points in non-critical regions already meets product requirements, a large amount of attention is still placed equally on those regions, so that the key points in critical regions cannot converge to their optimal state.
Disclosure of Invention
To solve the above problems, in the model training process, after training has been performed using all of the key points, the important key points can be processed additionally, based on the importance of the different key point regions, by screening special samples, increasing loss weights, and/or performing additional training with a local learning model, so as to further refine the prediction accuracy of those key points.
According to an aspect of the present invention, there is provided an image keypoint detection method. The method comprises the following steps: determining a training sample set, wherein the training sample set comprises a plurality of training samples corresponding to a plurality of training images, each training sample comprises a plurality of key points, and each key point is associated with a corresponding importance label based on the importance of the key point region in which the key point is located in the training image; performing first-stage training on a deep learning model by using one training sample in the training sample set to obtain a target vector of the training sample; determining gradient values of the deep learning model based on a target vector of the training sample and a loss function of the deep learning model, wherein the loss function is set to have an adjustable loss weight for each keypoint; updating the deep learning model based on the gradient values and performing a second stage training on the deep learning model with another training sample; and detecting the candidate images based on the trained deep learning model to output key points in the candidate images.
According to another aspect of the invention, a computing device is provided. The computing device includes: at least one processor; and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions when executed by the at least one processor causing the computing device to perform steps according to the above-described method.
According to yet another aspect of the present invention, a computer-readable storage medium is provided, having stored thereon computer program code, which, when executed, performs the method as described above.
In some embodiments, associating each keypoint with a corresponding importance label based on the importance of the keypoint region in which the keypoint is located in the training image comprises: dividing the key points of a plurality of training samples in the training sample set into at least a first key point set and a second key point set according to the importance of the key point region where each key point is located; assigning a first importance label to the keypoints in the first keypoint set; and assigning a second importance label to the keypoints in the second keypoint set, wherein the keypoint regions in which the keypoints in the second keypoint set are located have higher importance than the keypoint regions in which the keypoints in the first keypoint set are located, and the second importance label is larger than the first importance label.
In some embodiments, the second stage training of the deep learning model with another training sample comprises: determining whether a change value of the gradient values is less than a predetermined threshold value; in response to determining that the change value of the gradient values is less than the predetermined threshold, increasing the loss weight for the keypoints with the second importance label; and performing second-stage training on the deep learning model by using the other training sample based on the increased loss weight, wherein the other training sample is a result of the training sample in the first stage or a different training sample in the training sample set.
In some embodiments, determining the set of training samples further comprises: counting the distribution condition of the key points belonging to the same key point region in the second key point set; and dividing training samples in the training sample set into regular samples and special samples based on the distribution situation, wherein the number of the regular samples is greater than that of the special samples, wherein performing second-stage training on the deep learning model by using another training sample comprises: determining whether a change value of the gradient values is less than a predetermined threshold value; in response to determining that the change value of the gradient values is less than the predetermined threshold, selecting the special sample as the other training sample; and performing second-stage training on the deep learning model by using the special samples.
In some embodiments, the second stage training of the deep learning model with another training sample comprises: determining whether a change value of the gradient values is less than a predetermined threshold value; in response to determining that the change value of the gradient values is less than the predetermined threshold, selecting keypoints from the results of the first training stage that belong to the same keypoint region to construct the further training sample; and performing a second stage training on a local learning model using the other training sample, wherein the local learning model is smaller than the deep learning model.
In some embodiments, the first-stage training of the deep learning model with one of the set of training samples to obtain the target vector of the training sample comprises: inputting the training sample into a feature extraction layer of the deep learning model to extract a feature vector of the training sample; and inputting the feature vector into a fully connected layer of the deep learning model to obtain a target vector of the training sample.
Drawings
The invention will be better understood and other objects, details, features and advantages thereof will become more apparent from the following description of specific embodiments of the invention given with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a system for implementing an image keypoint detection method according to an embodiment of the invention.
FIG. 2 illustrates a flow diagram of an image keypoint detection method according to some embodiments of the invention.
FIG. 3 illustrates a flow chart of steps for associating importance labels for keypoints according to some embodiments of the invention.
Fig. 4 shows a schematic diagram of the distribution of keypoints for a mouth region.
FIG. 5 is a flowchart illustrating the steps of obtaining a target vector for the training sample according to an embodiment of the present invention.
FIG. 6 shows a structural diagram of a deep learning model according to an embodiment of the invention.
FIG. 7 is a flowchart illustrating steps for a second stage training of a deep learning model according to one embodiment of the invention.
FIG. 8 is a flowchart illustrating steps for a second stage training of a deep learning model according to another embodiment of the present invention.
FIG. 9 is a flowchart illustrating steps of a second stage training of a deep learning model according to yet another embodiment of the present invention.
FIG. 10 illustrates a block diagram of a computing device suitable for implementing embodiments of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In the following description, for the purposes of illustrating various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood as an open, inclusive meaning, i.e., as being interpreted to mean "including, but not limited to," unless the context requires otherwise.
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. Thus, the appearances of the phrase "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms first, second and the like used in the description and the claims are used for distinguishing objects for clarity, and do not limit the size, other order and the like of the described objects.
Fig. 1 shows a schematic diagram of a system 1 for implementing an image keypoint detection method according to an embodiment of the invention. As shown in fig. 1, the system 1 includes a user terminal 10, a computing device 20, a server 30, and a network 40. User terminal 10, computing device 20, and server 30 may exchange data via network 40. Here, each user terminal 10 may be a mobile or fixed terminal of an end user, such as a mobile phone, a tablet computer, a desktop computer, or the like. The user terminal 10 may communicate with the server 30, for example, through an application installed thereon, to transmit information to the server 30 and/or receive information from the server 30. For example, the user terminal 10 may send a captured photograph to the computing device 20 and/or the server 30 to enable the computing device 20 to perform keypoint detection on the photograph. The computing device 20 may perform corresponding operations based on data from the user terminal 10 and/or the server 30. The computing device 20 may include at least one processor 210 and at least one memory 220 coupled to the at least one processor 210, the memory 220 having stored therein instructions 230 executable by the at least one processor 210; the instructions 230, when executed by the at least one processor 210, perform at least a portion of the image keypoint detection method 100 as described below. Note that, herein, computing device 20 may be part of server 30 or may be separate from server 30. The specific structure of computing device 20 or server 30 may be, for example, as described below in connection with FIG. 10.
FIG. 2 illustrates a flow diagram of an image keypoint detection method 100, according to some embodiments of the invention. The image keypoint detection method 100 can be performed, for example, by the computing device 20 or the server 30 in the system 1 shown in fig. 1. The image keypoint detection method 100 is described below in conjunction with fig. 1-10, for example, as being performed in the computing device 20. Note that, the image key point detection method 100 is described below by taking the detection of key points of a human face as an example, but it can be understood by those skilled in the art that the image key point detection method 100 described herein can be equally applied to other key point detection scenarios besides human faces, for example, to the detection of key points of an entire human body (e.g., detecting key points of a head, an elbow, a knee, etc. of a human body) or to the detection of key points of a local part of a human body (e.g., detecting key points of a wrist, a finger joint, a fingertip, etc. of a hand).
As shown in fig. 2, image keypoint detection method 100 includes step 110, in which computing device 20 may determine a set of training samples, where the set of training samples includes a plurality of training samples corresponding to a plurality of training images, and each training sample may include a plurality of keypoints. Here, the computing device 20 may, for example, obtain a plurality of facial images from the user terminal 10 or from a public or private face database as training images, and label key points (also referred to as feature points) in the facial images to form the training samples. The number of face keypoints labeled may vary depending on the detection technique; for example, each face is labeled with 29 keypoints in the LFPW face database and the COFW face database, and each face is labeled with 21 keypoints in the AFLW face database. The present invention is not limited to the specific face database used or to the particular number of labeled keypoints. In addition, although face key point detection is described as an example, those skilled in the art can understand that the idea of the present invention can easily be applied to other key point detection fields. For example, in the field of human body key point detection, key points including the crown of the head, neck, left and right palms, left and right elbows, left and right shoulders, left and right hips, left and right knees, left and right ankles, etc. may be detected; 14-point and 25-point key point detection are common configurations.
Unlike conventional methods in the prior art, in the present invention each keypoint is associated with a corresponding importance label based on the importance of the keypoint region in which it is located in the training image. For example, the computing device 20 may assign different importance labels to the keypoints in each keypoint region based on the importance level of the respective keypoint region of the face. The present invention uses two-stage model training: in the first stage, the model is trained preliminarily without considering the importance of each keypoint, and after the model has preliminarily converged, second-stage training is performed for the keypoints of higher importance according to the importance of each keypoint.
FIG. 3 illustrates a flow diagram of a process for associating importance tags for keypoints in step 110 according to some embodiments of the invention.
As shown in fig. 3, step 110 may include a sub-step 112 in which computing device 20 may divide the keypoints of a plurality of training samples in the set of training samples into at least a first set of keypoints and a second set of keypoints according to the importance of the keypoint region in which each keypoint is located. Taking face keypoint detection as an example, as described above, the keypoints of the eye region and the mouth region vary greatly and are of particular importance for applications such as expression recognition, mouth shape recognition, and the like. Therefore, the mouth region and the eye region may be set as the key point regions with higher importance, and the other regions may be set as the key point regions with lower importance. Accordingly, the computing device 20 may divide the keypoints of the keypoint regions of different degrees of importance into different sets of keypoints.
In sub-step 114, computing device 20 may assign a first importance label to the keypoints in the first set of keypoints and in sub-step 116, computing device 20 may assign a second importance label to the keypoints in the second set of keypoints.
Here, it is assumed that the importance of the keypoint region (e.g., the eye and mouth region) where the keypoint in the second keypoint set is located is higher than the importance of the keypoint region (the remaining region of the face) where the keypoint in the first keypoint set is located, and thus the second importance label is larger than the first importance label. For example, the second importance label may be set to 2 and the first importance label may be set to 1. In the second stage training, additional training or processing may be performed on training samples with higher importance labels.
Those skilled in the art will appreciate that the present invention is not so limited and that the keypoints in the training sample set may be divided into more keypoint sets. In one embodiment, the mouth region and the eye region may be set as the most important keypoint regions, the eyebrow region and the nose region may be set as the second most important keypoint regions, and the face contour region may be set as the least important keypoint region. In this case, the keypoints in the training sample set may be divided into three keypoint sets, and different importance labels may be assigned to the three keypoint sets, respectively. For example, the importance labels of the keypoints of the mouth region and the eye region may be set to 3, the importance labels of the keypoints of the eyebrow region and the nose region may be set to 2, and the importance labels of the keypoints of the face contour region may be set to 1.
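As a purely illustrative sketch of how such importance labels could be organized in code, the snippet below maps keypoint indices to the three-level labels described above; the 68-point index ranges and region boundaries are assumptions, not values prescribed by the invention.

```python
# Hypothetical 68-point face annotation: the index ranges are illustrative only.
KEYPOINT_REGIONS = {
    "contour":  range(0, 17),
    "eyebrows": range(17, 27),
    "nose":     range(27, 36),
    "eyes":     range(36, 48),
    "mouth":    range(48, 68),
}

# Three-level importance labels as described above: higher = more important.
REGION_IMPORTANCE = {
    "mouth": 3, "eyes": 3,        # most important key point regions
    "eyebrows": 2, "nose": 2,     # second most important
    "contour": 1,                 # least important
}

def importance_labels(num_keypoints: int = 68) -> list[int]:
    """Return an importance label for every keypoint index."""
    labels = [1] * num_keypoints
    for region, indices in KEYPOINT_REGIONS.items():
        for i in indices:
            labels[i] = REGION_IMPORTANCE[region]
    return labels
```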
Further, since the expression of a person is mainly determined by the keypoints in the keypoint regions of higher importance, the positional relationship between these keypoints may vary greatly depending on the expressed expression. Conventional face detection mainly handles normal expressions of a frontal face and lacks detection of special expressions, whereas in applications such as social networking and interactive gaming, special expressions such as grimaces and exaggerated expressions may occur frequently. Therefore, in the second-stage training of the invention, additional training or processing can be performed on the training samples that embody such special expressions.
In this case, in step 110, the computing device 20 may also count the distribution of the keypoints belonging to the same keypoint region in the second keypoint set, and divide the training samples in the training sample set into regular samples and special samples based on the distribution, where the number of the regular samples is greater than that of the special samples.
Fig. 4 shows a schematic diagram of the distribution of keypoints for a mouth region. As shown in fig. 4, assume that the key points 401 (upper lip middle point), 402 (lower lip middle point), 403 (left lip corner), and 404 (right lip corner) are selected as the key points of the mouth region. The degree of mouth opening in the face image may be determined from the distance between the key points 401 and 402, or from the ratio of the distance between the key points 401 and 402 to the distance between the key points 403 and 404. Training samples whose degree of opening lies within a predetermined range are referred to as regular samples (i.e., representing normal expressions of a frontal face), and training samples whose degree of opening exceeds the predetermined range are referred to as special samples (i.e., representing special expressions such as surprise, a grimace, etc.). Similarly, the training samples can be further classified into regular samples and special samples according to the distribution of the key points of the eye region.
By further dividing the training samples into regular samples and special samples based on the distribution of the key points of the key point regions with higher importance, it is possible to train the model by increasing the proportion of the special samples in the second stage training, so that the model can detect more expressions, especially special expressions.
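The following sketch illustrates one way the regular/special division based on mouth opening could be computed; the keypoint indices standing in for points 401-404, the opening ratio, and the threshold value are illustrative assumptions.

```python
import numpy as np

def mouth_openness(kps: np.ndarray) -> float:
    """Ratio of vertical lip distance to horizontal lip distance.

    kps is an (N, 2) array of (x, y) keypoints; the indices below are
    hypothetical stand-ins for points 401-404 in Fig. 4.
    """
    upper_mid, lower_mid = kps[51], kps[57]
    left_corner, right_corner = kps[48], kps[54]
    vertical = np.linalg.norm(upper_mid - lower_mid)
    horizontal = np.linalg.norm(left_corner - right_corner) + 1e-6
    return vertical / horizontal

def split_samples(samples, threshold: float = 0.6):
    """Divide training samples into regular and special samples (assumed threshold)."""
    regular, special = [], []
    for kps in samples:
        (special if mouth_openness(kps) > threshold else regular).append(kps)
    return regular, special
```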
Continuing with FIG. 2, at step 120, computing device 20 may perform a first stage training of the deep learning model using one of the set of training samples obtained at step 110 to obtain a target vector for the training sample. Fig. 5 shows a flowchart of the step 120 of obtaining the target vector of the training sample according to an embodiment of the present invention. Fig. 6 shows a schematic structural diagram of a deep learning model 600 according to an embodiment of the invention. As shown in fig. 6, the deep learning model 600 may include an input layer 610, one or more feature extraction layers 620, one or more fully connected layers 630, and an output layer 640. Here, the deep learning model 600 may be any regression-type deep learning model, and the deep learning model 600 is described below by taking a Convolutional Neural Network (CNN) model as an example.
In the model training phase, the deep learning model 600 receives as input the training samples constructed as described above (hereinafter also referred to as input vectors) and the sample label of each training sample, and outputs a target vector corresponding to each training sample. Here, the sample label is used to indicate the key points included in the training sample.
In the model usage stage, the deep learning model 600 receives candidate images as input and outputs a target vector for each candidate image. The deep learning model 600 is described below mainly in terms of the model training phase.
As shown in fig. 5, step 120 may include sub-step 122, where computing device 20 may input a training sample into feature extraction layer 620 through input layer 610 of deep learning model 600 to extract a feature vector of the training sample. Feature extraction layer 620 may have various different compositions, such as ordinary convolution (Conv), ResNet blocks, MobileNet blocks, GhostNet blocks, SENet blocks, ShuffleNet blocks, and the like. For the CNN model, the feature extraction layer 620 may further include a convolutional layer, an excitation layer, and a pooling layer.
In one embodiment, at the convolutional layer of feature extraction layer 620, the convolution output of the input vector may be obtained using convolution kernel calculations. For example, the convolution output may be expressed as:
X_conv = Conv(X_in)    (1)
where X_in is the input vector, i.e., the training sample, and Conv() denotes a convolution kernel operation (i.e., ordinary convolution) on the input vector.
At the excitation layer of the feature extraction layer 620, the convolution output may be non-linearly mapped using an excitation function. The excitation function may be, for example, the Sigmoid function or the nonlinear activation function ReLU. Taking the Sigmoid function as an example, the nonlinear mapping output X_s can be expressed as:
X_s = Sigmoid(X_conv)    (2)
where the Sigmoid function can be expressed as:
Sigmoid(x) = 1 / (1 + e^(-x))    (3)
Next, at the pooling layer of the feature extraction layer 620, a pooling operation may be performed on the nonlinear mapping output X_s to reduce its dimensionality. For example, a maximum pooling operation or an average pooling operation may be selected. The pooling layer requires no additional parameters; it merely reduces the dimension of the output of the previous layer while retaining most of its important information.
After processing by the one or more feature extraction layers 620, the feature vector X_fea of the training sample is obtained.
Next, in sub-step 124, computing device 20 may input the feature vector X_fea into the fully connected layer 630 to obtain the target vector X_out of the training sample.
In one embodiment, the nonlinear activation function ReLU may be applied to the feature vector X_fea output by the feature extraction layer 620 to obtain the target vector X_out. For example, the target vector X_out can be expressed as:
X_out = ReLU(X_fea)    (4)
Thus, one training pass of the deep learning model 600 with one training sample is equivalent to evaluating the whole model
X_out = W * X_in + b    (5)
where W is the weight function of the deep learning model 600 and b is the bias function of the deep learning model 600. The purpose of training the deep learning model 600 is to continuously update the weight function W and the bias function b toward convergence values. Here, the initial value of the weight function W may be set arbitrarily or empirically.
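For concreteness, a minimal PyTorch sketch of a model with the structure of Fig. 6 (convolution, excitation, and pooling feature extraction layers followed by a fully connected regression head) is given below; the layer sizes, the 64x64 input resolution, and the keypoint count are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 68  # assumed; depends on the annotation scheme used

class KeypointNet(nn.Module):
    def __init__(self, num_keypoints: int = NUM_KEYPOINTS):
        super().__init__()
        # Feature extraction layers 620: convolution + excitation + pooling.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # Eq. (1): X_conv = Conv(X_in)
            nn.Sigmoid(),                                 # Eq. (2)/(3): excitation
            nn.MaxPool2d(2),                              # pooling reduces dimension
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer 630 producing the target vector X_out:
        # one (x, y) pair per keypoint; input size assumes 64x64 images.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 256),
            nn.ReLU(),                                    # Eq. (4)
            nn.Linear(256, num_keypoints * 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x))                  # Eq. (5) end to end
```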
Continuing with FIG. 2, next, at step 130, computing device 20 may determine gradient values for deep learning model 600 based on the target vectors of the training samples and the Loss function Loss of deep learning model 600. In conventional model training, the loss value of each keypoint contributes equally to the loss function of the model, i.e., no keypoint has a separate loss weight (equivalently, the loss weight of every keypoint is 1). In contrast, in the present invention, the Loss function Loss is set to have a corresponding loss weight for the keypoints of each keypoint region, and this loss weight is adjustable.
In one embodiment, the Loss function Loss of the deep learning model 600 may be expressed as:
Loss = Σ_{i=1}^{n} a_i · L_i    (6)
where L_i represents the loss value of the i-th keypoint, a_i represents the loss weight of the i-th keypoint, and n represents the number of keypoints in a training sample. In one example, the loss value L_i can be expressed as the difference between the training output value for keypoint i and the sample label.
In the first-stage training of the model, the loss weight a_i of each keypoint can be set to the same value, e.g., all equal to 1, and the deep learning model 600 is trained according to conventional model training methods.
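A sketch of the weighted loss of equation (6) is shown below; taking the per-keypoint Euclidean distance as the loss value L_i is one possible choice assumed here for illustration.

```python
import torch

def weighted_keypoint_loss(pred: torch.Tensor,
                           target: torch.Tensor,
                           loss_weights: torch.Tensor) -> torch.Tensor:
    """Eq. (6): Loss = sum_i a_i * L_i.

    pred, target: (batch, n, 2) predicted / labeled keypoint coordinates.
    loss_weights: (n,) adjustable loss weight a_i for each keypoint.
    Here L_i is taken as the Euclidean distance for keypoint i (an assumption).
    """
    per_point = torch.linalg.norm(pred - target, dim=-1)   # (batch, n) = L_i
    return (loss_weights * per_point).sum(dim=-1).mean()
```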
In some embodiments of the invention, a back propagation algorithm is used to update the weight function W_m and the bias function b_m of each layer of the deep learning model 600 (m = 1, 2, ..., M), where M is the number of layers of the deep learning model 600. Thus, in step 130, the gradient values ∂Loss/∂W_M and ∂Loss/∂b_M of the last layer (the M-th layer) of the deep learning model 600 may be determined based on the Loss function Loss and the weight function W_M and bias function b_M of the M-th layer, and the weight function of each of the layers of the deep learning model 600 may be updated based on the gradient values of the last layer.
Specifically, starting from the gradients ∂Loss/∂W_M and ∂Loss/∂b_M of the M-th layer, the gradients of the (M-1)-th layer, the (M-2)-th layer, ..., and the 1st layer may be determined in turn based on any one of a batch, mini-batch, or stochastic gradient descent method, and the weight function W_m (and bias function b_m) of each layer is updated accordingly.
The operation of step 130 is repeated based on the preset iteration step size until a preliminary convergence condition is reached. Here, the preliminary convergence condition may be that the difference between the gradient values of two consecutive iterations (i.e., the change value of the gradient values) is less than a predetermined threshold. After determining that the deep learning model 600 has reached the preliminary convergence condition, the first-stage training of the model is complete and the second-stage training of the model may begin.
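A simplified sketch of the first-stage loop with the preliminary convergence check follows; it reuses the KeypointNet and weighted_keypoint_loss sketches above, and monitoring the gradient norm of the last fully connected layer is one assumed way of realizing the "change value of the gradient values" test.

```python
import torch

def first_stage_training(model, loader, loss_weights,
                         threshold=1e-4, max_iters=10_000, lr=1e-3):
    """Train until the change of the last-layer gradient norm falls below threshold."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    prev_grad_norm = None
    data_iter = iter(loader)
    for _ in range(max_iters):
        try:
            images, targets = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            images, targets = next(data_iter)
        optimizer.zero_grad()
        pred = model(images).view(targets.shape)
        loss = weighted_keypoint_loss(pred, targets, loss_weights)
        loss.backward()
        # Gradient of the last fully connected layer (the M-th layer).
        grad_norm = model.fc[-1].weight.grad.norm().item()
        optimizer.step()
        if prev_grad_norm is not None and abs(grad_norm - prev_grad_norm) < threshold:
            break  # preliminary convergence reached; first-stage training done
        prev_grad_norm = grad_norm
    return model
```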
As shown in fig. 2, at step 140, computing device 20 may update deep learning model 600 based on the gradient values determined at step 130 and perform a second stage training of deep learning model 600 with another training sample. Step 140 may have different implementation flows depending on different implementations of the second stage training.
FIG. 7 shows a flowchart of the step 140 of performing the second stage training of the deep learning model 600 according to one embodiment of the present invention.
As shown in fig. 7, step 140 may include a sub-step 142, in which computing device 20 may determine whether the change value of the gradient values determined in step 130 (i.e., the change of the gradient values relative to the gradient values of the previous iteration) is less than a predetermined threshold. The predetermined threshold may be set differently depending on, among other factors, the task being performed, the scale of the input data, and the design of the loss function.
If it is determined that the change value of the gradient values is less than the predetermined threshold (the "yes" branch of sub-step 142), i.e., it is determined that the deep learning model 600 has reached the preliminary convergence condition of the first-stage training, then in sub-step 144 the loss weight a_i of the keypoints having the second importance label is increased to update the Loss function Loss of the deep learning model 600. For example, in the first-stage training, the loss weights of all keypoints i are set to 1, and after the preliminary convergence condition is reached, the loss weights of the more important keypoints (e.g., the keypoints of the eye region and the mouth region) can be increased to 2. Alternatively, when the importance of the keypoints is divided into three levels, the loss weight of the keypoints of the mouth region and the eye region may be set to 3, the loss weight of the keypoints of the eyebrow region and the nose region may be set to 2, and the loss weight of the keypoints of the face contour region may remain 1.
Then, in sub-step 146, the deep learning model 600 is second-stage trained using another training sample based on the loss weight increased in sub-step 144. Here, the other training sample may be a result of the training sample after the first-stage training, or may be a different training sample selected from a set of training samples. In the former case, the two stages of training utilize the same training image, and therefore less information is lost.
On the other hand, if it is determined that the change value of the gradient value is greater than or equal to the predetermined threshold value ("no" determination of sub-step 142), i.e., it is determined that the deep learning model 600 does not reach the preliminary convergence condition for the first stage training, then, in sub-step 148, it may return to step 120 to continue the first stage training of the deep learning model 600.
In this implementation, the second-stage training is substantially the same as the first-stage training process described above, except that in the second training stage some keypoints have greater loss weights than in the first training stage.
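Continuing the earlier sketches, sub-step 144 could be realized as follows; reusing the importance label itself as the increased loss weight simply mirrors the 1/2/3 example above and is an illustrative choice, not a requirement.

```python
import torch

def increase_loss_weights(labels, base_weight: float = 1.0) -> torch.Tensor:
    """Sub-step 144: give keypoints with higher importance labels larger loss weights.

    Here the importance label itself is reused as the new loss weight,
    matching the 1/2/3 example above (an illustrative choice).
    """
    return torch.tensor([max(base_weight, float(lbl)) for lbl in labels])

# Usage sketch: after preliminary convergence, continue training with the new weights.
# loss_weights = increase_loss_weights(importance_labels(68))
# model = first_stage_training(model, loader, loss_weights)  # second-stage pass
```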
FIG. 8 shows a flowchart of the step 140 of performing a second stage training of the deep learning model 600 according to another embodiment of the present invention. The difference from the embodiment shown in fig. 7 is that, in the second stage training shown in fig. 8, the key points of the highlight region are trained by increasing the proportion of the special samples as training samples as described above.
As shown in fig. 8, step 140 may include sub-step 142', wherein computing device 20 may determine whether the change value of the gradient value determined in step 130 (i.e., the change value of the gradient value relative to the gradient value of the previous iteration) is less than a predetermined threshold value.
If it is determined that the change value of the gradient value is less than the predetermined threshold value (the determination of substep 142 'is "yes"), i.e., it is determined that the deep learning model 600 has reached the preliminary convergence condition of the first stage training, then in substep 144', a special sample as described above is selected as the other training sample.
Then, in sub-step 146 ', the deep learning model 600 is second-stage trained using the particular samples selected in sub-step 144'.
On the other hand, if it is determined that the change value of the gradient value is greater than or equal to the predetermined threshold value (the determination of sub-step 142 'is no), that is, it is determined that the deep learning model 600 does not reach the preliminary convergence condition for the first-stage training, then, at sub-step 148', it may return to step 120 to continue the first-stage training of the deep learning model 600.
In this implementation, the second-stage training is substantially the same as the first-stage training process described above, except that the training samples used in the second training stage are only special samples, or the proportion of special samples is greater than in the first training stage (in the first training stage, the training samples may be drawn according to their natural distribution, i.e., without particular selection).
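One way to realize the second-stage sampling of Fig. 8 is sketched below; the batch size and the oversampling ratio for special samples are illustrative assumptions.

```python
import random

def second_stage_batches(regular, special, batch_size=32, special_ratio=0.7):
    """Yield second-stage batches in which special samples dominate.

    In the first stage, samples follow their natural distribution; here the
    proportion of special samples is raised to special_ratio (assumed value).
    The generator is infinite; the caller decides how many batches to draw.
    """
    n_special = int(batch_size * special_ratio)
    n_regular = batch_size - n_special
    while True:
        batch = (random.choices(special, k=n_special)
                 + random.choices(regular, k=n_regular))
        random.shuffle(batch)
        yield batch
```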
FIG. 9 shows a flowchart of the step 140 of performing the second stage training of the deep learning model 600 according to yet another embodiment of the present invention. The difference from the embodiment shown in fig. 7 and 8 is that in the second stage training shown in fig. 9, a separate training model (sub-model) is used for training the key points of the important region.
As shown in fig. 9, step 140 may include sub-step 142 "in which computing device 20 may determine whether the change value of the gradient value determined in step 130 (i.e., the change value of the gradient value relative to the gradient value of the previous iteration) is less than a predetermined threshold.
If it is determined that the change in the gradient values is less than the predetermined threshold (the "yes" branch of sub-step 142"), i.e., it is determined that the deep learning model 600 has reached the preliminary convergence condition of the first-stage training, then, in sub-step 144", keypoints belonging to the same keypoint region are selected from the results of the first training stage to construct the further training sample. For example, the keypoints of the mouth region and/or an eye region may be selected separately to construct training samples for the second-stage training.
Then, in sub-step 146 ", the local learning model may be second-stage trained using another training sample constructed in sub-step 144". Here, the local learning model may have a similar structure to the deep learning model 600, but a size smaller than the deep learning model 600.
For example, in the case where the mouth region and the eye region are set as the key regions as described above, the second-stage training may include constructing different second-stage training samples based on the key points of the mouth region, the left eye region, and the right eye region, respectively, and training these second-stage training samples based on different local learning models, respectively, for example, parallel training may be performed.
On the other hand, if it is determined that the change value of the gradient value is greater than or equal to the predetermined threshold value ("no" determination of substep 142 "), i.e., it is determined that the deep learning model 600 does not reach the preliminary convergence condition for the first stage training, then, at substep 148", it may return to step 120 to continue the first stage training of the deep learning model 600.
In this implementation, the second-stage training is substantially the same as the first-stage training process as described above, except that a separate training sample is constructed for only key points of the important region and a separate local learning model is used for training in the second training stage, i.e., this implementation cascades a separate local learning model on the basis of the deep learning model 600 shown in fig. 6. In some contexts, the combination of the deep learning model 600 shown in fig. 6 and the local learning model described in connection with fig. 9 is also referred to as the deep learning model of system 1.
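A minimal sketch of the cascaded local learning model and of constructing the region-specific second-stage samples is given below; the mouth index range, crop size, and layer sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Assumed index set for one keypoint region (e.g., the mouth); illustrative only.
MOUTH_INDICES = list(range(48, 68))

class LocalKeypointNet(nn.Module):
    """A smaller model dedicated to one keypoint region (the 'local learning model')."""
    def __init__(self, num_region_points: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(8 * 32 * 32, num_region_points * 2),  # assumes 64x64 inputs
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def build_region_samples(images, first_stage_outputs, indices=MOUTH_INDICES):
    """Sub-step 144'': keep only the keypoints of one region as second-stage targets."""
    return [(img, kps[indices]) for img, kps in zip(images, first_stage_outputs)]
```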
Note that while various embodiments of step 140 are described above in connection with fig. 7-9, respectively, one skilled in the art will appreciate that these embodiments may be combined in any manner. For example, in the case of combining the embodiments of fig. 7 and 8, in sub-step 144/144 ', a special sample may be selected as another training sample and the loss weights of keypoints with second importance labels are increased, and in sub-step 146/146', the deep learning model 600 is second-stage trained based on the increased loss weights and the special sample.
Similarly, the second stage training of step 140 above may be repeated based on a preset iteration step size until a maximum number of iterations is reached or a stop iteration threshold is reached. To this end, the weighting function of the deep learning model 600 (or its combination with the respective local learning models) is trained to a convergence value, which can be used to detect keypoints in new candidate images.
Continuing with FIG. 2, at step 150, computing device 20 may detect a candidate image based on the trained deep learning model to output keypoints in the candidate image.
Here, step 150 may have different implementations depending on different implementations of the second stage training in step 140.
Using the implementation shown in fig. 7 and/or fig. 8, the deep learning model 600 directly outputs the detected keypoints in the candidate images.
Using the implementation shown in fig. 9, the output of the deep learning model 600 is candidate keypoints (or preliminary keypoints) of the candidate image, which may be further predicted by the local learning model to output more accurate keypoints.
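A sketch of such cascaded inference is given below; the structure of local_models and the omission of region cropping are simplifying assumptions of this sketch.

```python
import torch

def detect_keypoints(image: torch.Tensor, deep_model, local_models) -> torch.Tensor:
    """Inference with the cascaded variant of Fig. 9 (a sketch).

    deep_model outputs candidate (preliminary) keypoints for the whole image;
    each local learning model then re-predicts the keypoints of its own region.
    local_models maps a tuple of region indices to a trained local model;
    cropping the region from the image before calling the local model is
    omitted here for brevity.
    """
    batch = image.shape[0]
    keypoints = deep_model(image).detach().view(batch, -1, 2).clone()  # candidates
    for indices, local_model in local_models.items():
        region = local_model(image).detach().view(batch, len(indices), 2)
        keypoints[:, list(indices), :] = region                        # refined points
    return keypoints
```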
FIG. 10 illustrates a block diagram of a computing device 1000 suitable for implementing embodiments of the present invention. Computing device 1000 may be, for example, computing device 20 or server 30 as described above.
As shown in fig. 10, computing device 1000 may include one or more Central Processing Units (CPUs) 1010 (only one shown schematically) that may perform various suitable actions and processes in accordance with computer program instructions stored in Read Only Memory (ROM) 1020 or loaded from storage unit 1080 into Random Access Memory (RAM) 1030. In the RAM 1030, various programs and data required for the operation of the computing device 1000 may also be stored. The CPU 1010, ROM 1020, and RAM 1030 are connected to each other via a bus 1040. An input/output (I/O) interface 1050 is also connected to bus 1040.
A number of components in computing device 1000 are connected to I/O interface 1050, including: an input unit 1060 such as a keyboard, a mouse, or the like; an output unit 1070 such as various types of displays, speakers, and the like; a storage unit 1080, such as a magnetic disk, optical disk, or the like; and a communication unit 1090 such as a network card, modem, wireless communication transceiver, or the like. A communication unit 1090 allows computing device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The image keypoint detection method 100 described above may be performed, for example, by the CPU 1010 of a computing device 1000, such as computing device 20 or server 30. For example, in some embodiments, image keypoint detection method 100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1080. In some embodiments, part or all of the computer program may be loaded and/or installed onto computing device 1000 via ROM 1020 and/or communication unit 1090. When the computer program is loaded into RAM 1030 and executed by CPU 1010, one or more operations of image keypoint detection method 100 described above may be performed. Further, the communication unit 1090 may support wired or wireless communication functions.
Those skilled in the art will appreciate that the computing device 1000 illustrated in FIG. 10 is merely illustrative. In some embodiments, computing device 20 or server 30 may contain more or fewer components than computing device 1000.
The image keypoint detection method 100 and the computing device 1000 that can be used as the computing device 20 or the server 30 according to the present invention are described above with reference to the drawings. However, it will be understood by those skilled in the art that the steps of the image keypoint detection method 100 are not limited to being performed in the order shown in the figures and described above, but may be performed in any other reasonable order. Further, the computing device 1000 also need not include all of the components shown in FIG. 10, it may include only some of the components necessary to perform the functions described in the present invention, and the manner in which these components are connected is not limited to the form shown in the figures.
The present invention may be methods, apparatus, systems and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.
In one or more exemplary designs, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The units of the apparatus disclosed herein may be implemented using discrete hardware components, or may be integrally implemented on a single hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the invention may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the invention is provided to enable any person skilled in the art to make or use the invention. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the present invention is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. An image keypoint detection method comprising:
determining a training sample set, wherein the training sample set comprises a plurality of training samples corresponding to a plurality of training images, each training sample comprises a plurality of key points, and each key point is associated with a corresponding importance label based on the importance of the key point region in which the key point is located in the training image;
performing a first-stage training on a deep learning model by using one training sample in the training sample set to obtain a target vector of the training sample;
determining gradient values of the deep learning model based on a target vector of the training sample and a loss function of the deep learning model, wherein the loss function is set to have an adjustable loss weight for each keypoint;
updating the deep learning model based on the gradient values and performing a second stage training on the deep learning model with another training sample; and
detecting a candidate image based on the trained deep learning model to output key points in the candidate image.
2. The method of claim 1, wherein associating each keypoint with a corresponding importance label based on the importance of the keypoint region in which the keypoint is located in the training image comprises:
dividing the key points of a plurality of training samples in the training sample set into at least a first key point set and a second key point set according to the importance of the key point region where each key point is located;
assigning a first importance label to a keypoint of the first set of keypoints; and
assigning a second importance label to the key points in the second key point set, wherein the importance of the key point region in which the key points in the second key point set are located is higher than that of the key point region in which the key points in the first key point set are located, and the second importance label is larger than the first importance label.
3. The method of claim 2, wherein performing the second-stage training on the deep learning model with another training sample comprises:
determining whether a change value of the gradient values is less than a predetermined threshold value;
in response to determining that the change value of the gradient values is less than the predetermined threshold, increasing the loss weight for the keypoints with the second importance label; and
performing the second-stage training on the deep learning model with the other training sample based on the increased loss weight, wherein the other training sample is a result of the first-stage training of the training sample or a different training sample in the training sample set.
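Claim 3's weight adjustment could look roughly like the sketch below; the threshold, the 1.5x boost factor, and the label value are all assumptions, not values taken from the claims:

```python
# Sketch of claim 3: when the change in gradient values drops below a threshold,
# raise the loss weight of every key point carrying the second importance label.
# Threshold, boost factor, and label value are illustrative assumptions.
SECOND_LABEL = 2.0
THRESHOLD = 1e-3
BOOST = 1.5

def maybe_boost_weights(prev_grad_norm, grad_norm, labels, loss_weights):
    """labels[i] is the importance label of key point i; loss_weights is mutated in place."""
    if abs(grad_norm - prev_grad_norm) < THRESHOLD:
        for i, label in enumerate(labels):
            if label == SECOND_LABEL:
                loss_weights[i] *= BOOST
    return loss_weights

# Example: training has stalled, so the second-label key points get heavier weights.
weights = maybe_boost_weights(0.5012, 0.5010, [1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0])
print(weights)  # [1.0, 1.0, 1.5, 1.5]
```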
4. The method of claim 2, wherein determining the training sample set further comprises:
counting the distribution of the key points in the second key point set that belong to the same key point region; and
dividing the training samples in the training sample set into regular samples and special samples based on the distribution, wherein the number of regular samples is greater than the number of special samples, and wherein performing the second-stage training on the deep learning model with another training sample comprises:
determining whether a change value of the gradient values is less than a predetermined threshold value;
in response to determining that the change value of the gradient values is less than the predetermined threshold, selecting the special sample as the other training sample; and
performing the second-stage training on the deep learning model using the special sample.
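One possible reading of claim 4, sketched below: a sample is treated as "special" when it contains a key point region that is rare among the second-set key points, and such a sample is preferred for the second stage once training stalls. The 10% rarity rule and the dictionary layout are illustrative assumptions:

```python
# Sketch of one reading of claim 4: split training samples into regular vs. special
# based on how key point regions are distributed, and prefer a special sample when
# the gradient change is small. The 10% rarity rule and dict layout are assumptions.
from collections import Counter

def split_samples(samples):
    """samples: list of dicts like {"image": ..., "regions": ["mouth", "eye", ...]}."""
    region_counts = Counter(r for s in samples for r in s["regions"])
    rare_regions = {r for r, c in region_counts.items() if c < 0.1 * len(samples)}
    special = [s for s in samples if rare_regions & set(s["regions"])]
    regular = [s for s in samples if not (rare_regions & set(s["regions"]))]
    return regular, special

def pick_second_stage_sample(regular, special, grad_change, threshold=1e-3):
    # Training has stalled -> use a special sample for the second stage.
    if grad_change < threshold and special:
        return special[0]
    return regular[0]
```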
5. The method of claim 2, wherein performing the second-stage training on the deep learning model with another training sample comprises:
determining whether a change value of the gradient values is less than a predetermined threshold value;
in response to determining that the change value of the gradient values is less than the predetermined threshold, selecting key points that belong to the same key point region from the results of the first-stage training to construct the other training sample; and
performing the second-stage training on a local learning model using the other training sample, wherein the local learning model is smaller than the deep learning model.
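A rough sketch of claim 5: key points predicted for one region in the first stage are gathered into a new sample, and a much smaller "local" model is trained on it. The region names, tensor shapes, and layer sizes are illustrative assumptions:

```python
# Sketch of claim 5: collect the first-stage key points of one region into a new
# sample and run a second-stage step on a much smaller "local" model.
import torch
import torch.nn as nn

def region_sample(stage1_keypoints, region_ids, target_region):
    """Keep only key points predicted for one region (e.g. the mouth)."""
    idx = [i for i, r in enumerate(region_ids) if r == target_region]
    return stage1_keypoints[:, idx, :]                    # (batch, region_kp, 2)

local_model = nn.Sequential(                              # far smaller than the full model
    nn.Flatten(), nn.Linear(4 * 2, 16), nn.ReLU(), nn.Linear(16, 4 * 2),
)

stage1_out = torch.rand(1, 8, 2)                          # pretend first-stage predictions
region_ids = ["eye"] * 4 + ["mouth"] * 4
sample = region_sample(stage1_out, region_ids, "mouth")   # (1, 4, 2)

target = torch.rand(1, 4, 2)                              # ground-truth mouth key points
refined = local_model(sample).view(1, 4, 2)
loss = nn.functional.mse_loss(refined, target)
loss.backward()                                           # gradients for the local model only
```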
6. The method of claim 1, wherein performing the first-stage training on the deep learning model using one training sample in the training sample set to obtain a target vector of the training sample comprises:
inputting the training sample into a feature extraction layer of the deep learning model to extract a feature vector of the training sample; and
inputting the feature vector into a fully connected layer of the deep learning model to obtain a target vector of the training sample.
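Claim 6 describes a conventional backbone-plus-head layout; a minimal sketch, with assumed layer sizes, is:

```python
# Sketch of claim 6: a feature-extraction backbone followed by a fully connected
# layer mapping the feature vector to the target vector. Layer sizes are assumptions.
import torch
import torch.nn as nn

num_keypoints = 68
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),                # -> feature vector of length 64
)
fully_connected = nn.Linear(64, num_keypoints * 2)

image = torch.randn(1, 3, 128, 128)                       # one training sample
feature_vector = feature_extractor(image)                 # feature-extraction layer
target_vector = fully_connected(feature_vector)           # (1, 136): x, y per key point
```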
7. A computing device, comprising:
at least one processor; and
at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions when executed by the at least one processor causing the computing device to perform the steps of the method of any of claims 1-6.
8. A computer-readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210303055.4A CN114882305A (en) | 2022-03-24 | 2022-03-24 | Image key point detection method, computing device and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114882305A true CN114882305A (en) | 2022-08-09 |
Family
ID=82666654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210303055.4A Pending CN114882305A (en) | 2022-03-24 | 2022-03-24 | Image key point detection method, computing device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114882305A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019071370A1 (en) * | 2017-10-09 | 2019-04-18 | Intel Corporation | Feature fusion for multi-modal machine learning analysis |
CN109858466A (en) * | 2019-03-01 | 2019-06-07 | 北京视甄智能科技有限公司 | A kind of face critical point detection method and device based on convolutional neural networks |
CN110889446A (en) * | 2019-11-22 | 2020-03-17 | 高创安邦(北京)技术有限公司 | Face image recognition model training and face image recognition method and device |
CN111967406A (en) * | 2020-08-20 | 2020-11-20 | 高新兴科技集团股份有限公司 | Method, system, equipment and storage medium for generating human body key point detection model |
Non-Patent Citations (1)
Title |
---|
谢佳标 (Xie Jiabiao): "Keras深度学习:入门" (Keras Deep Learning: Getting Started), 29 October 2021, 机械工业出版社 (China Machine Press), page 34 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116796021A (en) * | 2023-08-28 | 2023-09-22 | 上海任意门科技有限公司 | Image retrieval method, system, electronic device and medium |
CN116796021B (en) * | 2023-08-28 | 2023-12-05 | 上海任意门科技有限公司 | Image retrieval method, system, electronic device and medium |
CN116910296A (en) * | 2023-09-08 | 2023-10-20 | 上海任意门科技有限公司 | Method, system, electronic device and medium for identifying transport content |
CN116910296B (en) * | 2023-09-08 | 2023-12-08 | 上海任意门科技有限公司 | Method, system, electronic device and medium for identifying transport content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||