CN114299427A - Method and device for detecting key points of target object, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114299427A
CN114299427A (Application No. CN202111593989.8A)
Authority
CN
China
Prior art keywords
target object
video frame
key points
similarity
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111593989.8A
Other languages
Chinese (zh)
Inventor
耿淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111593989.8A priority Critical patent/CN114299427A/en
Publication of CN114299427A publication Critical patent/CN114299427A/en
Pending legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure relates to a method and a device for detecting key points of a target object, an electronic device, and a storage medium. The method includes: acquiring the similarity between adjacent first and second video frames; when the similarity is smaller than a preset threshold, calling an object detection model to determine a target object region in the second video frame; when the similarity is greater than or equal to the preset threshold, determining the target object region in the second video frame based on the target object key points already detected in the first video frame; and then detecting the target object key points in the second video frame based on that region. With this scheme, the object detection model is invoked only when the similarity is low, which avoids frequent calls and reduces terminal power consumption; when the similarity is high, the target object region is determined from the key points of the previous video frame, while a scene change still triggers the model in time so that the previous frame's key points are not mistakenly reused. Recognition accuracy is thereby effectively improved, and a balance is achieved between terminal power consumption and key point recognition accuracy.

Description

Method and device for detecting key points of target object, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for detecting key points of a target object, an electronic device, and a storage medium.
Background
With the development of computer technology, human body key point detection has become an increasingly important underlying algorithmic capability. Taking video editing as an example, different parts of a human body in a video are identified through key point detection, and a user can then beautify a specified part, such as slimming the legs or the waist, to achieve a better video effect.
In the related art, for human body beautification during video editing, human body key points can be identified frame by frame, or video frames can be sampled at a preset interval for key point identification.
However, the former approach places a high demand on the computing power of the processing device, while the latter reduces the amount of computation to some extent but detects human body key points less accurately. Therefore, in the related art, it is difficult to balance recognition accuracy and processing-device performance when identifying human body key points.
Disclosure of Invention
The present disclosure provides a method and an apparatus for detecting key points of a target object, an electronic device, and a storage medium, so as to at least solve the problem in the related art that recognition accuracy and processing-device performance are difficult to balance when recognizing key points. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, a method for detecting key points of a target object is provided, including:
acquiring the similarity between adjacent first and second video frames, wherein the first video frame precedes the second video frame in the video frame sequence, and the second video frame is the video frame currently to be detected;
when the similarity is smaller than a preset threshold value, calling an object detection model to determine a target object area in the second video frame;
when the similarity is larger than or equal to the preset threshold, determining a target object area in the second video frame based on the detected target object key points of the first video frame;
and detecting a target object key point in the second video frame based on the target object area determined from the second video frame.
In an exemplary embodiment, the detecting a target object key point in the second video frame based on the target object region determined from the second video frame includes:
inputting the image corresponding to the target object region into a key point detection model, detecting the probability that each pixel point in the image is a key point of the target object through the key point detection model, and determining the key point of the target object in the image based on the probability.
In an exemplary embodiment, the keypoint detection model is obtained based on:
acquiring a training image containing a target object and the real key points corresponding to the target object in the training image, wherein the real key points comprise first-class key points and second-class key points; the first-class key points comprise key points of limb parts, and the second-class key points comprise key points of body parts;
inputting the training image into a neural network model to be trained, identifying the probability that each pixel point in the training image is the target object key point through the neural network model, and determining a plurality of prediction key points based on the probability;
determining a current loss function of the neural network model through a supervision module preset in the neural network model, wherein the supervision module determines the loss function based on a first loss function corresponding to the real key points and the predicted key points, and a second loss function corresponding to a reference vector and a current vector; the reference vector represents the relative position between the first-class key points and the second-class key points among the real key points, and the current vector represents the relative position between the predicted first-class key points and the predicted second-class key points among the plurality of predicted key points;
adjusting parameters of the neural network model based on the loss function until a training end condition is met to obtain a trained neural network model;
and deleting the supervision module from the trained neural network model to obtain the key point detection model.
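The training objective of this embodiment can be sketched in Python as a combination of the two losses described above (an illustrative sketch only, not the claimed implementation: the mean-squared-error form of each loss and the `alpha` weight are assumptions, and all function names are hypothetical):

```python
import numpy as np

def keypoint_loss(pred, true):
    # First loss: discrepancy between predicted and real key point coordinates.
    return float(np.mean((pred - true) ** 2))

def vector_loss(pred_limb, pred_body, true_limb, true_body):
    # Second loss: compare the current vector (predicted limb -> body relative
    # position) against the reference vector (real limb -> body relative position).
    current_vec = pred_limb - pred_body
    reference_vec = true_limb - true_body
    return float(np.mean((current_vec - reference_vec) ** 2))

def supervised_loss(pred, true, limb_idx, body_idx, alpha=1.0):
    # Total loss used by the supervision module: keypoint term plus vector term.
    l1 = keypoint_loss(pred, true)
    l2 = vector_loss(pred[limb_idx], pred[body_idx],
                     true[limb_idx], true[body_idx])
    return l1 + alpha * l2
```

Because the supervision module only contributes to the loss, it can be dropped after training without changing the forward pass, which is why the final key point detection model omits it.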
In an exemplary embodiment, the neural network model comprises a plurality of serially connected network modules, each of which outputs a corresponding heat map for its input image; the response value of each pixel point in the heat map represents the probability that the pixel point is a key point of the target object;
the inputting the training image into a neural network model to be trained so as to identify the probability that each pixel point in the training image is the key point of the target object through the neural network model comprises:
taking a feature map corresponding to the training image as the input image of the first network module in the neural network model, and taking the heat map output by each network module as the input image of the next network module;
and determining the probability that each pixel point in the training image is a key point of the target object based on the heat map output by the last network module.
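The cascaded forward pass of the serially connected modules can be sketched as follows (a minimal illustration; the internals of each module are abstracted away and all names are hypothetical):

```python
def cascade_predict(modules, feature_map):
    """Pass the input through serially connected network modules.

    The first module receives the training image's feature map; each later
    module receives the heat map output by the previous module. The final
    heat map gives per-pixel key point probabilities.
    """
    x = feature_map
    for module in modules:
        x = module(x)
    return x
```

This refinement-by-stages structure is what allows a supervision module to be attached to every intermediate heat map during training.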
In an exemplary embodiment, each network module is correspondingly provided with a supervision module;
the determining, by a preset supervision module, a current loss function of the neural network model based on a reference vector and the current vector includes:
determining the current vectors corresponding to the plurality of predicted key points in the heat map output by each network module, and determining, through the supervision module corresponding to that network module, the local loss function of the network module based on the reference vector and the corresponding current vector;
determining a loss function of the neural network model based on the local loss functions corresponding to the plurality of network modules respectively.
In an exemplary embodiment, the obtaining the similarity between the adjacent first video frame and the second video frame includes:
acquiring a first gray scale image corresponding to a first video frame, and determining first distribution information corresponding to pixel values in the first video frame based on the first gray scale image;
acquiring a second gray map corresponding to a second video frame, and determining second distribution information corresponding to pixel values in the second video frame based on the second gray map;
and determining the similarity between the first video frame and the second video frame according to the similarity of the first distribution information and the second distribution information.
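One way to realize this embodiment is to compare normalized grayscale histograms of the two frames, for example via histogram intersection (an illustrative sketch; the disclosure does not fix a particular distribution representation or similarity measure, and the 256-bin choice is an assumption):

```python
import numpy as np

def gray_histogram(gray):
    # gray: (H, W) grayscale image; return a normalized 256-bin
    # distribution of pixel values as the frame's "distribution information".
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    return hist / hist.sum()

def frame_similarity(gray1, gray2):
    # Histogram intersection: 1.0 for identical distributions,
    # 0.0 for fully disjoint ones.
    h1, h2 = gray_histogram(gray1), gray_histogram(gray2)
    return float(np.minimum(h1, h2).sum())
```

A histogram-based measure is cheap to compute per frame, which matters since this similarity check runs on every frame before any model is invoked.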
In an exemplary embodiment, the determining a target object region in the second video frame based on the target object keypoints detected by the first video frame comprises:
acquiring the target object key points corresponding to the first video frame, and determining a bounding box enclosing the target object key points of the first video frame;
and determining the target object region in the second video frame according to the region occupied by the bounding box in the first video frame.
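This embodiment amounts to computing a bounding box around the previous frame's key points, optionally enlarged by a margin to tolerate small movements (illustrative only; the margin, its default value, and the function name are assumptions not specified by the disclosure):

```python
def region_from_keypoints(keypoints, margin=0.1):
    # keypoints: iterable of (x, y) from the previous video frame.
    # Returns (x0, y0, x1, y1): the enclosing box, enlarged by `margin`
    # (a fraction of the box size) on each side.
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)
```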
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for detecting a key point of a target object, including:
a similarity acquisition unit configured to perform acquisition of a similarity between adjacent first and second video frames; the video frame sequence of the first video frame is before the second video frame, and the second video frame is the current video frame to be detected;
a first target object identification unit configured to execute, when the similarity is smaller than a preset threshold, calling an object detection model to determine a target object region in the second video frame;
a second target object identification unit configured to perform, when the similarity is greater than or equal to the preset threshold, determining a target object region in the second video frame based on the target object key points detected by the first video frame;
a key point detection unit configured to perform detection of a target object key point in the second video frame based on the target object region determined from the second video frame.
In an exemplary embodiment, the key point detecting unit includes:
and the image input module is configured to input an image corresponding to the target object region into a key point detection model, detect the probability that each pixel point in the image is a key point of the target object through the key point detection model, and determine the key point of the target object in the image based on the probability.
In an exemplary embodiment, the keypoint detection model is obtained based on:
a training image acquisition unit configured to acquire a training image containing a target object and the real key points corresponding to the target object in the training image, wherein the real key points comprise first-class key points and second-class key points; the first-class key points comprise key points of limb parts, and the second-class key points comprise key points of body parts;
a training image input unit configured to perform input of the training image to a neural network model to be trained, to identify probabilities that respective pixel points in the training image are the target object key points by the neural network model, and to determine a plurality of prediction key points based on the probabilities;
a loss function determination unit configured to determine a current loss function of the neural network model through a supervision module preset in the neural network model, wherein the supervision module determines the loss function based on a first loss function corresponding to the real key points and the predicted key points, and a second loss function corresponding to a reference vector and a current vector; the reference vector represents the relative position between the first-class key points and the second-class key points among the real key points, and the current vector represents the relative position between the predicted first-class key points and the predicted second-class key points among the plurality of predicted key points;
the neural network model obtaining unit is configured to adjust parameters of the neural network model based on the loss function until a training end condition is met, so that a trained neural network model is obtained;
a supervision module deleting unit configured to perform deletion of the supervision module from the trained neural network model, resulting in the key point detection model.
In an exemplary embodiment, the neural network model comprises a plurality of serially connected network modules, each of which outputs a corresponding heat map for its input image; the response value of each pixel point in the heat map represents the probability that the pixel point is a key point of the target object;
the training image input unit includes:
an image prediction module configured to take a feature map corresponding to the training image as the input image of the first network module in the neural network model, and to take the heat map output by each network module as the input image of the next network module;
and a key point probability determination module configured to determine, based on the heat map output by the last network module, the probability that each pixel point in the training image is a key point of the target object.
In an exemplary embodiment, each network module is correspondingly provided with a supervision module;
the loss function determination unit includes:
a local loss function determination module configured to determine the current vectors corresponding to the plurality of predicted key points in the heat map output by each network module, and to determine, through the supervision module corresponding to that network module, the local loss function of the network module based on the reference vector and the corresponding current vector;
a loss function calculation module configured to perform determining a loss function of the neural network model based on the local loss function corresponding to each of the plurality of network modules.
In an exemplary embodiment, the similarity obtaining unit includes:
a first distribution information acquisition module configured to acquire a first gray scale map corresponding to the first video frame and determine first distribution information corresponding to pixel values in the first video frame based on the first gray scale map;
the second distribution information acquisition module is configured to acquire a second gray scale map corresponding to a second video frame and determine second distribution information corresponding to pixel values in the second video frame based on the second gray scale map;
a distribution information comparison module configured to determine the similarity between the first video frame and the second video frame according to the similarity of the first distribution information and the second distribution information.
In an exemplary embodiment, the second target object recognition unit includes:
a bounding box determination module configured to acquire the target object key points corresponding to the first video frame and determine a bounding box enclosing the target object key points of the first video frame;
and a target object region determination module configured to determine the target object region in the second video frame according to the region occupied by the bounding box in the first video frame.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of detecting a target object keypoint as defined in any of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method for detecting a target object keypoint as described in any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, which includes instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method for detecting a target object keypoint as described in any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the scheme, the target object region is determined through the object detection model when the similarity is smaller than the preset threshold, frequent calling of the object detection model is avoided, a large number of operation resources are saved, the power consumption of the terminal is reduced, the target object region is determined based on the key points of the previous video frame when the similarity exceeds the preset threshold, the key points of the previous frame are prevented from being used mistakenly, the identification accuracy is effectively improved, and balance between the power consumption of the terminal and the identification accuracy is achieved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flowchart illustrating a method for detecting key points of a target object according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method for training a keypoint detection model, according to an example embodiment.
FIG. 3 is a block diagram illustrating a keypoint detection model according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating an apparatus for detecting key points of a target object according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
Fig. 1 is a flowchart illustrating a method for detecting target object key points according to an exemplary embodiment. As shown in Fig. 1, the method is described as applied to a terminal by way of example; it is understood that the method may also be applied to a server, or to a system in which a terminal interacts with a server. Specifically, the method may include the following steps.
In step S110, a similarity between adjacent first and second video frames is acquired.
The first video frame and the second video frame may be video frames in a video to be processed; the first video frame precedes the second video frame in the video frame sequence, and the second video frame is the video frame currently to be detected. The video frame sequence may be the playing order of the video frames, or a preset detection order.
In practical application, when detecting a key point of a target object in a video to be processed, a plurality of video frames in the video to be processed can be detected based on a video frame sequence. Specifically, after the first video frame is detected, a second video frame to be detected adjacent to the first video frame may be further acquired, and the similarity between the first video frame and the second video frame may be determined.
In step S120, when the similarity is smaller than a preset threshold, an object detection model is called to determine a target object region in the second video frame.
as an example, the target object may be a human being, or may be other objects to be detected for the keypoint, such as an object of an animal, a building, a machine, or a landscape. The target object region may be an image region of the target object in the second video frame.
The object detection model may be a model that determines a region where a target object in the image is located after analyzing image data corresponding to the input image, and the object detection model may independently identify the target object in the image based on the input image.
After the similarity between the first video frame and the second video frame is obtained, the similarity may be compared with a preset threshold value to determine whether the second video frame to be detected needs to call the object detection model.
In the related art, taking a human body as the target object, key points can be detected frame by frame: a human detector determines the position of the human body in every video frame, so that the position in the current frame is accurately identified even when the shooting scene changes. However, although this identifies the human body position in each video frame reliably, it places a high computational demand on the terminal device; when a user edits video on a lower-performance terminal, playback may stutter or the video may fail to be processed at all.
In the embodiment of the present disclosure, the similarity between the first video frame and the second video frame is obtained. When the similarity is smaller than the preset threshold, it can be determined that the images in the two frames differ substantially; such a low similarity indicates that the scene may have changed between the frames, so the object detection model is invoked at this point to determine the target object region in the second video frame.
In an example, a person skilled in the art may select an object detection model as needed. For example, when the target object is determined in an anchor-based manner, a single-stage SSD (Single Shot MultiBox Detector) algorithm, which runs fast and is convenient to deploy on mobile terminals, may be used, or a two-stage Faster R-CNN algorithm may be used. Of course, the target object may also be determined in an anchor-free manner, for example with the CenterNet algorithm.
In step S130, when the similarity is greater than or equal to the preset threshold, a target object region in the second video frame is determined based on the target object key points detected by the first video frame.
In the related art, video frames may instead be sampled at a fixed interval, for example every 5 or more frames; after a frame is sampled, the target object key points in it are determined using the key points from the previously sampled frame. Although this avoids identifying the human body position in every frame, relieving the data-processing pressure on the terminal device to some extent and lowering its computational requirements, it has two problems: when the scene changes abruptly between frames, the object detection model is not triggered in time and the previous frame's key points continue to be used, making key point identification inaccurate; conversely, when the target object's position is stable across frames, the object detection model may be triggered unnecessarily, increasing device power consumption and wasting resources.
In the present disclosure, the similarity between the first and second video frames is compared with a preset threshold. When the similarity is smaller than the threshold, the object detection model is called in time to determine the target object region in the second video frame, avoiding the continued use of information from the first video frame; when the similarity is greater than or equal to the threshold, the object detection model is not triggered, and the target object region in the second video frame is instead determined from the detected target object key points of the first video frame.
Specifically, the first video frame may be a frame in which target object key points have already been detected. When the next frame is to be detected and the similarity between the first and second video frames is greater than or equal to the preset threshold, the scenes or image contents of the two frames can be considered similar, so the target object region in the second video frame can be determined from the first frame's detected key points without invoking the object detection model.
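The decision logic of steps S110 to S130 can be sketched as a single dispatch function (illustrative only; the helper functions and the default threshold of 0.8 are hypothetical stand-ins, since the disclosure does not fix a threshold value):

```python
def determine_region(first_frame, second_frame, prev_keypoints,
                     similarity_fn, detect_region_fn, region_from_kpts_fn,
                     threshold=0.8):
    # Step S110: similarity between the adjacent frames.
    sim = similarity_fn(first_frame, second_frame)
    if sim < threshold:
        # Step S120: large inter-frame change (possible scene change),
        # so call the object detection model on the second frame.
        return detect_region_fn(second_frame)
    # Step S130: frames are similar, so derive the region from the
    # key points already detected in the first frame.
    return region_from_kpts_fn(prev_keypoints)
```

The expensive detector runs only on the low-similarity branch, which is the source of the power-consumption saving claimed by the scheme.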
In step S140, a target object key point in the second video frame is detected based on the target object region determined from the second video frame.
As an example, the target object keypoints may be points used in video processing or image processing to identify specific parts of the target object. For example, when the target object is a human body, the target object key points may be points for recognizing different parts of the human body.
After the target object region in the second video frame is determined, the target object key points can be detected based on that region. Specifically, the image corresponding to the target object region in the second video frame may be processed, for example by calling a key point identification algorithm or model, to determine the target object key points within it.
In the above method for detecting target object key points, the similarity between adjacent first and second video frames is acquired; when the similarity is smaller than a preset threshold, an object detection model is called to determine the target object region in the second video frame, and when the similarity is greater than or equal to the preset threshold, the target object region in the second video frame is determined based on the key points already detected in the first video frame; the target object key points in the second video frame are then detected based on the determined region. In this scheme, the object detection model is called only when the similarity falls below the preset threshold, which avoids frequent invocation, saves a large amount of computing resources, and reduces terminal power consumption; when the similarity reaches the threshold, the target object region is determined from the previous frame's key points, while a scene change still triggers the model in time so that those key points are not mistakenly reused. Recognition accuracy is thereby effectively improved, and a balance is achieved between terminal power consumption and key point recognition accuracy.
In an exemplary embodiment, the comparison between the similarity and the preset threshold may serve as the sole condition for deciding whether to call the object detection model. Using the similarity as the sole condition greatly reduces the complexity of the algorithm structure, eases deployment, lowers the power consumption of the terminal device, and speeds up video editing on the mobile terminal.
In an exemplary embodiment, in step S140, the detecting a target object key point in the second video frame based on the target object region determined from the second video frame may include the following steps:
inputting the image corresponding to the target object region into a key point detection model, detecting the probability that each pixel point in the image is a key point of the target object through the key point detection model, and determining the key point of the target object in the image based on the probability.
As an example, the keypoint detection model may be a model for identifying keypoints of a target object in an image.
In practical application, after the target object region in the second video frame is determined, the image corresponding to that region may be input to a pre-trained key point detection model; the model performs key point detection on the image, determines the probability that each pixel point in the image is a target object key point, and the target object key points in the image are determined based on these probabilities.
Specifically, different types of target object key points may exist on the same target object. Taking a human body as an example, the key points may correspond to a plurality of body parts such as the shoulders, hands, and legs. For each type of target object key point, after the probability that each pixel point is a target object key point is obtained, one or more pixel points whose probability is higher than a preset probability threshold may be determined as target object key points.
According to the method and the device, the probability that each pixel point in the image is the key point of the target object can be determined through the key point detection model, the key point of the target object is identified based on the probability, and the key point of the target object in the image is accurately identified.
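The per-pixel probability thresholding described above can be sketched as follows; the function name and threshold default are illustrative assumptions, with the probability map representing one key point type.

```python
import numpy as np

def keypoints_from_probability_map(prob_map, prob_threshold=0.5):
    """Select pixel points whose predicted probability of being a target
    object key point exceeds the threshold. Returns (row, col) pairs."""
    rows, cols = np.where(prob_map > prob_threshold)
    return list(zip(rows.tolist(), cols.tolist()))
```

In practice one probability map would be produced per key point type, and this selection applied to each map separately.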
In an exemplary embodiment, as shown in fig. 2, the keypoint detection model may be obtained based on the following:
in step S210, a training image including a target object and a real key point corresponding to the target object in the training image are obtained.
As an example, the real key points may be target object key points pre-labeled on the training image, and may include a first class of key points and a second class of key points. The first class of key points may include key points of a limb part, where the limb part may include at least one of a left upper limb, a right upper limb, a left lower limb, and a right lower limb; for example, the first class of key points may be key points at the following positions: left wrist, left elbow, left shoulder, right wrist, right elbow, right shoulder, left knee, left ankle, right knee, right ankle, and the like. The second class of key points may include key points of torso parts, such as key points on the left crotch or the right crotch.
In a specific implementation, a training image including a target object may be obtained, and a real key point corresponding to the target object in the training image may be determined. Specifically, when image recognition is performed in a top-down (top-down) manner, that is, when a single target object is used as a model input, the training image may be an image including the single target object, and a user may label key points of the target object corresponding to the target object in the training image in advance to obtain real key points corresponding to the training image.
In step S220, the training image is input to a neural network model to be trained, so as to identify the probability that each pixel point in the training image is the key point of the target object through the neural network model, and a plurality of predicted key points are determined based on the probability.
After the training image is obtained, it may be input to the neural network model to be trained; after the model identifies the probability that each pixel point in the training image is a target object key point, a plurality of predicted key points may be determined from the training image based on the probability of each pixel point. For example, a pixel point whose probability is higher than a preset probability threshold may be determined as a predicted key point.
In step S230, a current loss function of the neural network model is determined by a supervision module preset in the neural network model.
The supervision module determines the loss function based on a first loss function corresponding to the real key points and the predicted key points, and a second loss function determined from a reference vector and a current vector. The reference vector is a vector characterizing the relative position between a first-class key point and a second-class key point among the real key points, and the current vector is a vector characterizing the relative position between a predicted first-class key point and a predicted second-class key point among the plurality of predicted key points.
In practical application, a supervision module may be preset in the neural network model. After the plurality of predicted key points in the training image are obtained through the neural network model, the supervision module may compute a first loss function from the real key points and the predicted key points. Meanwhile, the supervision module may obtain the reference vectors corresponding to the first-class and second-class key points among the real key points, and the current vectors corresponding to the predicted first-class and second-class key points among the predicted key points, so that a second loss function may be determined based on the reference vectors and the current vectors. The current loss function of the neural network model may then be determined based on the first loss function and the second loss function.
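A minimal sketch of how the two losses might be combined; the mean-squared-error forms and the `weight` hyperparameter are assumptions, since the patent does not specify the functional forms.

```python
import numpy as np

def supervision_loss(pred_heatmaps, true_heatmaps,
                     current_vectors, reference_vectors, weight=1.0):
    """Combine the first loss (real vs. predicted key points) and the second
    loss (reference vs. current limb vectors) into one training loss."""
    # First loss: discrepancy between predicted and real key point heatmaps.
    first = float(np.mean((pred_heatmaps - true_heatmaps) ** 2))
    # Second loss: penalise limb vectors that drift from the reference geometry.
    second = float(np.mean(np.sum((current_vectors - reference_vectors) ** 2, axis=-1)))
    return first + weight * second
```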
In particular, limb movements are flexible and diverse, so the key points on the limbs can appear in a wide variety of configurations. In the related art, when a key point detection model is trained, the high degree of freedom of the target object's limbs makes limb key points difficult to identify accurately: the model's predictions deviate significantly from the actual situation, and the learning complexity increases.
In the present disclosure, since the key points of the limb parts and the torso parts of the target object have local rigid-body properties, obtaining the vectors corresponding to the first-class and second-class key points makes it possible to define and constrain the linkage between the limbs and the torso. Specifically, after the first-class and second-class key points are obtained from the real key points, the reference vectors corresponding to them may be determined, where a reference vector may be understood as the relative position between different key points. For example, at least one of the following reference vector types may be predetermined: left wrist-left elbow, left elbow-left shoulder, left shoulder-left crotch, left crotch-left knee, left knee-left ankle, right wrist-right elbow, right elbow-right shoulder, right shoulder-right crotch, right crotch-right knee, right knee-right ankle, left shoulder-right shoulder, and left crotch-right crotch. Acquiring the reference vectors clarifies the relative positional relationships that should hold between the corresponding limb or torso key points of the target object during movement.
After the plurality of predicted key points are obtained, the current vectors between the predicted first-class key points and the predicted second-class key points may be determined, where the types of the current vectors may match the types of the reference vectors. For example, the current vectors may include at least one of the same types as the reference vectors: left wrist-left elbow, left elbow-left shoulder, left shoulder-left crotch, left crotch-left knee, left knee-left ankle, right wrist-right elbow, right elbow-right shoulder, right shoulder-right crotch, right crotch-right knee, right knee-right ankle, left shoulder-right shoulder, and left crotch-right crotch.
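Both the reference vectors (from real key points) and the current vectors (from predicted key points) can be computed by the same pairwise-difference routine; the dictionary keys and the particular pairs below are illustrative, not from the patent.

```python
import numpy as np

# A few of the pair types listed above; names are illustrative keys.
VECTOR_PAIRS = [
    ("left_wrist", "left_elbow"),
    ("left_elbow", "left_shoulder"),
    ("left_shoulder", "left_crotch"),
]

def limb_vectors(keypoints, pairs=VECTOR_PAIRS):
    """keypoints maps a part name to its (x, y) position; each returned row
    is the relative position between one configured pair of key points."""
    return np.array([np.subtract(keypoints[b], keypoints[a]) for a, b in pairs])
```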
After the reference vector and the current vector are obtained, a loss function can be determined based on the reference vector and the current vector through a supervision module in the neural network model.
In step S240, based on the loss function, parameters of the neural network model are adjusted until a training end condition is satisfied, so as to obtain a trained neural network model.
In a specific implementation, after the loss function is obtained, parameters of the neural network model may be adjusted based on the loss function, and after the adjustment, the step S210 may be returned to, and the training of the neural network model is repeated until a training end condition is met, for example, when the current iteration number reaches a preset training number or a value corresponding to the loss function is lower than a preset training threshold, the training may be stopped, and the trained neural network model is obtained.
In step S250, the supervision module is deleted from the trained neural network model, so as to obtain the key point detection model.
After the trained neural network model is obtained, the supervision module in the model can be deleted, and a key point detection model for identifying key points of the target object is obtained. By deleting the supervision module from the key point detection model, the normal operation of the terminal equipment can be ensured after the key point detection model is deployed to the terminal equipment, such as a mobile terminal, without increasing extra computation.
In the embodiment of the disclosure, by adding the supervision module in the training process of the key point detection model, the model can be trained to pay attention to the linkage between the limbs and the trunk part, the position prediction of the key points of the target object is limited within a reasonable range, and the detection precision of the key point detection model for the key points of the limbs and the trunk part of the target object is remarkably improved while the complexity of the model training is reduced.
In an exemplary embodiment, the neural network model may include a plurality of serially connected network modules, each configured to output a corresponding heatmap for its input image. The heatmap may also be referred to as a Gaussian heatmap, and the response value of each pixel point in the heatmap characterizes the probability that the pixel point is a target object key point.
In step S220, the inputting the training image into the neural network model to be trained so as to identify, through the neural network model, a probability that each pixel point in the training image is a key point of the target object may include:
taking a feature map corresponding to the training image as the input image of the first network module in the neural network model; taking the heatmap output by each network module as the input image of the next network module; and determining the probability that each pixel point in the training image is a target object key point based on the heatmap output by the last network module.
In practical application, the feature map corresponding to the training image may be used as the input image of the first network module in the neural network model. The first network module generates a corresponding heatmap based on its input image and feeds that heatmap into the next network module; this process repeats, with the heatmap output by each network module serving as the input image of the next.
When the last of the network modules has output its heatmap, the probability that each pixel point in the training image is a target object key point may be determined from the response value of each pixel point in that final heatmap.
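The chaining described above can be sketched as follows, with each network module passed in as a callable; the function name is an illustrative assumption.

```python
def stacked_forward(feature_map, network_modules):
    """Chain the serially connected network modules: the heatmap output by
    each module is the input image of the next one."""
    heatmaps = []
    x = feature_map
    for module in network_modules:
        x = module(x)
        heatmaps.append(x)
    # heatmaps[-1] is the final heatmap from which the per-pixel key point
    # probabilities are read.
    return heatmaps
```

Keeping every intermediate heatmap is what allows per-module supervision, as described in the later embodiment.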
By connecting the network modules in series, the heatmap can be refined multiple times, and the probability that each pixel point in the training image is a target object key point can be determined from the finally output heatmap, which effectively improves the prediction accuracy of the target object key points.
In an exemplary embodiment, each network module may be provided with a corresponding supervision module, and determining the current loss function of the neural network model based on the reference vector and the current vector through the preset supervision modules may include:
determining, for the heatmap output by each network module, the current vectors corresponding to the plurality of predicted key points in that heatmap, and determining, through the supervision module corresponding to the network module, the local loss function of that network module based on the reference vector and the network module's current vector; and determining the loss function of the neural network model based on the local loss functions corresponding to the plurality of network modules.
In a specific implementation, a plurality of serially connected network modules may be provided in the neural network model, and a supervision module may be provided individually for each network module.
When the neural network model obtains the predicted key points in the training image, then for each network module, once the module generates its heatmap from the input image, the plurality of predicted key points corresponding to that heatmap may be determined from the response values of its pixel points, and the current vectors for those predicted key points may be determined from their relative positions. The supervision module corresponding to the network module may then determine the module's local loss function based on the reference vector and the module's current vector.
After the local loss function corresponding to each of the plurality of network modules is determined, a loss function of the neural network model may be determined based on the local loss function corresponding to each of the plurality of network modules.
As shown in fig. 3, which is a schematic structural diagram of a neural network model in an exemplary embodiment of the present disclosure, the neural network model may include a backbone network and a plurality of network modules behind the backbone network.
The backbone network may include a module 301, a module 302, and a residual module, and may be followed by 6 serially connected network modules, each of which includes one module 301 and one module 302. When training the neural network model, a training image may be input into it. After receiving the training image, the neural network model may obtain an original feature map corresponding to the training image through the module 301, perform a linear operation on the original feature map to obtain a simulated feature map, and then splice the original feature map and the simulated feature map and input the result into the module 302. In an example, the feature map generated by the module 301 may specifically be a heatmap, and the module 301 may be a GhostNet module, which generates a few feature maps with a small number of convolution kernels, efficiently generates more simulated feature maps through a simple linear transform, and finally concatenates and outputs the generated feature maps, so that many feature maps can be generated cheaply on a mobile terminal.
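The GhostNet-style generation step can be sketched as follows, with the primary convolution and the cheap linear transform passed in as stand-in callables; kernel shapes and channel counts are not specified by the patent, so this only illustrates the structure.

```python
import numpy as np

def ghost_features(x, primary_conv, cheap_transform):
    """GhostNet-style feature generation: a small number of 'primary' feature
    maps from a real convolution, plus extra simulated maps obtained by a
    cheap linear transform of the primary maps, concatenated together."""
    primary = primary_conv(x)             # few, relatively expensive maps
    simulated = cheap_transform(primary)  # cheap linear transform of the above
    return np.concatenate([primary, simulated], axis=0)
```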
After the module 302 obtains the spliced feature map, it may extract features at different levels of the spliced feature map to obtain a high-resolution feature map and a low-resolution feature map, fuse them, and output the fused feature map to the residual module. The feature map generated by the module 302 may be a heatmap, the module 302 may be an Hourglass module, and the residual module may be a DenseNet module. In this example, stacking the module 301, the module 302, and the residual module makes it possible to detect the target object key points more accurately at the same computation budget.
After the residual module performs feature extraction on its input feature map and outputs the result to network module 1, network module 1 may generate a corresponding heatmap through its module 301 and module 302. The supervision module corresponding to network module 1 may determine a current vector from that heatmap, and then determine local loss function-1 from this current vector and the predetermined reference vector. The subsequent serially connected network modules determine local loss function-2 through local loss function-6 in the same manner. Finally, the loss function of the neural network model may be determined based on local loss functions 1 to 6, and the parameters of the neural network model may be adjusted based on that loss function.
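The per-module supervision described above can be sketched as a single pass that collects one local loss per stage; summing the local losses is an assumed aggregation, since the patent only says the total is determined "based on" them.

```python
def staged_total_loss(feature_map, network_modules, reference_vectors,
                      vectors_from_heatmap, vector_loss):
    """One supervision module per network module: each stage's heatmap yields
    current vectors, which are compared against the reference vectors to form
    that stage's local loss."""
    x, local_losses = feature_map, []
    for module in network_modules:
        x = module(x)                          # this stage's heatmap
        current = vectors_from_heatmap(x)      # predicted limb vectors
        local_losses.append(vector_loss(current, reference_vectors))
    return sum(local_losses)                   # assumed aggregation: simple sum
```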
By determining the loss function of the neural network model from the local loss functions of the individual network modules, the overall parameters of the model can be optimized more accurately on top of the heatmap refinement performed by the successive network modules, yielding more accurate target object key point identification results.
In an exemplary embodiment, the key point detection model may further make full use of the timing information between multiple video frames when performing key point detection; for example, a DCPose module may be used in the key point detection model to incorporate the influence of preceding and following video frames on the current video frame, or a structure such as HRNet may be used.
In an exemplary embodiment, in step S110, the obtaining the similarity between the adjacent first video frame and the second video frame may include the following steps:
acquiring a first grayscale map corresponding to the first video frame, and determining first distribution information corresponding to the pixel values in the first video frame based on the first grayscale map; acquiring a second grayscale map corresponding to the second video frame, and determining second distribution information corresponding to the pixel values in the second video frame based on the second grayscale map; and determining the similarity between the first video frame and the second video frame according to the similarity of the first distribution information and the second distribution information.
As an example, the pixel value in the first video frame may be a pixel value corresponding to each pixel point in the first grayscale map; the pixel values in the second video frame may be pixel values corresponding to pixel points in the second gray scale map.
The first distribution information may represent the distribution of pixel values in the first grayscale map, the second distribution information may represent the distribution of pixel values in the second grayscale map, and both may be histograms representing the distribution of pixel values.
In a specific implementation, the similarity can be determined based on histograms of the video frames, which is fast to compute and yields an accurate similarity. Specifically, after the second video frame currently to be detected is acquired, the first video frame adjacent to it may be determined, and the first grayscale map corresponding to the first video frame may be acquired. The pixel values of the pixel points in the first grayscale map may then be obtained, and distribution statistics may be computed over them, for example as a histogram representing the distribution of the pixel values, to obtain the first distribution information corresponding to the pixel values in the first video frame.
Similarly, for the second video frame, after the grayscale image corresponding to the second video frame is obtained, the pixel value corresponding to each pixel point in the second grayscale image may be obtained, and distribution statistics may be performed on the plurality of pixel values to obtain second distribution information corresponding to the pixel values in the second video frame.
After the first distribution information and the second distribution information are obtained, the similarity of the first distribution information and the second distribution information may be obtained, for example, when the first distribution information and the second distribution information are histograms, the similarity of the first distribution information and the second distribution information may be determined by comparing the coincidence degree of the two histograms.
After determining the similarity of the first distribution information and the second distribution information, the similarity between the first video frame and the second video frame may be determined based on the similarity.
In the present disclosure, by determining the first distribution information corresponding to the pixel values in the first video frame and the second distribution information corresponding to the pixel values in the second video frame, the similarity between the first video frame and the second video frame can be determined according to the similarity of the two pieces of distribution information. Comparing the pixel-value distributions of the images thus determines the similarity between video frames quickly and accurately, providing an accurate decision basis without increasing the power consumption of the terminal device.
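The histogram comparison described above can be sketched as follows; histogram intersection is used here as one common way to measure the coincidence degree of two histograms, though the patent does not fix a particular measure, and the 256-bin layout is an assumption for 8-bit grayscale frames.

```python
import numpy as np

def histogram_similarity(gray_a, gray_b, bins=256):
    """Similarity of two grayscale frames via histogram intersection of their
    pixel-value distributions; returns a value in [0, 1]."""
    ha, _ = np.histogram(gray_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(gray_b, bins=bins, range=(0, 256))
    ha = ha / ha.sum()   # normalise counts to pixel-value distributions
    hb = hb / hb.sum()
    return float(np.minimum(ha, hb).sum())  # 1.0 means identical distributions
```

The result can be compared directly against the preset threshold to decide whether the object detection model must be called.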
In another example, the similarity calculation algorithm may also be chosen according to the actual situation; for example, a cosine similarity algorithm, a perceptual hash algorithm (pHash), a content feature algorithm, or the like may be used to calculate the similarity between the first video frame and the second video frame.
In an exemplary embodiment, the determining a target object region in the second video frame based on the target object keypoints detected by the first video frame comprises:
acquiring the target object key points corresponding to the first video frame, and determining the circumscribed frame corresponding to the target object key points of the first video frame; and determining the target object region in the second video frame according to the region of the circumscribed frame in the first video frame.
In a specific implementation, when the similarity between the first video frame and the second video frame is greater than or equal to a preset threshold, the detected key point of the target object of the first video frame may be obtained, and the circumscribed frame corresponding to the key point of the target object may be determined.
Specifically, when identifying target object key points, the identification may be performed in a top-down manner or a bottom-up manner, that is, with a single target object or with the entire image as the input, respectively. In this embodiment, when a single target object is used as the input, then for each target object in the first video frame, the target object key points corresponding to that target object may be obtained and the circumscribed frame corresponding to those key points determined, for example, as the minimum circumscribed rectangle.
After the circumscribed frame corresponding to the target object key point in the first video frame is obtained, the target object region in the second video frame may be determined according to the region of the circumscribed frame in the first video frame, for example, the region of the circumscribed frame in the first video frame may be obtained, and the coordinate corresponding to the region may be determined as the coordinate corresponding to the target object region in the second video frame, so as to locate the target object region in the second video frame.
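The circumscribed-frame step above can be sketched as follows; the function name and the (left, top, right, bottom) tuple layout are illustrative assumptions.

```python
def region_from_keypoints(keypoints):
    """Minimum circumscribed rectangle of the previous frame's key points;
    its coordinates are reused directly as the target object region in the
    current frame. keypoints is a list of (x, y) pairs."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))  # (left, top, right, bottom)
```

In practice the rectangle is often slightly enlarged to tolerate small inter-frame motion, but the patent describes only the direct coordinate reuse.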
In the related art, a plurality of video frames can each be identified independently: the region corresponding to the target object is identified in each frame, and the target object key points are then determined within that region. However, when the frames are detected independently and the detected target object regions differ significantly between frames, the target object key points are prone to visible jitter across the video frame sequence, which causes considerable inconvenience for subsequent video editing, such as jitter in beautification effects applied to the target object and unnatural transitions between frames.
In the present disclosure, when the similarity between the first video frame and the second video frame is greater than or equal to the preset threshold, the region corresponding to the target object in the second video frame is determined based on the region where the target object key points of the adjacent previous video frame (that is, the first video frame) are located. Under similar video scenes, for example when no scene change occurs, this prevents the target object's position from differing noticeably between the first and second video frames, reduces the jitter of target object key points between video frames, makes the transition of target object key points between frames natural, and optimizes the video beautification effect.
It should be understood that, although the steps in the flowcharts of fig. 1 and 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 and 2 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It is understood that the same or similar parts of the method embodiments described above in this specification may be referred to for one another; each embodiment focuses on its differences from the other embodiments, and for the remaining details, reference may be made to the descriptions of the other method embodiments.
Fig. 4 is a block diagram illustrating an apparatus for detecting key points of a target object according to an exemplary embodiment. Referring to fig. 4, the apparatus includes a similarity acquisition unit 401, a first target object recognition unit 402, a second target object recognition unit 403, and a key point detection unit 404.
A similarity acquisition unit 401 configured to acquire the similarity between adjacent first and second video frames, where the first video frame precedes the second video frame in the video frame sequence, and the second video frame is the current video frame to be detected;
a first target object identification unit 402, configured to execute, when the similarity is smaller than a preset threshold, invoking an object detection model to determine a target object region in the second video frame;
a second target object identification unit 403 configured to perform, when the similarity is greater than or equal to the preset threshold, determining a target object area in the second video frame based on the target object key points detected by the first video frame;
a keypoint detection unit 404 configured to perform detecting target object keypoints in the second video frame based on the target object region determined from the second video frame.
In an exemplary embodiment, the key point detecting unit includes:
and the image input module is configured to input an image corresponding to the target object region into a key point detection model, detect the probability that each pixel point in the image is a key point of the target object through the key point detection model, and determine the key point of the target object in the image based on the probability.
In an exemplary embodiment, the keypoint detection model is obtained based on:
the training image acquisition unit is configured to acquire a training image containing a target object and real key points corresponding to the target object in the training image; the real key points comprise a first class key point and a second class key point; the first class of key points comprises key points of a limb part, and the second class of key points comprises key points of a body part;
a training image input unit configured to perform input of the training image to a neural network model to be trained, to identify probabilities that respective pixel points in the training image are the target object key points by the neural network model, and to determine a plurality of prediction key points based on the probabilities;
a loss function determination unit configured to determine a current loss function of the neural network model through a supervision module preset in the neural network model, where the supervision module determines the loss function based on a first loss function corresponding to the real key points and the predicted key points and a second loss function corresponding to a reference vector and a current vector, the reference vector being a vector characterizing the relative position between a first-class key point and a second-class key point among the real key points, and the current vector being a vector characterizing the relative position between a predicted first-class key point and a predicted second-class key point among the plurality of predicted key points;
a neural network model obtaining unit configured to adjust parameters of the neural network model based on the loss function until a training end condition is met, to obtain a trained neural network model;
a supervision module deleting unit configured to delete the supervision module from the trained neural network model to obtain the key point detection model.
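A minimal sketch of the combined loss the supervision module could compute, assuming mean-squared error for both terms and a weighting factor `alpha`; the disclosure does not name the concrete loss functions, the key point pairing, or how the two terms are combined, so all of these are illustrative:

```python
import numpy as np

def supervised_loss(pred_pts, true_pts, limb_body_pairs, alpha=1.0):
    """First loss: error between predicted and real key points.
    Second loss: error between current vectors (predicted limb minus
    predicted body point) and reference vectors (real limb minus real
    body point), penalizing implausible relative positions between
    the two classes of key points."""
    pred = np.asarray(pred_pts, dtype=float)
    true = np.asarray(true_pts, dtype=float)
    first = np.mean((pred - true) ** 2)
    cur = np.array([pred[i] - pred[j] for i, j in limb_body_pairs])
    ref = np.array([true[i] - true[j] for i, j in limb_body_pairs])
    second = np.mean((cur - ref) ** 2)
    return first + alpha * second
```

The second term is what the supervision module adds over a plain key point regression loss; once training ends, the module (and this term) is removed, so inference cost is unchanged.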
In an exemplary embodiment, the neural network model comprises a plurality of serially connected network modules, each configured to output a corresponding heat map for an input image; the response value of each pixel point in the heat map represents the probability that the pixel point is a target object key point;
the training image input unit includes:
an image prediction module configured to take a feature map corresponding to the training image as the input image of the first network module in the neural network model, and to take the heat map output by each preceding network module as the input image of the next network module;
and a key point probability determination module configured to determine, based on the heat map output by the last network module, the probability that each pixel point in the training image is a target object key point.
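The serial cascade described above can be sketched as follows (an illustrative reduction in which each `module` stands in for a trained sub-network; the function name is an assumption):

```python
def cascade_forward(feature_map, modules):
    """Feed the feature map to the first network module; each later
    module takes the heat map produced by its predecessor. All
    intermediate heat maps are returned so that each module can be
    supervised separately during training; the last one is used for
    the final key point probabilities."""
    heatmaps = []
    x = feature_map
    for module in modules:
        x = module(x)
        heatmaps.append(x)
    return heatmaps
```

This is the same refine-by-stages pattern used by stacked heat map regressors: later modules see the earlier predictions and can correct them.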
In an exemplary embodiment, each network module is correspondingly provided with a supervision module;
the loss function determination unit includes:
a local loss function determination module configured to determine the current vectors corresponding to the plurality of predicted key points in the heat map output by each network module, and to determine, through the supervision module corresponding to each network module, the local loss function of that network module based on the reference vector and the current vector corresponding to the network module;
a loss function calculation module configured to determine the loss function of the neural network model based on the local loss functions corresponding to the plurality of network modules.
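The aggregation of per-module local losses can be sketched as a simple sum (equal weighting is an assumption; the disclosure does not say how the local losses are combined):

```python
def model_loss(module_heatmaps, local_loss_fn):
    """Intermediate supervision: every network module's heat map gets
    its own local loss, and the model loss aggregates all of them, so
    gradients reach early modules directly rather than only through
    the final output."""
    return sum(local_loss_fn(hm) for hm in module_heatmaps)
```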
In an exemplary embodiment, the similarity obtaining unit includes:
a first distribution information acquisition module configured to acquire a first gray scale map corresponding to the first video frame and determine, based on the first gray scale map, first distribution information corresponding to the pixel values in the first video frame;
a second distribution information acquisition module configured to acquire a second gray scale map corresponding to the second video frame and determine, based on the second gray scale map, second distribution information corresponding to the pixel values in the second video frame;
a distribution information comparison module configured to determine the similarity between the first video frame and the second video frame according to the similarity between the first distribution information and the second distribution information.
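One plausible realization of this distribution comparison is histogram intersection over normalized grayscale histograms; this is a sketch only, since the disclosure fixes neither the distribution representation nor the similarity measure, and the bin count is an assumption:

```python
import numpy as np

def frame_similarity(gray_a, gray_b, bins=32):
    """Compare the pixel-value distributions (normalized grayscale
    histograms) of two frames. Returns a value in [0, 1], where 1
    means identical distributions (histogram intersection)."""
    ha, _ = np.histogram(gray_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(gray_b, bins=bins, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return float(np.minimum(ha, hb).sum())
```

Because it ignores pixel positions, this measure is cheap to compute per frame, which is consistent with the goal of avoiding a full object detection pass when consecutive frames look alike.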
In an exemplary embodiment, the second target object recognition unit includes:
a circumscribing box determination module configured to acquire the target object key points corresponding to the first video frame and determine the circumscribing box corresponding to those key points;
a target object region determination module configured to perform determining a target object region in the second video frame according to a region of the circumscribing box in the first video frame.
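A sketch of deriving the target object region from the previous frame's key points; the margin used to expand the circumscribing box is an assumption (the disclosure only states that the region is determined from the box's area in the first video frame):

```python
def region_from_keypoints(points, margin=0.1):
    """Circumscribing box of the previous frame's key points, expanded
    by `margin` of the box size on each side so that small
    inter-frame motion keeps the object inside the region."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)
```

Cropping the second frame to this region lets the key point detection model run on a small image instead of invoking the full object detection model again.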
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 is a block diagram illustrating an electronic device 500 for implementing a method for detecting key points of a target object according to an example embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so forth.
Referring to fig. 5, electronic device 500 may include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, interface to input/output (I/O) 512, sensor component 514, and communication component 516.
The processing component 502 generally controls overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the electronic device 500. Examples of such data include instructions for any application or method operating on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
The power supply component 506 provides power to the various components of the electronic device 500. The power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 500.
The multimedia component 508 includes a screen providing an output interface between the electronic device 500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 also includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the electronic device 500. For example, the sensor assembly 514 may detect the open/closed state of the electronic device 500 and the relative positioning of components, such as the display and keypad of the electronic device 500; it may also detect a change in the position of the electronic device 500 or of a component of the electronic device 500, the presence or absence of user contact with the electronic device 500, the orientation or acceleration/deceleration of the electronic device 500, and a change in the temperature of the electronic device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate wired or wireless communication between the electronic device 500 and other devices. The electronic device 500 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the electronic device 500 to perform the above-described method is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by the processor 520 of the electronic device 500 to perform the above-described method.
Fig. 6 is a block diagram illustrating an electronic device 600 for implementing a method for detecting key points of a target object according to an example embodiment. For example, the electronic device 600 may be a server. Referring to fig. 6, electronic device 600 includes a processing component 620 that further includes one or more processors, and memory resources, represented by memory 622, for storing instructions, such as application programs, that are executable by processing component 620. The application programs stored in memory 622 may include one or more modules that each correspond to a set of instructions. Further, the processing component 620 is configured to execute instructions to perform the above-described methods.
The electronic device 600 may further include: a power component 624 configured to perform power management for the electronic device 600, a wired or wireless network interface 626 configured to connect the electronic device 600 to a network, and an input/output (I/O) interface 628. The electronic device 600 may operate based on an operating system stored in the memory 622, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 622 comprising instructions, executable by the processor of the electronic device 600 to perform the above-described method is also provided. The storage medium may be a computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by a processor of the electronic device 600 to perform the above-described method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for detecting key points of a target object is characterized by comprising the following steps:
acquiring a similarity between an adjacent first video frame and second video frame; the first video frame precedes the second video frame in the video frame sequence, and the second video frame is the current video frame to be detected;
when the similarity is smaller than a preset threshold value, calling an object detection model to determine a target object area in the second video frame;
when the similarity is larger than or equal to the preset threshold, determining a target object area in the second video frame based on the detected target object key points of the first video frame;
and detecting a target object key point in the second video frame based on the target object area determined from the second video frame.
2. The method of claim 1, wherein detecting a target object keypoint in the second video frame based on the target object region determined from the second video frame comprises:
inputting the image corresponding to the target object region into a key point detection model, detecting the probability that each pixel point in the image is a key point of the target object through the key point detection model, and determining the key point of the target object in the image based on the probability.
3. The method of claim 2, wherein the keypoint detection model is derived based on:
acquiring a training image containing a target object and the real key points corresponding to the target object in the training image; the real key points comprise first-class key points and second-class key points; the first-class key points comprise key points of limb parts, and the second-class key points comprise key points of body parts;
inputting the training image into a neural network model to be trained, identifying, through the neural network model, the probability that each pixel point in the training image is the target object key point, and determining a plurality of predicted key points based on the probabilities;
determining a current loss function of the neural network model through a supervision module preset in the neural network model; the supervision module determines the loss function based on a first loss function corresponding to the real key points and the predicted key points, and a second loss function corresponding to a reference vector and a current vector, wherein the reference vector is a vector representing the relative position between the first-class key points and the second-class key points among the real key points, and the current vector is a vector representing the relative position between the predicted first-class key points and the predicted second-class key points among the plurality of predicted key points;
adjusting parameters of the neural network model based on the loss function until a training end condition is met to obtain a trained neural network model;
and deleting the supervision module from the trained neural network model to obtain the key point detection model.
4. The method of claim 3, wherein the neural network model comprises a plurality of serially connected network modules, each configured to output a corresponding heat map for an input image; the response value of each pixel point in the heat map represents the probability that the pixel point is a target object key point;
the inputting the training image into a neural network model to be trained so as to identify the probability that each pixel point in the training image is the key point of the target object through the neural network model comprises:
taking a feature map corresponding to the training image as the input image of the first network module in the neural network model, and taking the heat map output by each preceding network module as the input image of the next network module;
and determining, based on the heat map output by the last network module, the probability that each pixel point in the training image is the target object key point.
5. The method according to claim 4, wherein each network module is correspondingly provided with a supervision module;
the determining, by a preset supervision module, a current loss function of the neural network model based on a reference vector and the current vector includes:
determining the current vectors corresponding to the plurality of predicted key points in the heat map output by each network module, and determining, through the supervision module corresponding to each network module, the local loss function of that network module based on the reference vector and the current vector corresponding to the network module;
determining a loss function of the neural network model based on the local loss functions corresponding to the plurality of network modules respectively.
6. The method of claim 1, wherein obtaining the similarity between the adjacent first video frame and the second video frame comprises:
acquiring a first gray scale image corresponding to a first video frame, and determining first distribution information corresponding to pixel values in the first video frame based on the first gray scale image;
acquiring a second gray scale image corresponding to the second video frame, and determining second distribution information corresponding to pixel values in the second video frame based on the second gray scale image;
and determining the similarity between the first video frame and the second video frame according to the similarity between the first distribution information and the second distribution information.
7. An apparatus for detecting key points of a target object, comprising:
a similarity acquisition unit configured to acquire a similarity between an adjacent first video frame and second video frame; the first video frame precedes the second video frame in the video frame sequence, and the second video frame is the current video frame to be detected;
a first target object identification unit configured to, when the similarity is less than a preset threshold, call an object detection model to determine the target object region in the second video frame;
a second target object identification unit configured to, when the similarity is greater than or equal to the preset threshold, determine the target object region in the second video frame based on the target object key points detected in the first video frame;
a key point detection unit configured to detect the target object key points in the second video frame based on the target object region determined from the second video frame.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of target object keypoints detection according to any one of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of detecting target object keypoints according to any one of claims 1 to 6.
10. A computer program product comprising instructions therein, which when executed by a processor of an electronic device, enable the electronic device to perform a method of detecting target object keypoints according to any one of claims 1 to 6.
CN202111593989.8A 2021-12-23 2021-12-23 Method and device for detecting key points of target object, electronic equipment and storage medium Pending CN114299427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111593989.8A CN114299427A (en) 2021-12-23 2021-12-23 Method and device for detecting key points of target object, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111593989.8A CN114299427A (en) 2021-12-23 2021-12-23 Method and device for detecting key points of target object, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114299427A true CN114299427A (en) 2022-04-08

Family

ID=80970455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111593989.8A Pending CN114299427A (en) 2021-12-23 2021-12-23 Method and device for detecting key points of target object, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114299427A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115166790A (en) * 2022-05-23 2022-10-11 集度科技有限公司 Road data processing method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN110602527B (en) Video processing method, device and storage medium
CN106651955B (en) Method and device for positioning target object in picture
CN110782468B (en) Training method and device of image segmentation model and image segmentation method and device
RU2577188C1 (en) Method, apparatus and device for image segmentation
CN111047526A (en) Image processing method and device, electronic equipment and storage medium
EP2998960A1 (en) Method and device for video browsing
CN107463903B (en) Face key point positioning method and device
CN109961094B (en) Sample acquisition method and device, electronic equipment and readable storage medium
CN112188091B (en) Face information identification method and device, electronic equipment and storage medium
EP3040912A1 (en) Method and device for classifying pictures
CN114266840A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110941727B (en) Resource recommendation method and device, electronic equipment and storage medium
JP7491867B2 (en) VIDEO CLIP EXTRACTION METHOD, VIDEO CLIP EXTRACTION DEVICE AND STORAGE MEDIUM
CN111178298A (en) Human body key point detection method and device, electronic equipment and storage medium
CN112115894A (en) Training method and device for hand key point detection model and electronic equipment
CN112200040A (en) Occlusion image detection method, device and medium
CN112508974A (en) Training method and device of image segmentation model, electronic equipment and storage medium
CN114299427A (en) Method and device for detecting key points of target object, electronic equipment and storage medium
CN107480773B (en) Method and device for training convolutional neural network model and storage medium
CN111797746B (en) Face recognition method, device and computer readable storage medium
CN111145080B (en) Training method of image generation model, image generation method and device
CN113642551A (en) Nail key point detection method and device, electronic equipment and storage medium
CN110659726B (en) Image processing method and device, electronic equipment and storage medium
US11715234B2 (en) Image acquisition method, image acquisition device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination