CN114582016A - Key point positioning method and device, electronic equipment and storage medium - Google Patents

Key point positioning method and device, electronic equipment and storage medium

Info

Publication number
CN114582016A
CN114582016A (application CN202210194119.1A)
Authority
CN
China
Prior art keywords
image
key point
confidence
hidden layer
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210194119.1A
Other languages
Chinese (zh)
Inventor
张夏杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202210194119.1A priority Critical patent/CN114582016A/en
Publication of CN114582016A publication Critical patent/CN114582016A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application discloses a key point positioning method and device, an electronic device, and a storage medium. The key point positioning method includes: cropping a second image based on a first confidence of a first image, where the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object; and inputting the cropped second image into a first set model to obtain key point information of the second image and a first confidence of the second image, where the first set model is used to identify the key point information and first confidence of an input image, and where the first image is the frame preceding the second image in the current video stream.

Description

Key point positioning method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and an apparatus for locating a key point, an electronic device and a storage medium.
Background
Key point positioning is a technique that uses image processing and machine learning to find the positions of key points in an image. As a basis of many computer vision tasks, key point positioning is widely applied in human-computer interaction scenarios based on video streams, such as Augmented Reality (AR), Virtual Reality (VR), and automatic driving. When key points are positioned in a video stream, positioning takes a long time.
Disclosure of Invention
In view of this, embodiments of the present application provide a key point positioning method and device, an electronic device, and a storage medium, so as to at least solve the problem in the related art that key point positioning in a video stream takes a long time.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a key point positioning method, including:
cropping a second image based on a first confidence of a first image, where the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object;
inputting the cropped second image into a first set model to obtain key point information of the second image and a first confidence of the second image, the first set model being used to identify the key point information and first confidence of an input image; wherein
the first image is the frame preceding the second image in the current video stream.
In the foregoing solution, the cropping the second image based on the first confidence of the first image includes:
determining a crop box corresponding to the second image based on the relationship between the first confidence of the first image and a first set threshold; and
cropping the second image according to the determined crop box.
In the foregoing solution, the determining the crop box corresponding to the second image based on the relationship between the first confidence of the first image and the first set threshold includes:
when the first confidence of the first image is greater than the first set threshold, determining the crop box corresponding to the second image based on the key point information of the first image; and
when the first confidence of the first image is less than or equal to the first set threshold, inputting the second image into a second set model to obtain the crop box corresponding to the second image, the second set model being used to locate the target object in an input image to obtain a corresponding crop box.
In the foregoing solution, when the first confidence of the first image is greater than the first set threshold, the determining the crop box corresponding to the second image based on the key point information of the first image includes:
determining a first region in the first image based on the key point information of the first image, all key points of the first image being located within the first region; and
obtaining the crop box corresponding to the second image by adjusting the first region.
In the foregoing solution, the first set model includes a first hidden layer, a second hidden layer, and a third hidden layer, the first hidden layer and the second hidden layer being connected in parallel with each other and in series with the third hidden layer; and the inputting the cropped second image into the first set model to obtain the key point information of the second image and the first confidence of the second image includes:
inputting the cropped second image into the third hidden layer to obtain a first feature map corresponding to the second image;
inputting the first feature map into the first hidden layer to obtain the key point information of the second image; and
inputting the first feature map into the second hidden layer to obtain the first confidence of the second image.
In the foregoing solution, before the inputting the cropped second image into the third hidden layer, the method further includes:
inputting a video stream sample into the first set model to obtain second confidences of at least two frames of the video stream sample;
determining a loss value based on the second confidences of the at least two image samples and the corresponding calibration results; and
updating the weight parameters of the second hidden layer of the first set model according to the determined loss value.
In the foregoing solution, at least one frame in the video stream carries hand key points.
An embodiment of the present application further provides a key point positioning device, including:
a first processing unit configured to crop a second image based on a first confidence of a first image, where the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object; and
a second processing unit configured to input the cropped second image into a first set model to obtain key point information of the second image and a first confidence of the second image, the first set model being used to identify the key point information and first confidence of an input image; wherein
the first image is the frame preceding the second image in the current video stream.
An embodiment of the present application further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor,
wherein the processor is configured to execute the steps of the above key point positioning method when running the computer program.
An embodiment of the present application further provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above key point positioning method.
In the key point positioning method provided by the embodiments of the present application, a second image is cropped based on the confidence that the image obtained by cropping a first image contains the target object, and the cropped second image is input into a first set model to obtain key point information of the second image and the confidence that the image obtained by cropping the second image contains the target object, where the first image is the frame preceding the second image in the current video stream. Because the electronic device crops each frame based on the confidence computed for the previous frame, the number of detection passes over the frames of the current video stream can be reduced, computing power is saved, the electronic device positions key points faster, and the time consumed by key point positioning is reduced.
Drawings
Fig. 1 is a schematic diagram of an architecture of a key point positioning system according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart illustrating an implementation of a key point positioning method according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a method for determining a crop box according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a first setting model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a keypoint model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a hand confidence branch according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a key point positioning device according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Key point positioning is a technique that uses image processing and machine learning to find the positions of key points in an image. As a basis of many computer vision tasks, key point positioning is widely applied in human-computer interaction scenarios based on video streams, such as AR, VR, and automatic driving. For example, when using VR glasses, a user operates the virtual interface through recognition of hand movements.
In the related art, key points in a video stream are positioned by a detection model and a key point positioning model combined in series, which consumes substantial computing power and makes positioning time-consuming. Moreover, with the rapid development of mobile terminal devices, such model combinations are increasingly deployed on mobile terminals, yet most terminal devices, limited by their computing resources, can hardly support normal use of the model combination, so the application scenarios of the models are limited.
Based on this, in the key point positioning method provided by the embodiments of the present application, a second image is cropped based on the confidence that the image obtained by cropping a first image contains the target object, and the cropped second image is input into a first set model to obtain key point information of the second image and the confidence that the image obtained by cropping the second image contains the target object, where the first image is the frame preceding the second image in the current video stream. Because the electronic device crops each frame based on the confidence computed for the previous frame, the number of detection passes over the frames of the current video stream can be reduced, computing power is saved, key point positioning is faster, and its time consumption is reduced.
Furthermore, by saving the computing power the model consumes, the model can be applied broadly across electronic devices and scenarios.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is a schematic diagram of the architecture of the key point positioning system provided by an embodiment of the present application. In the embodiments of the present application, the related-art combination of a detection model and a key point positioning model connected in series is improved to obtain the architecture shown in fig. 1.
The first frame of the video stream is input into the detection model to obtain a crop box, the target region is cropped out of the first frame, and the cropped region is input into the key point model. The key point branch outputs the key point positioning result of the first frame, and the hand confidence branch outputs a hand confidence result that determines whether the next frame is input into the detection model. If the confidence is below the set threshold, the next frame is input into the detection model to obtain a crop box; if the confidence is above the set threshold, the next frame need not pass through the detection model; instead, the target object is cropped out of the next frame according to the key point positioning result of the previous frame and fed into the key point model, yielding the key point positioning result and hand confidence result of the next frame.
In the embodiments of the present application, by optimizing the architecture of the key point positioning system, the number of detection passes over the frames of the current video stream can be reduced, computing power is saved, and key point positioning is accelerated, so that its time consumption is reduced.
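To make this control flow concrete, the following is a minimal Python sketch of the per-frame loop described above. The helper names (detector, keypoint_model, crop, crop_box_from_keypoints) are illustrative assumptions rather than names from this application, and the 0.95 threshold follows the example value given later in this description.

    def locate_keypoints(frames, detector, keypoint_model, threshold=0.95):
        """Frame-skipping key point pipeline (illustrative sketch).

        detector(frame) -> crop box, i.e. the detection model;
        keypoint_model(image) -> (keypoints, confidence), i.e. the key point model.
        """
        prev_keypoints, prev_confidence = None, 0.0
        results = []
        for frame in frames:
            if prev_keypoints is None or prev_confidence <= threshold:
                # First frame, or previous confidence too low: run detection.
                box = detector(frame)
            else:
                # Skip detection: derive the crop box from the previous
                # frame's key point result (see the fig. 3 sketch below).
                box = crop_box_from_keypoints(prev_keypoints)
            cropped = crop(frame, box)
            prev_keypoints, prev_confidence = keypoint_model(cropped)
            results.append((prev_keypoints, prev_confidence))
        return results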
The following describes in detail the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems by embodiments and with reference to the drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flowchart of an implementation of the key point positioning method provided by an embodiment of the present application. The method is executed by an electronic device, including but not limited to mobile terminal devices such as mobile phones, tablets, smart watches, and smart home appliances. As shown in fig. 2, the method includes:
step 201: based on the first confidence of the first image, the second image is cropped.
Here, the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object, and the first image is the frame preceding the second image in the current video stream.
A crop box corresponding to the second image is determined based on the confidence that the image obtained by cropping the first image in the current video stream contains the target object, and the second image is cropped according to the determined crop box.
Here, the confidence characterizes the likelihood that the cropped image contains the target object: the higher the confidence, the more likely the corresponding image contains the target object.
Step 202: inputting the cut second image into a first set model to obtain key point information of the second image and a first confidence coefficient of the second image; the first set model is used to identify keypoint information and a first confidence of the input image.
The cropped second image is input into the first set model to obtain the key point information of the second image and the confidence that the cropped second image contains the target object.
Here, the first set model is used to identify the key point information and first confidence of an input image. Since the cropped second image is a part or the whole of the uncropped second image, the key point information of the uncropped second image can be determined from the key point information of the cropped second image together with the position of the crop box used when cropping the second image, as sketched below. In practical applications, the parameters of some hidden layers of the first set model may be set according to model parameters in the related art to obtain the key point information of the second image. Here, the first set model is the key point positioning model in fig. 1.
The key point information of the second image includes, but is not limited to, the following representations: key points marked on the second image; or information for determining the key points, such as the coordinates of the key points.
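As one illustration of the coordinate mapping just described, the sketch below projects key points predicted in the cropped image back into the uncropped frame using the position of the crop box. The function name and the square 256-pixel model input are assumptions (256 matches the application embodiment later in this description).

    def keypoints_to_full_frame(keypoints, box, input_size=256):
        """Map key points from crop coordinates back to the original frame.

        keypoints: iterable of (x, y) in the model's input coordinates;
        box: (left, top, right, bottom) crop box in the original frame;
        input_size: model input resolution (assumed square).
        """
        left, top, right, bottom = box
        sx = (right - left) / input_size  # horizontal scale, crop -> frame
        sy = (bottom - top) / input_size  # vertical scale, crop -> frame
        return [(left + x * sx, top + y * sy) for (x, y) in keypoints]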
Because the electronic device crops each frame based on the confidence that the cropped previous frame contains the target object, the number of detection passes over the frames of the current video stream can be reduced, computing power is saved, key point positioning is faster, and its time consumption is reduced. Furthermore, by saving the computing power the model consumes, the model can be applied broadly across electronic devices and scenarios.
In an embodiment, the cropping the second image based on the first confidence of the first image includes:
determining a crop box corresponding to the second image based on the relationship between the first confidence of the first image and a first set threshold; and
cropping the second image according to the determined crop box.
The crop box corresponding to the second image is determined in the corresponding manner according to how the confidence that the image obtained by cropping the previous frame of the current video stream contains the target object compares with the first set threshold, and the second image is cropped according to the determined crop box.
In practical applications, even if the image input into the first set model does not contain the target object, the first set model still performs key point recognition on the input image and outputs key point information, but that key point information does not correspond to a real target object. Therefore, whether a frame's key point information needs to be detected afresh is decided according to whether the previous frame contains the target object, thereby achieving frame skipping of key point detection over the video stream.
In an embodiment, the determining the crop box corresponding to the second image based on the relationship between the first confidence of the first image and the first set threshold includes:
when the first confidence of the first image is greater than the first set threshold, determining the crop box corresponding to the second image based on the key point information of the first image; and
when the first confidence of the first image is less than or equal to the first set threshold, inputting the second image into a second set model to obtain the crop box corresponding to the second image, the second set model being used to locate the target object in an input image to obtain a corresponding crop box.
Whether the first confidence of the first image is greater than the first set threshold is judged, and the crop box corresponding to the second image is determined in the corresponding manner:
when the first confidence of the first image is greater than the first set threshold, the crop box corresponding to the second image is determined based on the key point information of the first image. The first set threshold may, for example, be set to 0.95.
When the first confidence of the first image is less than or equal to the first set threshold, the second image is input into the second set model to obtain the crop box corresponding to the second image. Here, the second set model is the detection model in fig. 1.
When the confidence that the image obtained by cropping the previous frame contains the target object is above the first set threshold, the previous frame very likely contains the target object, its key point information corresponds to a real target object, and, because the frames of a video stream are coherent, the crop box of the current frame can be determined from the key point information of the previous frame.
Furthermore, by saving the computing power the model consumes, the model can be applied broadly across electronic devices and scenarios.
For the first frame of the current video stream there is no previous frame, so the electronic device inputs the first frame into the second set model to obtain the corresponding crop box, crops the first frame according to that crop box, and inputs the cropped first frame into the first set model to obtain the key point information of the first frame and the confidence that the cropped first frame contains the target object.
In an embodiment, when the first confidence of the first image is greater than the first set threshold, the determining the crop box corresponding to the second image based on the key point information of the first image includes:
determining a first region in the first image based on the key point information of the first image, all key points of the first image being located within the first region; and
obtaining the crop box corresponding to the second image by adjusting the first region.
When the first confidence of the first image is greater than the first set threshold, a first region is determined in the first image based on the key point information of the first image such that all key points of the first image fall within the first region, and the first region is adjusted to determine the position of the crop box used for cropping the second image, thereby obtaining the crop box corresponding to the second image.
In some embodiments, when determining the position of the crop box for cropping the second image, a region may be scaled by a set multiple about its center. Here, the region being scaled may be the first region itself, or a second region determined in the first image based on the size of the first region.
Taking the schematic diagram of fig. 3 for determining a crop box in an image as an example, the manner of scaling the second region by a set multiple to determine the position of the crop box for cropping the second image is described as follows:
based on the key point information of the n-th frame, the circumscribed rectangle containing all key points is determined as the first region; based on the larger of its length and width (here, the length), a square sharing the rectangle's center is determined as the second region; the second region is scaled about its own center by a set multiple of 1.5, and the position of the crop box of the (n+1)-th frame is thereby determined, yielding the crop box corresponding to the (n+1)-th frame.
In practical applications, the first region may be a regular shape such as a rectangle, an ellipse, or a trapezoid, or it may be an irregular shape.
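A minimal sketch of the fig. 3 procedure, assuming the key points are given as an (N, 2) NumPy array and using the set multiple of 1.5 from the example; clamping the resulting box to the image bounds is omitted.

    import numpy as np

    def crop_box_from_keypoints(keypoints, scale=1.5):
        """Derive the (n+1)-th frame's crop box from the n-th frame's key points."""
        x_min, y_min = keypoints.min(axis=0)  # first region: circumscribed
        x_max, y_max = keypoints.max(axis=0)  # rectangle of all key points
        cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        # Second region: square sharing the rectangle's center, with side
        # equal to the larger of the rectangle's length and width.
        side = max(x_max - x_min, y_max - y_min)
        # Scale the square about its own center by the set multiple.
        half = side * scale / 2.0
        return (cx - half, cy - half, cx + half, cy + half)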
In an embodiment, the first set model includes a first hidden layer, a second hidden layer, and a third hidden layer, the first hidden layer and the second hidden layer being connected in parallel with each other and in series with the third hidden layer; and the inputting the cropped second image into the first set model to obtain the key point information of the second image and the first confidence of the second image includes:
inputting the cropped second image into the third hidden layer to obtain a first feature map corresponding to the second image;
inputting the first feature map into the first hidden layer to obtain the key point information of the second image; and
inputting the first feature map into the second hidden layer to obtain the first confidence of the second image.
Fig. 4 is a schematic structural diagram of the first set model provided by an embodiment of the present application. As shown in fig. 4, the first set model includes a first hidden layer, a second hidden layer, and a third hidden layer.
The electronic device inputs the cropped second image into the third hidden layer of the first set model for feature extraction, obtaining a first feature map corresponding to the input image (i.e., the cropped second image), and inputs the first feature map into the first hidden layer and the second hidden layer of the first set model respectively, obtaining the key point information and the first confidence corresponding to the input image.
Here, one or more of the first hidden layer, the second hidden layer, and the third hidden layer may be a neural network, and a hidden layer implemented as a neural network includes at least one convolutional layer. The third hidden layer may be a fully convolutional layer, or may be a Support Vector Machine (SVM) classifier, a softmax classifier, or the like.
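The shared-backbone, two-head layout of fig. 4 can be sketched as follows in Python with PyTorch; the framework and the module names are assumptions, since the application does not name an implementation. The three sub-modules stand in for the third, first, and second hidden layers respectively.

    import torch.nn as nn

    class FirstSetModel(nn.Module):
        """Shared backbone (third hidden layer) feeding two parallel heads:
        a key point head (first hidden layer) and a confidence head
        (second hidden layer)."""

        def __init__(self, backbone, keypoint_head, confidence_head):
            super().__init__()
            self.backbone = backbone
            self.keypoint_head = keypoint_head
            self.confidence_head = confidence_head

        def forward(self, x):
            features = self.backbone(x)                   # first feature map
            keypoints = self.keypoint_head(features)      # key point information
            confidence = self.confidence_head(features)   # first confidence
            return keypoints, confidence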
In an embodiment, before the inputting the cropped second image into the third hidden layer, the method further includes:
inputting a video stream sample into the first set model to obtain second confidences of at least two frames of the video stream sample;
determining a loss value based on the second confidences of the at least two image samples and the corresponding calibration results; and
updating the weight parameters of the second hidden layer of the first set model according to the determined loss value.
The first set model is trained on video stream samples: at least two frames of a video stream sample are input into the first set model as image samples to obtain a second confidence for each image sample, a loss value is determined based on the second confidences of the image samples and the corresponding calibration results, and the weight parameters of the second hidden layer of the first set model are updated according to the determined loss value.
The electronic device updates the weight parameters of the second hidden layer according to the loss value of the first set model so as to improve the accuracy of the first confidence output by the model. The electronic device back-propagates the loss value of the first set model through the second hidden layer; during this back propagation, the gradient of the loss function is computed from the loss value, and the weight parameters of the layer reached by back propagation are updated along the direction of gradient descent.
The electronic device takes the updated weight parameters as the weight parameters used by the trained first set model.
Here, an update stop condition may be set; when it is met, the weight parameters obtained by the last update are taken as those used by the trained first set model. The update stop condition may be, for example, a set number of training rounds (epochs), one round being one pass of training the first set model on at least one image sample. Of course, the update stop condition is not limited to this; it may also be, for example, a target mean Average Precision (mAP).
It should be noted that a loss function is used to measure the degree of inconsistency between the model's predicted values and the true (calibration) values. In practical applications, model training is carried out by minimizing the loss function.
Back propagation is defined relative to forward propagation, which refers to the feed-forward processing of the model; back propagation runs in the opposite direction and refers to updating the weight parameters of each layer of the model according to the model's output result.
In an embodiment, before the second hidden layer is trained, the model parameters other than those of the second hidden layer are fixed, and the loss value is computed only for the second hidden layer, so that the first set model gains a confidence output for judging whether an image contains the target object without affecting the model's key point recognition capability.
In addition, when building the first set model, the hidden layers other than the second hidden layer may be configured according to model parameters in the related art, which simplifies model training and allows the key point positioning method of this embodiment to be applied broadly to existing key point positioning scenarios.
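A sketch of that training step in PyTorch, under stated assumptions: the SGD optimizer, the learning rate, and the mean-squared-error criterion (matching the L2 loss named in the application embodiment below) are choices made here rather than details given by the text, and the model is assumed to expose a confidence_head sub-module as in the earlier sketch.

    import torch

    def train_confidence_head(model, loader, lr=1e-3):
        """Update only the second hidden layer (confidence head) weights."""
        for p in model.parameters():                # freeze the backbone and
            p.requires_grad = False                 # the key point head
        for p in model.confidence_head.parameters():
            p.requires_grad = True
        optimizer = torch.optim.SGD(model.confidence_head.parameters(), lr=lr)
        criterion = torch.nn.MSELoss()              # L2 loss
        for images, targets in loader:              # image samples and their
            _, confidence = model(images)           # calibration results
            loss = criterion(confidence, targets)   # loss value
            optimizer.zero_grad()
            loss.backward()                         # gradients reach only the
            optimizer.step()                        # confidence head's weights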
As mentioned above, when using VR glasses, a user operates the virtual interface through recognition of hand movements. In an embodiment, at least one frame in the video stream carries hand key points.
Positioning the key points of a non-rigid body with many degrees of freedom is comparatively difficult: unlike in face key point positioning, the hand is usually modeled as 21 joint points with higher degrees of freedom, so a hand key point positioning model needs stronger modeling capability, a larger number of parameters, and more computation.
Applied to scenarios that position non-rigid key points such as hand key points, the present application can increase the speed of key point positioning and save the model's computing power consumption, making the model broadly applicable to electronic devices typified by mobile terminal devices and to various human-computer interaction scenarios.
The following takes hand key point positioning as a concrete scenario, builds a key point model with an encode-decode structure, and describes the present application in further detail with reference to an application embodiment.
Fig. 5 is a schematic structural diagram of the key point model provided by the application embodiment, in which a classification branch (also called the hand confidence branch) is added after the encoding part, in parallel with the decoding part (i.e., the key point branch).
The cropped 256 × 256 × 3 image is input into the encoding part, which outputs an 8 × 8 × 96 feature map; this feature map is fed into the key point branch and the hand confidence branch respectively to obtain the results output by each branch.
In practical applications, the key point branch can be configured as needed. Fig. 6 shows the structure of the hand confidence branch, which reduces the 8 × 8 × 96 feature map to 1 × 1 × 1: three convolutional layers reduce the resolution from 8 × 8 to 2 × 2, a max pooling layer reduces 2 × 2 to 1 × 1, a 1 × 1 convolution reduces the number of channels to 1, and finally a softmax activation function constrains the confidence result to between 0 and 1.
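One possible realization of the fig. 6 branch, under stated assumptions: the kernel sizes, strides, and intermediate channel width are not given in the text, and because softmax over a single logit is constant, a sigmoid is used here to constrain the scalar output to (0, 1).

    import torch.nn as nn

    # Hand confidence branch: 8 x 8 x 96 feature map -> scalar confidence.
    confidence_branch = nn.Sequential(
        nn.Conv2d(96, 96, kernel_size=3, stride=2, padding=1),  # 8x8 -> 4x4
        nn.ReLU(inplace=True),
        nn.Conv2d(96, 96, kernel_size=3, stride=2, padding=1),  # 4x4 -> 2x2
        nn.ReLU(inplace=True),
        nn.Conv2d(96, 96, kernel_size=3, stride=1, padding=1),  # 2x2 -> 2x2
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),        # 2x2 -> 1x1
        nn.Conv2d(96, 1, kernel_size=1),    # channels: 96 -> 1
        nn.Flatten(),                       # (N, 1, 1, 1) -> (N, 1)
        nn.Sigmoid(),                       # constrain result to (0, 1)
    )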
During training of the hand confidence branch, L2 loss is adopted, the parameters of the encoding part and the decoding part are frozen, and only the weight parameters of the hand confidence branch are updated.
In the application embodiment of the present application, by adding the hand confidence branch to the key point model, the architecture of the hand key point positioning algorithm is optimized, frame skipping of key point detection over the video stream is achieved, the overall running speed of the algorithm is improved, and, by saving the model's computing power consumption, the model becomes broadly applicable to various electronic devices and scenarios.
In order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides a key point positioning apparatus, as shown in fig. 7, the apparatus includes:
a first processing unit 701 configured to crop a second image based on a first confidence of a first image, where the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object; and
a second processing unit 702 configured to input the cropped second image into a first set model to obtain key point information of the second image and a first confidence of the second image, the first set model being used to identify the key point information and first confidence of an input image; wherein
the first image is the frame preceding the second image in the current video stream.
In an embodiment, the first processing unit 701 is configured to:
determine a crop box corresponding to the second image based on the relationship between the first confidence of the first image and a first set threshold; and
crop the second image according to the determined crop box.
In an embodiment, the first processing unit 701 is further configured to:
when the first confidence of the first image is greater than the first set threshold, determine the crop box corresponding to the second image based on the key point information of the first image; and
when the first confidence of the first image is less than or equal to the first set threshold, input the second image into a second set model to obtain the crop box corresponding to the second image, the second set model being used to locate the target object in an input image to obtain a corresponding crop box.
In an embodiment, in a case that the first confidence of the first image is greater than a first set threshold, the first processing unit 701 is further configured to:
determine a first region in the first image based on the key point information of the first image, all key points of the first image being located within the first region; and
obtain the crop box corresponding to the second image by adjusting the first region.
In an embodiment, the first set model includes a first hidden layer, a second hidden layer, and a third hidden layer, the first hidden layer and the second hidden layer being connected in parallel with each other and in series with the third hidden layer; and the second processing unit 702 is configured to:
input the cropped second image into the third hidden layer to obtain a first feature map corresponding to the second image;
input the first feature map into the first hidden layer to obtain the key point information of the second image; and
input the first feature map into the second hidden layer to obtain the first confidence of the second image.
In one embodiment, the apparatus further comprises:
an input unit configured to input a video stream sample into the first set model to obtain second confidences of at least two frames of the video stream sample;
a third processing unit configured to determine a loss value based on the second confidences of the at least two image samples and the corresponding calibration results; and
an updating unit configured to update the weight parameters of the second hidden layer of the first set model according to the determined loss value.
In an embodiment, at least one frame in the video stream carries hand key points.
In practical applications, the first processing unit 701, the second processing unit 702, the input unit, the third processing unit, and the updating unit may be implemented by a processor in the key point positioning device, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that when the key point positioning device provided by the above embodiment performs key point positioning, the division into the above program modules is merely illustrative; in practical applications, the processing may be distributed among different program modules as needed, that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above. In addition, the key point positioning device and the key point positioning method provided by the above embodiments belong to the same concept; the details of their implementation are given in the method embodiments and are not repeated here.
Based on the hardware implementation of the program module, and in order to implement the method for locating a key point in the embodiment of the present application, an embodiment of the present application further provides an electronic device, as shown in fig. 8, where the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
and the processor 2, connected to the communication interface 1 for information interaction with other devices, and configured, when running a computer program, to execute the method provided by one or more of the above technical solutions. The computer program is stored in the memory 3.
In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 8.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be a volatile memory or a nonvolatile memory, and may include both volatile and nonvolatile memories. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.
The methods disclosed in the embodiments of the present application may be applied to, or implemented by, the processor 2. The processor 2 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 2 or by instructions in the form of software. The processor 2 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 can implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium is located in the memory 3, and the processor 2 reads the program in the memory 3 and completes the steps of the foregoing methods in combination with its hardware.
When the processor 2 executes the program, the corresponding processes in the methods of the embodiments of the present application are implemented, and for brevity, are not described herein again.
In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The technical means described in the embodiments of the present application may be combined arbitrarily provided there is no conflict. Unless otherwise specified and limited, the term "coupled" is to be construed broadly, e.g., as meaning an electrical connection, or communication between two elements, whether direct or indirect through an intermediary; the specific meanings of such terms are understood by those skilled in the art according to the specific situation.
In addition, in the examples of the present application, "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that objects distinguished by "first \ second \ third" may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Various combinations of the specific features in the embodiments described in the detailed description may be made without contradiction, for example, different embodiments may be formed by different combinations of the specific features, and in order to avoid unnecessary repetition, various possible combinations of the specific features in the present application will not be described separately.

Claims (10)

1. A key point positioning method, comprising:
cropping a second image based on a first confidence of a first image, wherein the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object;
inputting the cropped second image into a first set model to obtain key point information of the second image and a first confidence of the second image, the first set model being used to identify the key point information and first confidence of an input image; wherein
the first image is the frame preceding the second image in a current video stream.
2. The method of claim 1, wherein the cropping the second image based on the first confidence of the first image comprises:
determining a crop box corresponding to the second image based on the relationship between the first confidence of the first image and a first set threshold; and
cropping the second image according to the determined crop box.
3. The method of claim 2, wherein the determining the crop box corresponding to the second image based on the relationship between the first confidence of the first image and the first set threshold comprises:
when the first confidence of the first image is greater than the first set threshold, determining the crop box corresponding to the second image based on the key point information of the first image; and
when the first confidence of the first image is less than or equal to the first set threshold, inputting the second image into a second set model to obtain the crop box corresponding to the second image, the second set model being used to locate the target object in an input image to obtain a corresponding crop box.
4. The method of claim 3, wherein, when the first confidence of the first image is greater than the first set threshold, the determining the crop box corresponding to the second image based on the key point information of the first image comprises:
determining a first region in the first image based on the key point information of the first image, all key points of the first image being located within the first region; and
obtaining the crop box corresponding to the second image by adjusting the first region.
5. The method of claim 1, wherein the first set model comprises a first hidden layer, a second hidden layer, and a third hidden layer, the first hidden layer and the second hidden layer being connected in parallel with each other and in series with the third hidden layer; and the inputting the cropped second image into the first set model to obtain the key point information of the second image and the first confidence of the second image comprises:
inputting the cropped second image into the third hidden layer to obtain a first feature map corresponding to the second image;
inputting the first feature map into the first hidden layer to obtain the key point information of the second image; and
inputting the first feature map into the second hidden layer to obtain the first confidence of the second image.
6. The method of claim 5, wherein, before the inputting the cropped second image into the third hidden layer, the method further comprises:
inputting a video stream sample into the first set model to obtain second confidences of at least two frames of the video stream sample;
determining a loss value based on the second confidences of the at least two image samples and the corresponding calibration results; and
updating the weight parameters of the second hidden layer of the first set model according to the determined loss value.
7. The method according to any one of claims 1 to 6, wherein at least one frame in the video stream carries hand key points.
8. A key point positioning device, comprising:
a first processing unit configured to crop a second image based on a first confidence of a first image, wherein the first confidence of an image characterizes the confidence that the image obtained by cropping that image contains the target object; and
a second processing unit configured to input the cropped second image into a first set model to obtain key point information of the second image and a first confidence of the second image, the first set model being used to identify the key point information and first confidence of an input image; wherein
the first image is the frame preceding the second image in a current video stream.
9. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor,
wherein the processor is adapted to perform the steps of the method of any one of claims 1 to 7 when running the computer program.
10. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202210194119.1A 2022-03-01 2022-03-01 Key point positioning method and device, electronic equipment and storage medium Pending CN114582016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194119.1A CN114582016A (en) 2022-03-01 2022-03-01 Key point positioning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210194119.1A CN114582016A (en) 2022-03-01 2022-03-01 Key point positioning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114582016A true CN114582016A (en) 2022-06-03

Family

ID=81772250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210194119.1A Pending CN114582016A (en) 2022-03-01 2022-03-01 Key point positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114582016A (en)

Similar Documents

Publication Publication Date Title
WO2022078041A1 (en) Occlusion detection model training method and facial image beautification method
JP2022532460A (en) Model training methods, equipment, terminals and programs
CN110837811A (en) Method, device and equipment for generating semantic segmentation network structure and storage medium
JP7286013B2 (en) Video content recognition method, apparatus, program and computer device
CN107944381B (en) Face tracking method, face tracking device, terminal and storage medium
CN110009662B (en) Face tracking method and device, electronic equipment and computer readable storage medium
CN109902588B (en) Gesture recognition method and device and computer readable storage medium
US20220207913A1 (en) Method and device for training multi-task recognition model and computer-readable storage medium
CN112465029A (en) Instance tracking method and device
CN111652181B (en) Target tracking method and device and electronic equipment
CN112200310B (en) Intelligent processor, data processing method and storage medium
CN112016548B (en) Cover picture display method and related device
CN110796115B (en) Image detection method and device, electronic equipment and readable storage medium
US20230021551A1 (en) Using training images and scaled training images to train an image segmentation model
CN114582016A (en) Key point positioning method and device, electronic equipment and storage medium
CN115830633A (en) Pedestrian re-identification method and system based on multitask learning residual error neural network
CN113361519B (en) Target processing method, training method of target processing model and device thereof
CN114998814A (en) Target video generation method and device, computer equipment and storage medium
CN108388886A (en) Method, apparatus, terminal and the computer readable storage medium of image scene identification
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN111260692A (en) Face tracking method, device, equipment and storage medium
CN113128277A (en) Generation method of face key point detection model and related equipment
CN117726907B (en) Training method of modeling model, three-dimensional human modeling method and device
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN113869529B (en) Method for generating challenge samples, model evaluation method, device and computer device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination