CN112784743B - Method and device for identifying key points and storage medium - Google Patents


Info

Publication number
CN112784743B
CN112784743B (application CN202110085505.2A)
Authority
CN
China
Prior art keywords
feature map
layer
convolution layer
reference feature
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110085505.2A
Other languages
Chinese (zh)
Other versions
CN112784743A (en)
Inventor
沈辉
王健
杨黔生
孙昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110085505.2A priority Critical patent/CN112784743B/en
Publication of CN112784743A publication Critical patent/CN112784743A/en
Application granted granted Critical
Publication of CN112784743B publication Critical patent/CN112784743B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm

Abstract

The invention discloses a method, a device and a storage medium for identifying key points, and relates to artificial intelligence technologies such as deep learning and computer vision. The specific implementation scheme is as follows: inputting a picture containing a target object into a preset convolutional neural network comprising M convolution layers, and obtaining an initial feature map output by each convolution layer, where M is a natural number greater than 2; acquiring a reference feature map corresponding to each convolution layer according to the initial feature maps; respectively acquiring a heatmap corresponding to each reference feature map, and acquiring a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map; and acquiring key point information of the target object according to all the candidate feature maps. The accuracy of key point recognition is thereby improved.

Description

Method and device for identifying key points and storage medium
Technical Field
The disclosure relates to the field of artificial intelligence technologies such as deep learning and computer vision, and in particular to a method, a device and a storage medium for identifying key points.
Background
Key point recognition is an important branch of computer vision; for example, gesture recognition is increasingly used in scenarios such as human-computer interaction and sign language translation.
In the related art, lightweight detection algorithms are mainly used to detect gesture key points in a picture. For example, an anchor-free detection scheme first predicts the object center point and category, simultaneously predicts the distances from the four sides of the object box to the center point, and finally screens out the target box according to the score. A human hand detection scheme based on geometric reasoning over human body key points first detects the human body key points, forms a vector pointing from the elbow to the wrist according to their positions, extends the vector by a factor of 1.5 to obtain the hand center point, estimates the size of the hand box from the magnitude of the vector, and finally locates the hand key points based on the hand box.
However, the anchor-free detection algorithm has a large model that cannot be deployed on low-end mobile devices, and a lightweight variant that can be deployed cannot effectively filter false detections in non-hand regions. The hand detection scheme based on geometric reasoning over human body key points may locate the wrong region when the human body key points are estimated incorrectly (for example, due to body overlap or occlusion), causing false detections.
Disclosure of Invention
The present disclosure provides a method, apparatus, and storage medium for identifying key points, to solve the problem of inaccurate key point localization.
According to an aspect of the present disclosure, there is provided a method for identifying key points, including: inputting a picture containing a target object into a preset convolutional neural network comprising M convolution layers, and obtaining an initial feature map output by each convolution layer, where M is a natural number greater than 2; acquiring a reference feature map corresponding to each convolution layer according to the initial feature maps; respectively acquiring a heatmap corresponding to each reference feature map, and acquiring a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map; and acquiring key point information of the target object according to all the candidate feature maps.
According to another aspect of the present disclosure, there is provided an apparatus for identifying key points, including: a first obtaining module, configured to input a picture containing a target object into a preset convolutional neural network comprising M convolution layers and acquire an initial feature map output by each convolution layer, where M is a natural number greater than 2; a second obtaining module, configured to acquire a reference feature map corresponding to each convolution layer according to the initial feature maps; a third obtaining module, configured to respectively acquire a heatmap corresponding to each reference feature map; a fourth obtaining module, configured to acquire a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map; and a fifth obtaining module, configured to acquire key point information of the target object according to all the candidate feature maps.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of identifying a keypoint of the above-described embodiment.
According to still another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of identifying a key point of the above-described embodiment.
According to a further aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of identifying key points of the above-described embodiment.
The embodiments of the invention have at least the following beneficial technical effects:
a picture containing a target object is input into a preset convolutional neural network comprising M convolution layers, and an initial feature map output by each convolution layer is obtained, where M is a natural number greater than 2; a reference feature map corresponding to each convolution layer is acquired according to the initial feature maps; a heatmap corresponding to each reference feature map is acquired, and a candidate feature map for each reference feature map is acquired according to the heatmap and the corresponding reference feature map; and key point information of the target object is then acquired according to all the candidate feature maps. In this way, details of objects of all sizes in the picture are retained, these details jointly localize the target object, and the accuracy of key point recognition is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of identifying keypoints according to a first embodiment of the disclosure;
fig. 2 is a schematic structural view of a convolutional neural network according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a heatmap scenario according to a third embodiment of the present disclosure;
fig. 4 is a schematic diagram of an acquisition flow of a reference feature map according to a fourth embodiment of the present disclosure;
fig. 5 is a schematic diagram of an upsampling scenario according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an identification scenario of a keypoint in accordance with a sixth embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of identifying keypoints according to a seventh embodiment of the disclosure;
FIG. 8 is a flow chart of a method of identifying keypoints according to an eighth embodiment of the disclosure;
fig. 9 is a block diagram of a structure of an identification device of a key point according to a ninth embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device for implementing a method of identifying keypoints of embodiments of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding and should be considered merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. To solve the technical problem of inaccurate key point localization described in the background art, the invention provides a scheme capable of accurately localizing key points.
The following describes a method, an apparatus and a storage medium for identifying key points according to an embodiment of the present invention with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method of identifying key points according to one embodiment of the invention. As shown in FIG. 1, the method comprises:
Step 101, inputting a picture containing a target object into a preset convolutional neural network comprising M convolution layers, and obtaining an initial feature map output by each convolution layer, where M is a natural number greater than 2.
The target object can be any object to be localized, such as a human hand or an animal.
It should be understood that, as shown in fig. 2, the preset convolutional neural network includes M convolution layers. Each convolution layer extracts a two-dimensional initial feature map; the size of the initial feature maps decreases layer by layer, and initial feature maps of different layers have different sizes, so different details of the picture are necessarily retained. In some possible embodiments, to facilitate the subsequent fusion process, the stride of the convolution layers may be preset, for example to 4.
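As an illustration of this step only, the following PyTorch sketch builds such a multi-layer backbone; the layer count, channel widths, and per-stage stride of 2 are readability assumptions, not the patent's fixed architecture:

```python
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    """Toy backbone with M = 4 convolution stages; each stage halves the
    spatial size, and its output is kept as that layer's initial feature map."""
    def __init__(self, in_ch=3, width=32, num_layers=4):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for i in range(num_layers):
            out_ch = width * (2 ** i)
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
            ch = out_ch

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # initial feature map of this layer
        return feats         # M maps whose sizes shrink layer by layer

# A 256x256 picture yields initial feature maps of sizes 128, 64, 32 and 16.
feats = MultiScaleBackbone()(torch.randn(1, 3, 256, 256))
```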
Step 102, obtaining a reference feature map corresponding to each convolution layer according to the initial feature map.
It should be understood that, since the initial feature map of each convolution layer retains different details of the picture, in this embodiment a corresponding reference feature map is acquired according to the initial feature map of each convolution layer, so that two-dimensional feature maps of different sizes from different layers all serve as references for locating the key points.
Step 103, respectively obtaining a heatmap corresponding to each reference feature map, and obtaining a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map.
It will be appreciated that a heatmap shows, in the form of weights, which part of the picture most strongly activates the neural network. For example, when the target object is a cat or a dog and the convolutional neural network judges the picture to contain a cat, its heatmap is generally concentrated on the cat; which part of the cat the judgment is based on is exactly what the heatmap reveals.
In one embodiment of the present invention, as shown in fig. 3, a convolution operation of a preset number of layers may be performed on each reference feature map to obtain a two-dimensional feature map corresponding to each convolution layer, and all the two-dimensional feature maps are then weighted and summed to obtain the heatmap corresponding to each reference feature map.
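A minimal sketch of such a heatmap branch, assuming a small three-layer conv stack and a learned 1×1 convolution standing in for the weighted summation:

```python
import torch
import torch.nn as nn

class HeatmapHead(nn.Module):
    """A few conv layers over a reference feature map, then a learned
    weighted sum across channels yielding a single-channel heatmap."""
    def __init__(self, in_ch, mid_ch=32, depth=3):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(depth):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
            ch = mid_ch
        self.convs = nn.Sequential(*layers)
        # a 1x1 conv is exactly a per-channel weighted summation
        self.weighted_sum = nn.Conv2d(mid_ch, 1, kernel_size=1, bias=False)

    def forward(self, ref_feat):
        return torch.sigmoid(self.weighted_sum(self.convs(ref_feat)))

heat = HeatmapHead(in_ch=64)(torch.randn(1, 64, 32, 32))  # -> (1, 1, 32, 32)
```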
In another embodiment of the present invention, consider that in deep learning a convolutional neural network is typically used for classification: the last layer is typically a softmax layer, and the maximum value corresponds to the predicted class. In this embodiment, back propagation starts from the node of the highest-probability class, a gradient is computed with respect to the last convolution layer, and a mean value is taken over each gradient feature map; finally, the activation values of the last convolution layer are taken out and multiplied by these per-channel gradient means. This multiplication of each channel's importance with its convolution activation is exactly a weighting operation.
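This gradient-based variant closely resembles Grad-CAM; a sketch under that reading (capturing the last conv layer's activations, e.g. via a forward hook, is assumed):

```python
import torch
import torch.nn.functional as F

def gradient_heatmap(last_conv_acts, logits):
    """Grad-CAM-style heatmap: back-propagate from the top class, average
    each gradient map into a per-channel weight, and weight the activations.
    last_conv_acts: (1, C, H, W) activations that require grad."""
    score = logits.max()                                    # top-class score
    grads = torch.autograd.grad(score, last_conv_acts)[0]   # d(score)/d(acts)
    weights = grads.mean(dim=(2, 3), keepdim=True)          # mean per gradient map
    return F.relu((weights * last_conv_acts).sum(dim=1))    # (1, H, W)
```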
Further, in this embodiment, the candidate feature map of each reference feature map is acquired according to the heatmap and the corresponding reference feature map. In some possible examples, with continued reference to fig. 3, the heatmap is point-wise multiplied with the corresponding reference feature map to obtain the candidate feature map of each reference feature map. It can be understood that in this embodiment the heatmap and the reference feature map are two feature maps of the same size, and the point-wise multiplication is computed as follows: the weights of the feature points at the same position in the two feature maps are obtained and multiplied to give the weight of the corresponding feature point in the candidate feature map; performing this multiplication for every feature point yields the corresponding candidate feature map.
In other possible embodiments, the heatmap is superimposed on the corresponding reference feature map to obtain the candidate feature map of each reference feature map. It can be understood that in this embodiment the heatmap and the reference feature map are two feature maps of the same resolution, and the superposition is computed as follows: the weights of the feature points at the same position in the two feature maps are obtained, added, and averaged to give the weight of the corresponding feature point in the candidate feature map; performing this superposition for every feature point yields the corresponding candidate feature map.
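Either fusion reduces to a broadcast tensor operation; a sketch (broadcasting a single-channel heatmap across all feature channels is an assumption about the layout):

```python
import torch

def fuse(ref_feat: torch.Tensor, heatmap: torch.Tensor, mode: str = "multiply"):
    """Builds a candidate feature map from a reference feature map and its
    heatmap. ref_feat: (N, C, H, W); heatmap: (N, 1, H, W), same spatial size;
    the heatmap broadcasts over the channel dimension."""
    if mode == "multiply":       # point-wise multiplication variant
        return ref_feat * heatmap
    if mode == "superimpose":    # add-then-average variant
        return (ref_feat + heatmap) / 2
    raise ValueError(mode)

cand = fuse(torch.randn(1, 64, 32, 32), torch.rand(1, 1, 32, 32))
```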
Step 104, acquiring key point information of the target object according to all the candidate feature maps.
It is easy to understand that each candidate feature map retains details of objects of various sizes in the picture; key point information of the target object is therefore acquired jointly from all the candidate feature maps, and the details of objects of various sizes mutually assist in confirming whether the target object is contained, which improves the accuracy of the key point information.
For example, when the target object is a human hand, this embodiment extracts the initial feature map output by each convolution layer for the picture; after the reference feature map corresponding to each convolution layer is acquired from the initial feature maps, the reference feature map is passed through several convolution layers to predict a heatmap of 21 human body key points, and the heatmap is then multiplied into the original reference feature map. This enhances the features around the human body key points and lets the features influence one another, helping to locate the position of the human hand more accurately. The human body key points can also serve as an important basis for whether a human body exists in the picture, and thus help to screen out false detections in which no person is present but a human hand is detected.
In summary, according to the method for identifying key points of the embodiment of the invention, a picture containing a target object is input into a preset convolutional neural network comprising M convolution layers and an initial feature map output by each convolution layer is obtained, where M is a natural number greater than 2; a reference feature map corresponding to each convolution layer is acquired according to the initial feature maps; a heatmap corresponding to each reference feature map is acquired, and a candidate feature map for each reference feature map is acquired according to the heatmap and the corresponding reference feature map; and key point information of the target object is then acquired according to all the candidate feature maps. In this way, details of objects of all sizes in the picture are retained, these details jointly localize the target object, and the accuracy of key point recognition is improved.
In connection with the above description, every convolution layer in this embodiment participates in localization, so the localization logic of the invention can effectively be seen as a multi-layer detector, a design that many detection algorithms have shown to be clearly beneficial for detecting objects of different scales. Existing anchor-free detection algorithms, however, especially on the mobile end, perform object detection and classification prediction on a single-layer two-dimensional feature map (typically the layer with stride 4). Putting objects of different scales onto the same prediction layer means the output layer's convolution cannot adaptively cover the feature regions of different objects: the features of small objects then include many background features, while the features of large objects may be truncated. Therefore, on top of an anchor-free detection algorithm, the invention adds multi-layer prediction components and predicts target objects of different scales with two-dimensional feature maps of different layers.
Under different application scenarios, the manner of acquiring the reference feature map corresponding to each convolution layer based on this multi-layer detection logic differs:
first kind:
In this example, referring to fig. 4, assume the network has M convolution layers. The initial feature map of the Mth convolution layer is taken as its corresponding reference feature map. When the convolution layer is the M-1th convolution layer, the first size information of the initial feature map of the M-1th convolution layer is obtained. Parameters such as the number of convolution kernels of each convolution layer may be adjusted in advance, so that the feature map of each layer has a size that is convenient to compute, for example 64×64, 32×32, 8×8 or 4×4.
The initial feature map of the Mth convolution layer is upsampled according to the first size information to obtain a first sampled feature map; the first sampled feature map is necessarily larger than the initial feature map of the Mth convolution layer and has the same size as the initial feature map of the M-1th convolution layer.
Further, the first sampled feature map and the initial feature map of the M-1th convolution layer are fused to obtain the corresponding reference feature map. The fusion may be a weighting of the feature points at the same position, or a summation or multiplication of the feature values of the feature points at the same position.
In this embodiment, with continued reference to fig. 4, when the convolution layer is the nth convolution layer, where n is greater than 2 and less than M-1, the reference feature map of the n+1th convolution layer is obtained, and the second size information of the initial feature map of the nth convolution layer is obtained; the second size information is necessarily larger than the first size information.
Further, the reference feature map of the n+1th convolution layer is upsampled according to the second size information to obtain a second sampled feature map, and the second sampled feature map and the initial feature map of the nth convolution layer are fused to obtain the corresponding reference feature map. The fusion may again be a weighting of the feature points at the same position, or a summation or multiplication of the feature values of the feature points at the same position.
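Taken together, the reference maps can be built top-down; a sketch assuming additive fusion, bilinear resizing, and a shared channel count across layers (e.g. via 1×1 lateral convolutions, not shown):

```python
import torch
import torch.nn.functional as F

def build_reference_maps(initial_feats):
    """initial_feats: list of M initial feature maps ordered shallow -> deep,
    all with the same channel count. Returns one reference map per layer."""
    refs = [None] * len(initial_feats)
    refs[-1] = initial_feats[-1]                     # layer M: reference = initial
    for i in range(len(initial_feats) - 2, -1, -1):  # layers M-1 down to 1
        size = initial_feats[i].shape[-2:]           # this layer's size information
        up = F.interpolate(refs[i + 1], size=size,   # upsample the deeper reference
                           mode="bilinear", align_corners=False)
        refs[i] = up + initial_feats[i]              # fuse with the initial map
    return refs

# Four same-channel maps with shrinking sizes, as a backbone would produce.
feats = [torch.randn(1, 64, s, s) for s in (64, 32, 16, 8)]
refs = build_reference_maps(feats)
```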
In this embodiment, when upsampling a feature map of a convolution layer according to the relevant size information, different scenes allow different processing manners. The following description takes the size information as the first size information and the convolution layer as the Mth convolution layer as an example:
example one:
In this example, parameters such as the number of convolution kernels of a transposed convolutional neural network are determined according to the first size information, so that after the transposed convolutional neural network processes the initial feature map of the Mth convolution layer, a feature map with the same size as the first size information is obtained. Therefore, in this embodiment the initial feature map of the Mth convolution layer is input into the transposed convolutional neural network adjusted according to the first size information, and the first sampled feature map output by the transposed convolutional neural network is obtained.
Example two:
In this embodiment, the initial feature map of the Mth convolution layer is upsampled based on an upsampling algorithm and the first size information to obtain the first sampled feature map.
Example three:
In this example, referring to fig. 5, parameters such as the number of convolution kernels of the transposed convolutional neural network are determined according to the first size information, so that after the transposed convolutional neural network processes the initial feature map of the Mth convolution layer, a feature map with the same size as the first size information is obtained. In this embodiment, the initial feature map of the Mth convolution layer is therefore input into the transposed convolutional neural network ("deconv" in the figure) adjusted according to the first size information, and a first candidate sampled feature map output by the transposed convolutional neural network is obtained. Meanwhile, the initial feature map of the Mth convolution layer is upsampled ("resize" in the figure) based on the upsampling algorithm and the first size information to obtain a second candidate sampled feature map.
Further, the first candidate sampled feature map and the second candidate sampled feature map are combined to obtain the first sampled feature map; for example, the feature values of the feature points at the same position in the two candidate maps are weighted and averaged, thereby combining the first candidate sampled feature map and the second candidate sampled feature map.
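Example three thus combines a learned and an interpolated upsampling path; a sketch with equal-weight averaging (the text only requires some weighted average):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualUpsample(nn.Module):
    """Upsample once with a transposed convolution and once with
    interpolation, then average the two candidate sampled feature maps."""
    def __init__(self, channels):
        super().__init__()
        # a stride-2 transposed conv doubles the spatial size ("deconv")
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=4,
                                         stride=2, padding=1)

    def forward(self, x):
        cand1 = self.deconv(x)                             # learned upsampling
        cand2 = F.interpolate(x, scale_factor=2,           # "resize"
                              mode="bilinear", align_corners=False)
        return (cand1 + cand2) / 2                         # combined sampled map

up = DualUpsample(64)(torch.randn(1, 64, 16, 16))  # -> (1, 64, 32, 32)
```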
Second kind:
In this example, in order to further emphasize the semantics of the feature map output by each convolution layer, the weight of each feature point in each initial feature map is obtained; if a weight is greater than a certain value, a fixed value is added to it, while feature points whose weight is less than that value are left unprocessed. A reference feature map corresponding to each initial feature map is thereby obtained.
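This second kind amounts to a thresholded boost; a sketch in which both the threshold and the fixed increment are illustrative assumptions:

```python
import torch

def emphasize(initial_feat: torch.Tensor, threshold: float = 0.5,
              boost: float = 0.1) -> torch.Tensor:
    """Adds a fixed value to feature points whose weight exceeds the
    threshold and leaves the remaining feature points untouched."""
    return torch.where(initial_feat > threshold,
                       initial_feat + boost, initial_feat)
```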
Further, according to the above embodiment, after the reference feature maps are obtained, the heatmap superposition calculation may be performed for the reference feature map corresponding to each convolution layer. The heatmap superposition calculation in this embodiment means obtaining the heatmap corresponding to each reference feature map and obtaining the candidate feature map of each reference feature map according to the heatmap and the corresponding reference feature map. For the specific process refer to fig. 6, where the upsampling process is as in fig. 5 and "kpt attn" denotes the heatmap superposition calculation.
In summary, the method for identifying key points of the embodiment of the invention can, according to scenario requirements, flexibly mine the initial feature map of each convolution layer in different ways to obtain the corresponding reference feature maps; the reference feature maps further emphasize the details of objects of various sizes, thereby further improving the accuracy of key point localization.
Based on the above embodiment, under different application scenarios the manner of acquiring the key point information of the target object according to all the candidate feature maps also differs; examples are as follows:
example one:
In this example, as shown in fig. 7, acquiring key point information of the target object according to all candidate feature maps includes:
step 701, inputting each candidate feature map into a preset convolutional neural network, so as to determine whether each candidate feature map contains a target object according to an output result of the preset convolutional neural network.
In this embodiment, the preset convolutional neural network is trained in advance on a large amount of sample data. Each candidate feature map is input into the preset convolutional neural network, and whether each candidate feature map contains the target object is determined according to its output result: the network outputs a probability that the target object is contained, and a candidate feature map whose probability value is greater than a preset value is considered to contain the target object.
Step 702, determining the candidate feature maps containing the target object, and fusing all candidate feature maps containing the target object to obtain a first target feature map.
In this embodiment, all candidate feature maps containing the target object are fused to obtain the first target feature map; for example, the feature values of the feature points at the same position across all candidate feature maps containing the target object are averaged to obtain the first target feature map.
Step 703, acquiring key point information of the target object according to the first target feature map.
In this embodiment, the first target feature map may be input into a preset deep learning network to obtain the key point information of the target object, where the key point information includes the coordinate information of the key points of the target object, for example the coordinate values of the 21 key points of a human hand.
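Example one, end to end, might look as follows; the presence classifier, the 0.5 threshold, plain averaging as the fusion, and a per-keypoint heatmap head are all assumptions consistent with the steps above:

```python
import torch
import torch.nn as nn

class ExampleOneHead(nn.Module):
    """Score each candidate map for target presence, fuse the positive ones
    by averaging, then predict per-keypoint heatmaps from the fused map.
    Assumes all candidate maps share the same channels and spatial size."""
    def __init__(self, channels, num_kpts=21, thresh=0.5):
        super().__init__()
        self.thresh = thresh
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid())         # P(target present)
        self.kpt_head = nn.Conv2d(channels, num_kpts, 1)  # keypoint heatmaps

    def forward(self, candidates):
        keep = [c for c in candidates if self.classifier(c).item() > self.thresh]
        if not keep:
            return None                                   # no target detected
        fused = torch.stack(keep).mean(dim=0)             # first target feature map
        return self.kpt_head(fused)                       # argmax -> coordinates
```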
Example two:
In this example, as shown in fig. 8, acquiring key point information of the target object according to all candidate feature maps includes:
step 801, fusing all candidate feature graphs, and obtaining a second target feature graph.
In some possible embodiments, the feature values of the feature points at the same position across all candidate feature maps are averaged to obtain the second target feature map.
Step 802, inputting the second target feature map into a preset convolutional neural network to obtain key point information of the target object.
In this embodiment, the second target feature map is input into a preset convolutional neural network to obtain the key point information of the target object, where the key point information includes coordinate information of the key points of the target object, for example, coordinate values of 21 key points of a human hand, and the like.
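Example two skips the per-map classification; a sketch reusing a keypoint head like the one above:

```python
import torch
import torch.nn as nn

def example_two(candidates, kpt_net: nn.Module):
    """Fuses every candidate map by plain averaging (an assumed fusion) and
    lets a preset network predict the key point information."""
    fused = torch.stack(candidates).mean(dim=0)  # second target feature map
    return kpt_net(fused)                        # e.g. per-keypoint heatmaps
```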
In summary, the method for identifying key points of the embodiment of the invention fuses the candidate feature maps corresponding to the multiple convolution layers to acquire the key point information of the target object, thereby improving the accuracy of key point localization.
In order to implement the above embodiments, the invention further provides a device for identifying key points. Fig. 9 is a schematic structural diagram of a device for identifying key points according to an embodiment of the invention. As shown in fig. 9, the device includes: a first obtaining module 910, a second obtaining module 920, a third obtaining module 930, a fourth obtaining module 940 and a fifth obtaining module 950, where:
a first obtaining module 910, configured to input a picture including a target object into a preset convolutional neural network including M convolutional layers, and obtain an initial feature map output by each convolutional layer, where M is a natural number greater than 2;
a second obtaining module 920, configured to obtain a reference feature map corresponding to each convolution layer according to the initial feature map;
a third obtaining module 930, configured to respectively acquire a heatmap corresponding to each reference feature map;
a fourth obtaining module 940, configured to acquire a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map;
a fifth obtaining module 950, configured to obtain key point information of the target object according to all the candidate feature maps.
In one embodiment of the present invention, the third obtaining module 930 is specifically configured to:
carrying out convolution operation of a preset layer number on each reference feature map to obtain a two-dimensional feature map corresponding to each convolution layer;
and carrying out weighted summation on all the two-dimensional feature maps to obtain the heatmap corresponding to each reference feature map.
In one embodiment of the present invention, the fourth obtaining module 940 is specifically configured to:
point-wise multiplying the heatmap with the corresponding reference feature map to obtain the candidate feature map of each reference feature map.
It should be noted that the foregoing explanation of the method for identifying the key points is also applicable to the device for identifying the key points in the embodiment of the present invention, and the implementation principle is similar and will not be repeated here.
In summary, the device for identifying key points of the embodiment of the invention inputs a picture containing a target object into a preset convolutional neural network comprising M convolution layers and acquires the initial feature map output by each convolution layer; acquires a reference feature map corresponding to each convolution layer according to the initial feature maps; respectively acquires a heatmap corresponding to each reference feature map; acquires a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map; and then acquires key point information of the target object according to all the candidate feature maps. In this way, details of objects of all sizes in the picture are retained, these details jointly localize the target object, and the accuracy of key point recognition is improved.
In one embodiment of the present invention, when the convolution layer is the Mth convolution layer, the second obtaining module 920 is specifically configured to: taking the initial feature map of the Mth convolution layer as the corresponding reference feature map.
In one embodiment of the present invention, when the convolution layer is the M-1th convolution layer, the second obtaining module 920 is specifically configured to: acquiring first size information of the initial feature map of the M-1th convolution layer;
up-sampling the initial feature map of the Mth convolution layer according to the first size information to obtain a first sampled feature map;
and fusing the first sampled feature map and the initial feature map of the M-1th convolution layer to obtain the corresponding reference feature map.
In one embodiment of the present invention, when the convolution layer is an nth layer convolution layer, where n is greater than 2 and less than M-1, the second obtaining module 920 is specifically configured to:
acquiring a reference feature map of the n+1th convolution layer, and acquiring second size information of the initial feature map of the nth convolution layer;
upsampling the reference feature map of the n+1th convolution layer according to the second size information to obtain a second sampled feature map;
and fusing the second sampled feature map and the initial feature map of the nth convolution layer to obtain the corresponding reference feature map.
It should be noted that the foregoing explanation of the method for identifying the key points is also applicable to the device for identifying the key points in the embodiment of the present invention, and the implementation principle is similar and will not be repeated here.
In summary, the device for identifying key points of the embodiment of the invention can, according to scenario requirements, flexibly mine the initial feature map of each convolution layer in different ways to obtain the corresponding reference feature maps; the reference feature maps further emphasize the details of objects of various sizes, thereby further improving the accuracy of key point localization.
In one embodiment of the present invention, the fifth obtaining module 950 is specifically configured to:
inputting each candidate feature map into a preset convolutional neural network so as to determine whether each candidate feature map contains a target object according to an output result of the preset convolutional neural network;
determining the candidate feature maps containing the target object, and fusing all the candidate feature maps containing the target object to obtain a first target feature map;
and acquiring key point information of the target object according to the first target feature map.
In one embodiment of the present invention, the fifth obtaining module 950 is specifically configured to:
fusing all the candidate feature maps to obtain a second target feature map;
and inputting the second target feature map into a preset convolutional neural network to acquire key point information of the target object.
It should be noted that the foregoing explanation of the method for identifying the key points is also applicable to the device for identifying the key points in the embodiment of the present invention, and the implementation principle is similar and will not be repeated here.
In summary, the device for identifying key points provided by the embodiment of the invention fuses the candidate feature maps corresponding to the multiple convolution layers to acquire the key point information of the target object, thereby improving the accuracy of key point localization.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as the method of identifying key points. For example, in some embodiments, the method of identifying key points may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the method of identifying key points described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the method of identifying key points by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service expansibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above-described method of identifying keypoints.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A method of identifying keypoints, comprising:
inputting a picture containing a target object into a preset convolutional neural network comprising M layers of convolutional layers, and obtaining an initial feature map output by each layer of convolutional layers, wherein M is a natural number greater than 2;
acquiring a reference feature map corresponding to each convolution layer according to the initial feature map;
respectively obtaining a heatmap corresponding to each reference feature map, and obtaining a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map;
acquiring key point information of the target object according to all the candidate feature images;
when the convolution layer is the Mth convolution layer, the obtaining, according to the initial feature map, a reference feature map corresponding to each convolution layer includes:
taking the initial feature map of the Mth convolution layer as the corresponding reference feature map;
when the convolution layer is the M-1th convolution layer, the obtaining, according to the initial feature map, a reference feature map corresponding to each convolution layer includes:
acquiring first size information of an initial feature map of the M-1th convolution layer;
up-sampling the initial feature map of the Mth convolution layer according to the first size information to obtain a first sampled feature map;
fusing the first sampled feature map and the initial feature map of the M-1th convolution layer to obtain the corresponding reference feature map;
when the convolution layer is the nth convolution layer, where n is greater than 2 and less than M-1, the obtaining, according to the initial feature map, a reference feature map corresponding to each convolution layer includes:
acquiring a reference feature map of the n+1th convolution layer, and acquiring second size information of an initial feature map of the nth convolution layer;
upsampling the reference feature map of the n+1th convolution layer according to the second size information to obtain a second sampled feature map;
and fusing the second sampled feature map and the initial feature map of the nth convolution layer to obtain the corresponding reference feature map.
2. The method of claim 1, wherein the respectively obtaining a heatmap corresponding to each of the reference feature maps comprises:
carrying out convolution operation of a preset layer number on each reference feature map to obtain a two-dimensional feature map corresponding to each convolution layer;
and carrying out weighted summation on all the two-dimensional feature maps to obtain the heatmap corresponding to each reference feature map.
3. The method of claim 1, wherein the obtaining a candidate feature map for each of the reference feature maps according to the heatmap and the corresponding reference feature map comprises:
point-wise multiplying the heatmap with the corresponding reference feature map to obtain the candidate feature map of each reference feature map.
4. The method of claim 1, wherein the obtaining keypoint information of the target object from all of the candidate feature maps comprises:
inputting each candidate feature map into a preset convolutional neural network, and determining whether each candidate feature map contains the target object according to an output result of the preset convolutional neural network;
determining the candidate feature maps containing the target object, and fusing all the candidate feature maps containing the target object to obtain a first target feature map;
and acquiring key point information of the target object according to the first target feature map.
5. The method of claim 1, wherein the obtaining keypoint information of the target object from all of the candidate feature maps comprises:
fusing all the candidate feature maps to obtain a second target feature map;
and inputting the second target feature map into a preset convolutional neural network to acquire key point information of the target object.
6. A key point identification device, comprising:
the first acquisition module is used for inputting a picture containing a target object into a preset convolutional neural network comprising M layers of convolutional layers, and acquiring an initial characteristic diagram output by each layer of convolutional layers, wherein M is a natural number greater than 2;
the second acquisition module is used for acquiring a reference feature map corresponding to each convolution layer according to the initial feature map;
the third acquisition module is used for respectively acquiring a heatmap corresponding to each reference feature map;
a fourth obtaining module, configured to obtain a candidate feature map for each reference feature map according to the heatmap and the corresponding reference feature map;
a fifth obtaining module, configured to obtain key point information of the target object according to all the candidate feature maps;
when the convolution layer is the Mth convolution layer, the second obtaining module is specifically configured to:
taking the initial feature map of the Mth convolution layer as the corresponding reference feature map;
when the convolution layer is the M-1th convolution layer, the second obtaining module is specifically configured to:
acquiring first size information of an initial feature map of the M-1th convolution layer;
up-sampling the initial feature map of the Mth convolution layer according to the first size information to obtain a first sampled feature map;
fusing the first sampled feature map and the initial feature map of the M-1th convolution layer to obtain the corresponding reference feature map;
when the convolution layer is the nth convolution layer, where n is greater than 2 and less than M-1, the second obtaining module is specifically configured to:
acquiring a reference feature map of the n+1th convolution layer, and acquiring second size information of an initial feature map of the nth convolution layer;
upsampling the reference feature map of the n+1th convolution layer according to the second size information to obtain a second sampled feature map;
and fusing the second sampled feature map and the initial feature map of the nth convolution layer to obtain the corresponding reference feature map.
7. The apparatus of claim 6, wherein the third acquisition module is specifically configured to:
carrying out convolution operation of a preset layer number on each reference feature map to obtain a two-dimensional feature map corresponding to each convolution layer;
and carrying out weighted summation on all the two-dimensional feature maps to obtain the heatmap corresponding to each reference feature map.
8. The apparatus of claim 6, wherein the fourth obtaining module is specifically configured to:
point-wise multiply each heat map by the corresponding reference feature map to obtain the candidate feature map of each reference feature map.
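This step is a plain element-wise (Hadamard) product broadcast across channels; a one-function sketch, where the bilinear resize is an assumption for the case that the heat map and the reference map differ in resolution:

import torch.nn.functional as F

def candidate_map(heat_map, reference_map):
    # heat_map: (1, 1, h, w); reference_map: (1, C, H, W).
    h = F.interpolate(heat_map, size=reference_map.shape[-2:],
                      mode="bilinear", align_corners=False)
    # The single-channel heat map broadcasts over all C channels.
    return reference_map * h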
9. The apparatus of claim 6, wherein the fifth obtaining module is specifically configured to:
input each candidate feature map into a preset convolutional neural network, and determine whether each candidate feature map contains the target object according to an output result of the preset convolutional neural network;
determine the candidate feature maps containing the target object, and fuse all the candidate feature maps containing the target object to obtain a first target feature map;
and acquire the key point information of the target object according to the first target feature map.
10. The apparatus of claim 6, wherein the fifth obtaining module is specifically configured to:
fuse all the candidate feature maps to obtain a second target feature map;
and input the second target feature map into a preset convolutional neural network to acquire the key point information of the target object.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for identifying key points according to any one of claims 1 to 5.
12. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used for causing a computer to perform the method for identifying key points according to any one of claims 1 to 5.
CN202110085505.2A 2021-01-21 2021-01-21 Method and device for identifying key points and storage medium Active CN112784743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110085505.2A CN112784743B (en) 2021-01-21 2021-01-21 Method and device for identifying key points and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110085505.2A CN112784743B (en) 2021-01-21 2021-01-21 Method and device for identifying key points and storage medium

Publications (2)

Publication Number Publication Date
CN112784743A CN112784743A (en) 2021-05-11
CN112784743B (en) 2023-08-04

Family

ID=75758458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110085505.2A Active CN112784743B (en) 2021-01-21 2021-01-21 Method and device for identifying key points and storage medium

Country Status (1)

Country Link
CN (1) CN112784743B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332509B (en) * 2021-12-29 2023-03-24 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705365A (en) * 2019-09-06 2020-01-17 北京达佳互联信息技术有限公司 Human body key point detection method and device, electronic equipment and storage medium
CN111126379A (en) * 2019-11-22 2020-05-08 苏州浪潮智能科技有限公司 Target detection method and device
CA3032983A1 (en) * 2019-02-06 2020-08-06 Thanh Phuoc Hong Systems and methods for keypoint detection
CN111695519A (en) * 2020-06-12 2020-09-22 北京百度网讯科技有限公司 Key point positioning method, device, equipment and storage medium
CN111860276A (en) * 2020-07-14 2020-10-30 咪咕文化科技有限公司 Human body key point detection method, device, network equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229509B (en) * 2016-12-16 2021-02-26 北京市商汤科技开发有限公司 Method and device for identifying object class and electronic equipment

Also Published As

Publication number Publication date
CN112784743A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
JP6902611B2 (en) Object detection methods, neural network training methods, equipment and electronics
US20210158533A1 (en) Image processing method and apparatus, and storage medium
US20190311223A1 (en) Image processing methods and apparatus, and electronic devices
CN109344789B (en) Face tracking method and device
CN112800915B (en) Building change detection method, device, electronic equipment and storage medium
WO2022213718A1 (en) Sample image increment method, image detection model training method, and image detection method
US11669990B2 (en) Object area measurement method, electronic device and storage medium
CN113705362B (en) Training method and device of image detection model, electronic equipment and storage medium
CN113378712B (en) Training method of object detection model, image detection method and device thereof
CN113239928A (en) Method, apparatus and program product for image difference detection and model training
CN113947188A (en) Training method of target detection network and vehicle detection method
US20230245429A1 (en) Method and apparatus for training lane line detection model, electronic device and storage medium
CN112784743B (en) Method and device for identifying key points and storage medium
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN113837257A (en) Target detection method and device
CN113627298A (en) Training method of target detection model and method and device for detecting target object
CN112488126A (en) Feature map processing method, device, equipment and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN114821777A (en) Gesture detection method, device, equipment and storage medium
CN114220163A (en) Human body posture estimation method and device, electronic equipment and storage medium
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN113205092A (en) Text detection method, device, equipment and storage medium
CN115456167B (en) Lightweight model training method, image processing device and electronic equipment
CN116453221B (en) Target object posture determining method, training device and storage medium
CN114580631B (en) Model training method, smoke and fire detection method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant