CN113838134A - Image key point detection method, device, terminal and storage medium - Google Patents


Publication number
CN113838134A
CN113838134A
Authority
CN
China
Prior art keywords
coordinates
image
feature map
training
preset
Prior art date
Legal status
Granted
Application number
CN202111131548.6A
Other languages
Chinese (zh)
Other versions
CN113838134B (en)
Inventor
吴家贤
Current Assignee
Guangzhou Boguan Information Technology Co Ltd
Original Assignee
Guangzhou Boguan Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Boguan Information Technology Co Ltd
Priority to CN202111131548.6A
Publication of CN113838134A
Application granted
Publication of CN113838134B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The embodiments of the application disclose an image key point detection method, device, terminal and storage medium. The method includes: acquiring a target image and a preset feature map; performing convolution processing on the target image to obtain an output matrix, where the output matrix consists of a plurality of matrix elements and the matrix elements correspond one-to-one to the feature-map pixels in the preset feature map; weighting the coordinates of the feature-map pixels according to the matrix elements to obtain the weighted pixel coordinates of the feature-map pixels corresponding to the matrix elements; summing the weighted pixel coordinates to obtain the coordinates of the key point in the preset feature map; and determining the coordinates to which the key point in the preset feature map is mapped in the target image. In this way, the key-point coordinates of the target image are obtained with only a small amount of computation, improving both the efficiency and the accuracy of obtaining key-point coordinates in the target image.

Description

Image key point detection method, device, terminal and storage medium
Technical Field
The application relates to the field of computers, in particular to an image key point detection method, an image key point detection device, a terminal and a storage medium.
Background
In recent years, computer vision tasks have used the Gaussian heat map method and the direct regression method to detect key points in an image and obtain the coordinates corresponding to those key points. The Gaussian heat map method outputs a feature map through a convolutional neural network, treats the position with the maximum value on the feature map as the key-point position, and applies argmax over the feature map to obtain the key-point coordinates. The direct regression method directly outputs the required coordinate values through a fully connected layer.
However, current methods for detecting key-point coordinates in an image are computationally complex, which leads to low efficiency in obtaining the coordinates and large errors in the coordinates obtained.
Disclosure of Invention
The embodiments of the application provide an image key point detection method, an image key point detection device, a terminal and a storage medium, which can improve the efficiency and precision of obtaining the coordinates of key points in the feature map corresponding to a target image.
The embodiment of the application provides an image key point detection method, which comprises the following steps:
acquiring a target image and a preset feature map;
performing convolution processing on the target image to obtain an output matrix, where the output matrix consists of a plurality of matrix elements and the matrix elements correspond one-to-one to the feature-map pixels in the preset feature map;
weighting the coordinates of the feature-map pixels according to the matrix elements to obtain weighted pixel coordinates of the feature-map pixels corresponding to the matrix elements;
summing the weighted pixel coordinates to obtain the coordinates of the key point in the preset feature map;
and determining the coordinates to which the key point in the preset feature map is mapped in the target image.
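The steps above can be sketched as a weighted summation over a coordinate grid. This is a minimal illustration only; the softmax normalization and the function name are assumptions not stated in the claims, which merely require matrix elements usable as weights:

```python
import numpy as np

def soft_argmax_keypoint(output_matrix):
    """Weighted-sum key point localization over a feature map.

    output_matrix: 2-D array from the convolution step; each element
    corresponds one-to-one to a feature-map pixel.
    Returns the (x, y) key-point coordinates in the feature map.
    """
    h, w = output_matrix.shape
    # Normalize the matrix elements into weights that sum to 1
    # (softmax is an assumption; any normalized weights would do).
    weights = np.exp(output_matrix - output_matrix.max())
    weights /= weights.sum()
    # Coordinate grids giving every feature-map pixel its (x, y) position.
    ys, xs = np.mgrid[0:h, 0:w]
    # Weight each pixel's coordinates, then sum the weighted coordinates.
    x = float((weights * xs).sum())
    y = float((weights * ys).sum())
    return x, y
```

A matrix strongly peaked at one pixel yields coordinates close to that pixel; a broader response yields a sub-pixel weighted average, which is how the method retains spatial information without an argmax step.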
The embodiment of the present application further provides an image key point detection device, including:
an acquisition unit, configured to acquire a target image and a preset feature map;
an output matrix unit, configured to perform convolution processing on the target image to obtain an output matrix, where the output matrix consists of a plurality of matrix elements and the matrix elements correspond one-to-one to the feature-map pixels in the preset feature map;
a coordinate weighting unit, configured to weight the coordinates of the feature-map pixels according to the matrix elements to obtain weighted pixel coordinates of the feature-map pixels corresponding to the matrix elements;
a coordinate determination unit, configured to sum the weighted pixel coordinates to obtain the coordinates of the key point in the preset feature map;
and a coordinate mapping unit, configured to determine the coordinates to which the key point in the preset feature map is mapped in the target image.
In some embodiments, determining the coordinates to which the key point in the preset feature map is mapped in the target image includes:
acquiring the resolution ratio between the preset feature map and the target image;
and scaling up the coordinates of the key point according to the resolution ratio to obtain the coordinates to which the key point is mapped in the target image.
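The resolution-ratio scaling described above amounts to a per-axis multiplication. The sketch below assumes independent horizontal and vertical ratios; the function and parameter names are illustrative, not from the patent:

```python
def map_to_target(keypoint_xy, feature_size, image_size):
    """Scale feature-map key-point coordinates up to target-image
    coordinates using the resolution ratio between the two maps."""
    fx, fy = keypoint_xy
    fw, fh = feature_size   # (width, height) of the preset feature map
    iw, ih = image_size     # (width, height) of the target image
    return fx * (iw / fw), fy * (ih / fh)
```

For example, a key point at (3, 2) on a 5 x 5 feature map corresponding to a 20 x 20 target image maps to (12, 8).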
In some embodiments, the output matrix unit is configured to:
acquiring color parameters corresponding to each image pixel in a target image;
and carrying out convolution processing on the color parameters to obtain matrix elements.
In some embodiments, before acquiring the target image and the preset feature map, the method includes:
acquiring a plurality of training data sets and a key point detection network, wherein the key point detection network is used for predicting coordinates of key points in images, the training data sets are composed of a plurality of training images, and the labels of the training images are the coordinates of points to be processed in the training images;
training the key point detection network by using a plurality of training data sets until the key point detection network is converged to obtain a trained key point detection network;
Performing convolution processing on the target image to obtain the output matrix then includes:
performing convolution processing on the target image with the trained key point detection network to obtain the output matrix.
In some embodiments, training a keypoint detection network with a plurality of training data sets comprises:
acquiring a preset training characteristic diagram, wherein the preset training characteristic diagram is annotated with real coordinates of key points, and the real coordinates of the key points are coordinates of points to be processed of a training image, which are mapped on the preset training characteristic diagram;
determining a new training feature map corresponding to the preset training feature map based on the real coordinates of the key points of the preset training feature map;
summing the coordinates of all pixels in the new training feature map to obtain the predicted coordinates of the key points corresponding to the new training feature map;
and determining loss parameters of the key point detection network by adopting the real coordinates of the key points of the preset training characteristic diagram and the predicted coordinates of the key points corresponding to the new training characteristic diagram, and training the key point detection network based on the loss parameters.
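The loss parameter above is computed from the real and predicted key-point coordinates. The patent does not fix the loss form, so an L1 distance is assumed in this sketch, and the function name is illustrative:

```python
def keypoint_loss(real_xy, pred_xy):
    """L1 loss between real and predicted key-point coordinates
    (an assumption; the embodiment only requires some loss parameter)."""
    return abs(real_xy[0] - pred_xy[0]) + abs(real_xy[1] - pred_xy[1])
```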
In some embodiments, determining a new training feature map corresponding to the preset training feature map based on the real coordinates of the key points of the preset training feature map includes:
determining a candidate region corresponding to the key points, wherein the candidate region comprises the key points;
determining the weight of one diagonal vertex in the diagonal vertex set according to the coordinates of the other diagonal vertex on the same diagonal and the real coordinates of the key point, where the diagonal vertices are the two vertices located on the same diagonal of the candidate region, and the vertices are pixels;
according to the weight, carrying out weighting processing on the coordinates of the pixels in the candidate area to obtain weighted coordinates;
and replacing the coordinates of the pixels in the corresponding candidate region before weighting processing by adopting the weighted coordinates of the pixels in the candidate region to obtain a new training feature map.
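The construction above is bilinear: each of the four candidate-region vertices receives a weight determined by the diagonally opposite vertex and the real key point, and summing the weighted vertex coordinates reproduces the key point exactly. A sketch follows; the function name and dictionary layout are illustrative:

```python
import math

def weighted_candidate_region(x, y):
    """Build weighted coordinates for the 2x2 candidate region around the
    real key point (x, y), as in the described training-target step."""
    xf, yf = math.floor(x), math.floor(y)   # rounded-down coordinates
    xc, yc = xf + 1, yf + 1                 # expanded by one unit
    weighted = {}
    for px, py in [(xf, yf), (xc, yf), (xf, yc), (xc, yc)]:
        # A vertex's weight is the product of coordinate differences
        # between the *opposite* diagonal vertex and the key point.
        w = abs((xf + xc - px) - x) * abs((yf + yc - py) - y)
        weighted[(px, py)] = (w * px, w * py)
    # Summing the weighted pixel coordinates recovers the key point.
    sx = sum(wx for wx, _ in weighted.values())
    sy = sum(wy for _, wy in weighted.values())
    return weighted, (sx, sy)
```

This identity (weighted vertex coordinates summing back to the key point) is what lets the predicted coordinates in the later summation step match the real coordinates when the network output is correct.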
In some embodiments, acquiring the preset training feature map, where the preset training feature map is annotated with the real coordinates of the key points, the real coordinates being the coordinates to which the to-be-processed points of the training image are mapped on the preset training feature map, includes:
acquiring coordinates and image resolution of a point to be processed in a target training image;
normalizing the coordinates of the points to be processed according to the image resolution of the target training image to obtain normalized coordinates of the points to be processed;
acquiring the resolution of a preset training characteristic diagram;
and calculating, from the resolution of the preset training feature map and the normalized coordinates, the coordinates to which the to-be-processed point is mapped on the preset training feature map, where the real coordinates annotated on the preset training feature map include these mapped coordinates.
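The mapping above normalizes by the training-image resolution and rescales by the feature-map resolution. A sketch with illustrative names:

```python
def map_to_feature_map(point_xy, image_size, feature_size):
    """Map a to-be-processed point from the training image onto the
    preset training feature map via normalized coordinates."""
    px, py = point_xy
    iw, ih = image_size        # training-image resolution (width, height)
    fw, fh = feature_size      # feature-map resolution (width, height)
    nx, ny = px / iw, py / ih  # normalized coordinates in [0, 1]
    return nx * fw, ny * fh    # real coordinates on the feature map
```

For example, a point at (128, 64) in a 256 x 256 training image maps to (4, 2) on an 8 x 8 training feature map.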
In some embodiments, determining candidate regions corresponding to the keypoints, the candidate regions including the keypoints comprises:
rounding down the abscissa and the ordinate of the real coordinates to obtain a rounded abscissa and a rounded ordinate;
expanding the rounded abscissa and the rounded ordinate by a preset unit to obtain an expanded abscissa and an expanded ordinate;
and combining abscissas and ordinates in pairs from the rounded abscissa, the rounded ordinate, the expanded abscissa and the expanded ordinate to obtain the pixel coordinates of the candidate region, where each abscissa is the rounded abscissa or the expanded abscissa, and each ordinate is the rounded ordinate or the expanded ordinate.
In some embodiments, determining the weight of the other diagonal vertex from the coordinates of a diagonal vertex in the diagonal vertex set and the real coordinates of the key point includes:
calculating a diagonal vertex horizontal coordinate difference value, wherein the horizontal coordinate difference value is a difference value between the horizontal coordinate of the diagonal vertex and the horizontal coordinate of the key point;
calculating a vertical coordinate difference value of a diagonal vertex, wherein the vertical coordinate difference value is a difference value between the vertical coordinate of the diagonal vertex and the vertical coordinate of the key point;
and multiplying the horizontal coordinate difference value and the vertical coordinate difference value of the diagonal vertex to obtain the weight of the other diagonal vertex.
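The weight computation above is the product of the two coordinate differences. In this sketch the absolute values are an assumption that keeps weights non-negative, and the function name is illustrative:

```python
def other_vertex_weight(diagonal_vertex, keypoint):
    """Weight assigned to the *other* vertex on the same diagonal,
    computed from this vertex's coordinate differences to the key point."""
    dx = abs(diagonal_vertex[0] - keypoint[0])  # horizontal difference
    dy = abs(diagonal_vertex[1] - keypoint[1])  # vertical difference
    return dx * dy
```

For example, for a key point at (2.3, 4.6), the diagonal vertex (3, 5) yields a weight of 0.7 x 0.4 = 0.28 for the opposite vertex (2, 4) — large when the opposite vertex is close to the key point.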
The embodiments of the application also provide a terminal, which includes a memory and a processor, where the memory stores a plurality of instructions; the processor loads the instructions from the memory to perform the steps of any image key point detection method provided by the embodiments of the application.
The embodiment of the present application further provides a computer-readable storage medium, where a plurality of instructions are stored in the computer-readable storage medium, and the instructions are suitable for being loaded by a processor to perform any of the steps in the image keypoint detection method provided in the embodiment of the present application.
According to the embodiments of the application, a target image and a preset feature map can be acquired; convolution processing is performed on the target image to obtain an output matrix, where the output matrix consists of a plurality of matrix elements and the matrix elements correspond one-to-one to the feature-map pixels in the preset feature map; the coordinates of the feature-map pixels are weighted according to the matrix elements to obtain the weighted pixel coordinates of the feature-map pixels corresponding to the matrix elements; the weighted pixel coordinates are summed to obtain the coordinates of the key point in the preset feature map; and the coordinates to which the key point in the preset feature map is mapped in the target image are determined.
In the application, convolution processing is performed on the target image to obtain an output matrix; the matrix elements in the output matrix weight the coordinates of the feature-map pixels in the preset feature map to obtain weighted pixel coordinates, and the weighted pixel coordinates are summed to obtain the coordinates of the key point in the preset feature map, from which the coordinates to which the key point is mapped in the target image are determined. The method therefore does not solve for the feature-map key-point coordinates in a non-end-to-end way as the Gaussian heat map method does, which reduces the errors that non-end-to-end computation can introduce. At the same time, unlike the direct regression method, the spatial information of the feature map is not lost, so the poor spatial generalization that degrades key-point accuracy is avoided. The key-point coordinates of the target image are obtained with only a small amount of computation, improving both the efficiency and the accuracy of obtaining key-point coordinates in the target image.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a scene schematic diagram of a keypoint detection method provided in an embodiment of the present application;
FIG. 1b is a schematic flow chart of a key point detection method provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a key point detection apparatus provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a mobile terminal according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an image key point detection method, an image key point detection device, a terminal and a storage medium.
The image key point detection apparatus may be integrated in an electronic device, and the electronic device may be a terminal, a server, or another device. The terminal may be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, or a personal computer (PC); the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the image keypoint detection apparatus may also be integrated into multiple electronic devices, for example, the image keypoint detection apparatus may be integrated into multiple servers, and the image keypoint detection method of the present application is implemented by the multiple servers.
In some embodiments, the server may also be implemented in the form of a terminal.
Currently, in the Gaussian heat map method, the model from image input to key point output is not fully differentiable. The key-point coordinates are obtained from the Gaussian heat map offline by taking the argmax of the feature map, so the Gaussian heat map method is non-end-to-end and loses more information when generating coordinates than an end-to-end approach. For example, if the output Gaussian heat map is 1/4 the size of the input image, one pixel of the heat map corresponds to 4 pixels of the input, and those 4 pixels carry no spatial position information within the heat-map pixel; when the heat map is mapped back to the input resolution, the original positions cannot be exactly recovered, so there is a pixel coordinate error. Meanwhile, the Gaussian heat map method must generate a high-resolution heat map, which requires a large amount of computation and therefore large memory consumption.
In the direct regression method, the two-dimensional feature map is converted into a one-dimensional vector by a fully connected layer, so the feature map loses its spatial information; spatial generalization is poor, and the requirement on data distribution balance is high. Spatial generalization refers to the model's ability to transfer localization learned at one position during training to another position at inference time; poor spatial generalization reduces inference accuracy.
Because the key-point coordinates of the feature map corresponding to an image are currently obtained in the above manners and the coordinate error is large, an embodiment of the present application provides an image key point detection method. Referring to fig. 1a and 1b, in an embodiment of the present disclosure the electronic device may be a mobile terminal that detects the key points of an image: the mobile terminal may acquire a target image and a preset feature map; perform convolution processing on the target image to obtain an output matrix, where the output matrix consists of a plurality of matrix elements that correspond one-to-one to the feature-map pixels in the preset feature map; weight the coordinates of the feature-map pixels according to the matrix elements to obtain weighted pixel coordinates; sum the weighted pixel coordinates to obtain the coordinates of the key point in the preset feature map; and determine the coordinates to which the key point in the preset feature map is mapped in the target image.
The method performs convolution on the target image to obtain an output matrix, weights the coordinates of the feature-map pixels in the preset feature map by the matrix elements in the output matrix to obtain weighted pixel coordinates, and sums the weighted pixel coordinates to obtain the coordinates of the key point in the preset feature map corresponding to the target image. Because the coordinates of the feature-map pixels are used directly, their spatial information is not lost, which helps improve the accuracy of the key-point coordinates; and because the key-point coordinates are produced directly rather than by non-end-to-end post-processing, the information loss of non-end-to-end computation is avoided. The key-point coordinates of the target image are therefore obtained with only a small amount of computation, improving both the efficiency and precision of obtaining key-point coordinates in the target image.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
In this embodiment, an image key point detection method is provided, as shown in fig. 1b, a specific flow of the image key point detection method may be as follows:
110. and acquiring a target image and a preset feature map.
The target image may be an electronic picture containing features to be recognized, such as human face features or hand posture features. The electronic picture may be a picture with RGB channels, a picture with CMYK channels, or the like. For example, the target image may be an electronic picture whose features to be identified require key-point position detection, which involves the coordinates of the key points of those features.
For example, the target image may be acquired in a scene of face recognition, hand gesture recognition, image key point prediction, and the like.
The target image may be obtained in various ways; depending on the source, it may be obtained locally or remotely. Locally, the target image is kept in local storage and pulled when it is to be processed. Remotely, the target image can be obtained by sending an acquisition request to the storage device that holds it.
The preset feature map is a feature map set in advance; its pixels carry no RGB values but do have pixel coordinates.
For example, the preset feature map may be a feature map of 5 × 5, and may also be a feature map of 9 × 9, which is not particularly limited herein.
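Such a preset feature map can be represented simply as a grid of pixel coordinates, since its pixels carry no color values. A sketch with an illustrative function name:

```python
import numpy as np

def preset_feature_map_coords(size=5):
    """Coordinate grid of a size x size preset feature map: entry
    [row, col] holds that pixel's (x, y) coordinates."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.stack([xs, ys], axis=-1)   # shape (size, size, 2)
```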
In some embodiments, to be effective in training the keypoint detection network, step 110 may be preceded by the steps of:
acquiring a plurality of training data sets and a key point detection network, wherein the key point detection network is used for predicting coordinates of key points in images, the training data sets are composed of a plurality of training images, and the labels of the training images are the coordinates of points to be processed in the training images;
training the key point detection network by using a plurality of training data sets until the key point detection network is converged to obtain a trained key point detection network;
performing convolution processing on a target image to obtain an output matrix, wherein the convolution processing comprises the following steps:
and carrying out convolution processing on the target image by adopting a key point detection network to obtain an output matrix.
The training data set may be a data set for training a convolutional neural network, consisting of training parameters: training images together with the annotated coordinates of the to-be-processed points in those images. A plurality of training data sets may be used when training the convolutional neural network.
The training data sets may be obtained in various ways, locally or remotely. For example, when the terminal needs to acquire a plurality of training data sets, it may read them from local storage or retrieve them from a remote server. Locally, the training parameters are stored and pulled when they are to be acquired; remotely, they can be obtained by sending an acquisition request to the storage device holding the training data set.
The key point detection network predicts the coordinates of the key points of the features to be identified in an image. For example, when predicting the position of a hand feature in an image, it may specifically predict the coordinates of the key points of the hand feature.
The key point detection network can be applied to hand position recognition, and in particular to entertainment and interaction scenarios such as short video and live streaming, where fingertip detection and finger-bone key point detection enable creative features such as hand special effects and air drawing, enriching the interactive experience.
Depending on the source, the key point detection network may be obtained locally or remotely. For example, if the network is stored on the local terminal, the terminal can read it from local storage; if it is stored on a server, the terminal must access the server remotely to obtain it.
The plurality of training data sets may consist of different types of training data. A training data set may include pictures with different image content, specifically pictures that require feature recognition and pictures that do not.
The target training images may be multiple types of images, wherein a portion of the target training images contain features to be recognized. For example, when a picture containing hand features is to be subjected to position prediction, the target training image may include an image containing hand features, and may also include an image having other features such as an image of face features.
Training the key point detection network with different types of training data sets improves its recognition ability; when a plurality of output values of the key point detection network are predicted correctly, the network is considered converged, giving the trained key point detection network.
There are various ways to train the key point detection network; depending on the setting, it may be trained locally or remotely. For example, training may be performed directly on the terminal, or remotely on a server.
In some embodiments, to serve the effect of training the keypoint detection network, the apparatus is further configured to:
acquiring a preset training characteristic diagram, wherein the preset training characteristic diagram is annotated with real coordinates of key points, and the real coordinates of the key points are coordinates of points to be processed of a training image, which are mapped on the preset training characteristic diagram;
determining a new training feature map corresponding to the preset training feature map based on the real coordinates of the key points of the preset training feature map;
summing the coordinates of all pixels in the new training feature map to obtain the predicted coordinates of the key points corresponding to the new training feature map;
and determining loss parameters of the key point detection network by adopting the real coordinates of the key points of the preset training characteristic diagram and the predicted coordinates of the key points corresponding to the new training characteristic diagram, and training the key point detection network based on the loss parameters.
The real coordinates of the key points of the preset training feature map may be coordinates of the point to be processed of the training image mapped on the preset training feature map.
For example, the coordinates of the to-be-processed point are annotated on the training image, and these coordinates are preprocessed so that they are mapped onto the preset training feature map; in this way, the real coordinates of the key point are annotated on the preset training feature map.
The new training feature map may be the feature map obtained after the coordinates of the pixels in the preset training feature map are weighted.
In some embodiments, to achieve the effect of obtaining a new training feature map, the apparatus is further configured to: determining a candidate region corresponding to the key points, wherein the candidate region comprises the key points;
determining the weight of another diagonal vertex in the diagonal vertex set according to the coordinates of the diagonal vertex in the diagonal vertex set and the real coordinates of the key points, wherein the diagonal vertex set comprises the diagonal vertices which are two vertices located on the same diagonal of the candidate area, and the vertices are pixels;
according to the weight, carrying out weighting processing on the coordinates of the pixels in the candidate area to obtain weighted coordinates;
and replacing the coordinates of the pixels in the corresponding candidate region before weighting processing by adopting the weighted coordinates of the pixels in the candidate region to obtain a new training feature map.
The candidate region may be a local region of the training feature map, where the local region includes the keypoints. For example, the candidate region may be a region surrounded by four pixels closest to the keypoint.
The coordinates of the key points can be predicted by the pixel coordinates around the key points.
In some embodiments, to achieve the effect of obtaining the candidate region, the apparatus is further configured to:
performing downward rounding on the abscissa and the ordinate of the real coordinate to obtain a rounded abscissa and a rounded ordinate;
expanding the rounding abscissa and the rounding ordinate in a preset unit to obtain an expanded abscissa and an expanded ordinate;
and combining abscissas and ordinates in pairs from the rounded abscissa, the rounded ordinate, the enlarged abscissa and the enlarged ordinate to obtain the pixel coordinates of the candidate region, wherein each abscissa is the rounded abscissa or the enlarged abscissa, and each ordinate is the rounded ordinate or the enlarged ordinate.
Wherein the coordinates contained in the candidate region are obtained as follows:
rounded abscissa: xf = floor(x);
rounded ordinate: yf = floor(y);
where x is the abscissa of the real coordinate, y is the ordinate of the real coordinate, and floor() denotes rounding down.
Enlarged abscissa: xc = xf + 1;
enlarged ordinate: yc = yf + 1.
The four vertex coordinates of the candidate region are then:
upper left corner coordinates: c_tl = (xf, yf);
lower left corner coordinates: c_tr = (xf, yc);
upper right corner coordinates: c_bl = (xc, yf);
lower right corner coordinates: c_br = (xc, yc).
The diagonal vertex set may be the combination of the upper left corner coordinates and the lower right corner coordinates, or the combination of the lower left corner coordinates and the upper right corner coordinates.
In some embodiments, to effect the weighting of the coordinates of the pixels in the candidate region, the apparatus is further configured to:
calculating a diagonal vertex horizontal coordinate difference value, wherein the horizontal coordinate difference value is a difference value between the horizontal coordinate of the diagonal vertex and the horizontal coordinate of the key point;
calculating a vertical coordinate difference value of a diagonal vertex, wherein the vertical coordinate difference value is a difference value between the vertical coordinate of the diagonal vertex and the vertical coordinate of the key point;
and multiplying the horizontal coordinate difference value and the vertical coordinate difference value of the diagonal vertex to obtain the weight of the other diagonal vertex.
Wherein, the weights of the diagonal vertices in the diagonal vertex set are as follows:
weight of the upper left corner coordinate: Value(c_tl) = (xc - xt) × (yc - yt);
weight of the lower left corner coordinate: Value(c_tr) = (xc - xt) × (yt - yf);
weight of the upper right corner coordinate: Value(c_bl) = (xt - xf) × (yc - yt);
weight of the lower right corner coordinate: Value(c_br) = (xt - xf) × (yt - yf);
wherein Value() is the coordinate weight and (xt, yt) denotes the real coordinate of the key point (i.e. the (x, y) above).
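The corner computation and the bilinear weight formulas above can be sketched as follows. This is a minimal Python illustration, not code from the patent; the function name is ours:

```python
import math

def bilinear_weights(xt, yt):
    """Assign bilinear weights to the four pixels around the real coordinate."""
    xf, yf = math.floor(xt), math.floor(yt)   # rounded abscissa / ordinate
    xc, yc = xf + 1, yf + 1                   # enlarged abscissa / ordinate
    return {
        (xf, yf): (xc - xt) * (yc - yt),      # Value(c_tl)
        (xf, yc): (xc - xt) * (yt - yf),      # Value(c_tr)
        (xc, yf): (xt - xf) * (yc - yt),      # Value(c_bl)
        (xc, yc): (xt - xf) * (yt - yf),      # Value(c_br)
    }

weights = bilinear_weights(1.2, 1.2)
```

For the real coordinate (1.2, 1.2) this yields 0.64, 0.16, 0.16 and 0.04 for the four pixels, and the four weights always sum to 1.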
As shown in table 1, for example, when the training coordinates are (1.2, 1.2) and the training grid has 5 horizontal and 5 vertical cells, the weights are distributed as follows:
TABLE 1 (rows: ordinate y = 0 to 4; columns: abscissa x = 0 to 4)
0    0       0       0    0
0    0.64    0.16    0    0
0    0.16    0.04    0    0
0    0       0       0    0
0    0       0       0    0
In some embodiments, the coordinates of the pixels in the candidate region are weighted according to the weights, resulting in weighted coordinates.
For example, when the real coordinates are (1.2, 1.2), the pixel coordinates in the candidate region are (1, 1), (1, 2), (2, 1), and (2, 2). The weight corresponding to pixel coordinate (1, 1) is 0.64, the weight corresponding to (1, 2) is 0.16, the weight corresponding to (2, 1) is 0.16, and the weight corresponding to (2, 2) is 0.04. Weighting the coordinates of the pixels in the candidate region by these weights, (1, 1) yields the weighted coordinates (0.64, 0.64), (1, 2) yields (0.16, 0.32), (2, 1) yields (0.32, 0.16), and (2, 2) yields (0.08, 0.08).
To distinguish it from the preset training feature map, the feature map holding the weighted pixel coordinates is referred to as the new training feature map.
In some embodiments, in order to have the effect of presetting the real coordinates of the key points of the training feature map, the apparatus is further configured to:
acquiring coordinates and image resolution of a point to be processed in a training image;
carrying out normalization processing on the coordinates of the points to be processed according to the image resolution of the training images to obtain normalized coordinates of the points to be processed;
acquiring the resolution of a preset training characteristic diagram;
and calculating the corresponding coordinates of the points to be processed mapped on the preset training feature map according to the resolution and the normalized coordinates of the preset training feature map, wherein the real coordinates corresponding to the preset training feature map comprise the corresponding coordinates of the points to be processed mapped on the preset training feature map.
The coordinates of the point to be processed may be coordinates of a key point of the feature to be recognized in the training image. For example, the training image is a figure including hand features, and when the hand features are recognized, the recognition is performed by coordinates of key points of the hand features.
For example, the resolutions of different training images may be different, resulting in coordinates of the to-be-processed points of different training images being in different coordinate systems, and the normalized coordinates may unify the abscissa and the ordinate of the to-be-processed point in the range of 0 to 1.
For example, the image size of training image A is 720 × 720 and the coordinates of its point to be processed are (252, 252), while the image size of training image B is 540 × 540 and the coordinates of its point to be processed are (189, 189). Training images A and B have the same content but different resolutions, so two different coordinates represent the same content, which is not conducive to training the key point detection network. Normalizing the coordinates of the points to be processed by the resolution of the training images makes the key point detection network trainable on both.
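The normalization step can be illustrated with the A/B example above; a hedged sketch using the resolutions and coordinates from that example:

```python
def normalize(coord, resolution):
    """Divide the coordinate by the image resolution, mapping it into [0, 1]."""
    (x, y), (w, h) = coord, resolution
    return (x / w, y / h)

# Training images A and B share the same content at different resolutions.
a = normalize((252, 252), (720, 720))
b = normalize((189, 189), (540, 540))
```

Both points normalize to (0.35, 0.35), so the network sees one consistent coordinate for the same content.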
The real coordinates corresponding to the preset training feature map may be the key point coordinates obtained by mapping the coordinates of the points to be processed in the training image onto the preset training feature map.
The predicted coordinates may be predicted coordinates of key points in a preset training feature map corresponding to the training image. The predicted coordinates can be coordinates of key points of a preset training characteristic diagram corresponding to the predicted training image by using a key point detection network, and the predicted coordinates can be coincident with the real coordinates, can be close to the real coordinates, and can also be completely deviated from the real coordinates.
For example, the real coordinates are (1.2, 1.2) and the weighted coordinates in the new training feature map are (0.64, 0.64), (0.16, 0.32), (0.32, 0.16), and (0.08, 0.08). Summing the coordinates of the pixels in the new training feature map gives (0.64 + 0.16 + 0.32 + 0.08, 0.64 + 0.32 + 0.16 + 0.08) = (1.2, 1.2), the predicted key point coordinates; here the predicted coordinates are the same as the real coordinates.
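The summation in this example can be checked directly; a small sketch using the weighted coordinates listed above:

```python
# Weighted pixel coordinates of the new training feature map for a
# key point whose real coordinates are (1.2, 1.2).
weighted = [(0.64, 0.64), (0.16, 0.32), (0.32, 0.16), (0.08, 0.08)]

# Summing per axis recovers the predicted key point coordinates.
predicted = (sum(x for x, _ in weighted), sum(y for _, y in weighted))
```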
Wherein the loss parameter may be used to indicate whether the predicted coordinates are the same as the real coordinates. For example, when the result to be recognized in the graph is a cat and the predicted result is a dog, the loss parameter is 100%, and when the predicted result is a cat, the loss parameter is 0%.
When the convolutional neural network is trained, the key point detection network needs to be trained according to loss parameters, so that the parameters of the key points used for predicting the image in the key point detection network are more accurate.
120. Performing convolution processing on the target image to obtain an output matrix, wherein the output matrix is composed of a plurality of matrix elements, and the matrix elements correspond one to one to the feature image pixels in the preset feature map.
In some embodiments, to achieve the effect of obtaining matrix elements, step 120 may include the steps of:
acquiring color parameters corresponding to each image pixel in a target image;
and carrying out convolution processing on the color parameters to obtain matrix elements.
For example, the color parameters may specifically include a color parameter of a red channel, a color parameter of a green channel, and a color parameter of a blue channel.
And performing convolution processing on the color parameters of the red channel, the green channel and the blue channel to obtain matrix elements.
Wherein, the output matrix element can be obtained by the following method:
Z=ax+by+ck+…;
wherein, Z may be a matrix element, a, b, and c may be parameters learned through training in the keypoint detection network, and x, y, and k may be RGB values corresponding to pixels of the target image.
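As a hypothetical illustration of the formula Z = ax + by + ck + …, one matrix element can be computed from a pixel's color channel values; the parameter values below are invented for the sketch, whereas in the network they would be learned during training:

```python
def matrix_element(rgb, params):
    """Linear combination of the pixel's color channels, as in Z = a*x + b*y + c*k."""
    x, y, k = rgb        # red, green and blue channel values
    a, b, c = params     # parameters assumed to have been learned
    return a * x + b * y + c * k

z = matrix_element((10, 20, 30), (0.1, 0.2, 0.3))  # 0.1*10 + 0.2*20 + 0.3*30
```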
130. And weighting the coordinates of the characteristic image pixels according to the matrix elements to obtain weighted pixel coordinates of the characteristic image pixels corresponding to the matrix elements.
Since the matrix elements correspond one to one to the feature image pixels, weighting the coordinates of a feature image pixel according to its matrix element means multiplying the matrix element by the coordinates of that pixel, which yields the weighted coordinates of the feature image pixel.
140. And summing the weighted pixel coordinates to obtain the coordinates of the key points in the preset characteristic diagram.
The coordinates of the key points of the preset feature map can be used for representing the positions of the key points of the features to be identified mapped in the target image.
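Steps 130 and 140 together amount to a soft-argmax over the preset feature map; a minimal sketch (the names are ours, not the patent's):

```python
def soft_argmax(output_matrix):
    """Weight each pixel's coordinates by its matrix element and sum them."""
    x_pred = sum(w * x for (x, y), w in output_matrix.items())
    y_pred = sum(w * y for (x, y), w in output_matrix.items())
    return (x_pred, y_pred)

# Matrix elements equal to the bilinear truth weights of a key point at (1.2, 1.2);
# all other pixels carry weight 0 and can be omitted from the sums.
keypoint = soft_argmax({(1, 1): 0.64, (1, 2): 0.16, (2, 1): 0.16, (2, 2): 0.04})
```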
150. And determining the coordinates of the key points in the preset feature map mapped in the target image.
In some embodiments, to achieve the effect of obtaining the coordinates of the key points in the target image, the apparatus is further configured to:
acquiring a resolution ratio between the preset feature map and the target image;
and amplifying the coordinates of the key points according to the resolution ratio to obtain the coordinates of the key points mapped in the target image.
The coordinates of the key points of the target image can be used for representing the positions of the key points of the features to be identified in the target image.
For example, when the resolution ratio between the target image and the feature map is 20 and the coordinates of the key point on the feature map are (1.2, 1.2), the coordinates of the key point on the target image are (24, 24).
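The mapping back to the target image is a single multiplication by the resolution ratio; a sketch using the numbers from the example:

```python
def map_to_image(feature_coord, resolution_ratio):
    """Scale feature-map key point coordinates up to target-image coordinates."""
    x, y = feature_coord
    return (x * resolution_ratio, y * resolution_ratio)

image_coord = map_to_image((1.2, 1.2), 20)
```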
The target image may be an image for which a hand gesture needs to be predicted, and the type of the hand gesture is determined by coordinates of a key point of the hand gesture.
The scheme provided by the embodiment of the application can be applied to image key point detection scenarios. Taking gesture key point detection as an example: a target image containing a gesture and a preset feature map are obtained; the target image is convolved to obtain an output matrix formed by a plurality of matrix elements; the coordinates of the feature map pixels in the preset feature map are weighted according to the matrix elements; and the weighted coordinates of the feature map pixels are summed to obtain the coordinates of the key points in the preset feature map corresponding to the target image, from which the coordinates of the key points mapped in the target image are determined. In this way, the feature map pixels in the preset feature map carry spatial information, which helps improve the accuracy of the output key point coordinates. Meanwhile, the application obtains the coordinates of the key points of the target image directly, end to end, which reduces the information loss introduced by non-end-to-end computation of the key point coordinates. The coordinates of the key points of the target image are therefore obtained with only a small amount of calculation, improving both the efficiency of obtaining the coordinates and their accuracy.
Therefore, the method and the device for obtaining the coordinates of the key points of the target image can reduce the calculation amount for obtaining the coordinates of the key points of the target image, reduce the calculation link which possibly loses coordinate information when the coordinates are obtained, and improve the obtaining efficiency and the coordinate precision of the coordinates of the key points of the target image.
The method described in the above embodiments is further described in detail below.
The output of the last layer of the fully convolutional neural network is a training feature map of W × H × C, where C is the number of coordinate points to be output. To train the convolutional neural network, the ground-truth values of the feature map need to be set for the network to learn. The rule is as follows: the four pixels nearest to the real coordinates are marked on the preset training feature map and their weights are set according to distance; the specific value of each weight is calculated from the distance between the pixel and the real coordinates using the bilinear formula, the weights of the four pixels sum to 1, and the weights of the remaining pixel coordinates of the preset training feature map are 0. After the new training feature map is obtained, the cross-entropy loss between the network output and the ground-truth feature map is calculated, and the convolutional neural network is then trained by gradient descent. When the network is used for coordinate prediction, the weight of each pixel on the new training feature map is multiplied by the coordinates of that pixel to obtain the coordinates to be predicted.
The normalized coordinates are (x, y) and the preset training feature map size is (W, H). The normalized coordinates may be obtained by dividing the horizontal and vertical coordinates of the original coordinates by the resolution of the image, respectively. Firstly, calculating the position of the normalized coordinate on a preset training characteristic diagram:
when the mapping flag is 0, the position of the normalized coordinate on the preset training feature map is (xt, yt) = (x × (W - 1), y × (H - 1));
when the mapping flag is 1, the position of the normalized coordinate on the preset training feature map is (xt, yt) = (x × W, y × H).
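The two mapping formulas above can be sketched as one function with a 0/1 flag; the flag name is our label for the switch described in the text:

```python
def to_feature_map(normalized, size, flag):
    """Map a normalized coordinate onto a feature map of the given size."""
    (x, y), (w, h) = normalized, size
    if flag == 0:
        return (x * (w - 1), y * (h - 1))
    return (x * w, y * h)

pos0 = to_feature_map((0.35, 0.35), (5, 5), 0)
pos1 = to_feature_map((0.35, 0.35), (5, 5), 1)
```

For a 5 × 5 feature map, the normalized point (0.35, 0.35) lands at (1.4, 1.4) under the first convention and at (1.75, 1.75) under the second.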
Then the 4 pixel positions nearest to (xt, yt) on the preset training feature map are calculated, where pixel positions are two-dimensional integer coordinates from (0, 0) to (W-1, H-1):
xf=floor(xt);
yf=floor(yt);
xc=xf+1;
yc=yf+1。
Where floor() represents rounding down. The positions of the 4 pixels are then the upper left corner coordinates: c_tl = (xf, yf); lower left corner coordinates: c_tr = (xf, yc); upper right corner coordinates: c_bl = (xc, yf); lower right corner coordinates: c_br = (xc, yc).
The value of each pixel is bilinearly distributed according to the distance of the marked coordinates, and the specific calculation formula is as follows:
weight of the upper left corner coordinate: Value(c_tl) = (xc - xt) × (yc - yt);
weight of the lower left corner coordinate: Value(c_tr) = (xc - xt) × (yt - yf);
weight of the upper right corner coordinate: Value(c_bl) = (xt - xf) × (yc - yt);
weight of the lower right corner coordinate: Value(c_br) = (xt - xf) × (yt - yf);
the values of the positions other than the 4 pixels on the feature map are set to 0.
Denoting the matrix element output by the convolutional neural network at position ci as Pre(ci), the cross-entropy loss is:
Loss=-(Value(c_tl)*log(Pre(c_tl))+Value(c_tr)*log(Pre(c_tr))+Value(c_bl)*log(Pre(c_bl))+Value(c_br)*log(Pre(c_br)))。
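The cross-entropy loss can be sketched numerically. One hedged observation: when the network output Pre equals the truth weights Value, the loss reduces to the entropy of the truth distribution:

```python
import math

def cross_entropy(value, pre):
    """Loss = -sum over the four candidate pixels of Value(ci) * log(Pre(ci))."""
    return -sum(value[c] * math.log(pre[c]) for c in value)

truth = {"c_tl": 0.64, "c_tr": 0.16, "c_bl": 0.16, "c_br": 0.04}
loss_at_optimum = cross_entropy(truth, truth)  # network output matches truth
```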
after the loss parameters are obtained, the key point detection network can be trained by updating the network weight by using a gradient descent method.
After the key point detection network training is finished, multiplying the matrix element of each pixel on the feature map by the coordinate of the pixel to obtain the coordinate of the key point of the feature map:
∑Value(ci)*ci=(x_pre,y_pre);
wherein ci traverses the positions of the pixels on the feature map, x_pre is the predicted abscissa of the key point, and y_pre is the predicted ordinate of the key point.
When the output of the key point detection network is close to the training feature map, the coordinates of the key points of the feature map are: Σ Value(ci) × ci = Value(c_tl) × c_tl + Value(c_tr) × c_tr + Value(c_bl) × c_bl + Value(c_br) × c_br = (xt, yt).
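Putting the pieces of this section together, a short end-to-end sketch under the assumptions stated in the comments (the (x × W, y × H) mapping is used; the image and feature-map sizes are illustrative):

```python
import math

def truth_weights(xt, yt):
    """Bilinear truth weights of the four pixels around (xt, yt)."""
    xf, yf = math.floor(xt), math.floor(yt)
    xc, yc = xf + 1, yf + 1
    return {(xf, yf): (xc - xt) * (yc - yt),
            (xf, yc): (xc - xt) * (yt - yf),
            (xc, yf): (xt - xf) * (yc - yt),
            (xc, yc): (xt - xf) * (yt - yf)}

def weighted_sum(weights):
    """Sum Value(ci) * ci over the pixels, giving (x_pre, y_pre)."""
    return (sum(w * x for (x, y), w in weights.items()),
            sum(w * y for (x, y), w in weights.items()))

# Annotated point (252, 252) on a 720 x 720 image, mapped onto a 5 x 5 feature map.
x_norm, y_norm = 252 / 720, 252 / 720
xt, yt = x_norm * 5, y_norm * 5                 # position on the feature map
x_pre, y_pre = weighted_sum(truth_weights(xt, yt))
```

When the network output equals the truth weights, the weighted sum recovers (xt, yt) exactly, consistent with the identity above.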
In conclusion, when the method is applied to training a key point detection network, the coordinates of the preset feature map are weighted and the weighted coordinates are summed to obtain the key point coordinate values. This avoids the high-resolution feature maps and the non-end-to-end computation required by the Gaussian heat map method, together with the error lower bound they introduce, and improves the efficiency and accuracy of the numerical coordinate regression task.
In order to better implement the method, an embodiment of the present application further provides an image keypoint detecting device, where the image keypoint detecting device may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server, or other devices. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in this embodiment, the method of the embodiment of the present application will be described in detail by taking an example in which the image key point detection device is specifically integrated in a terminal.
For example, as shown in fig. 2, the image keypoint detecting device may include:
(one) an acquisition unit 210;
the obtaining unit 210 is configured to obtain a target image and a preset feature map.
In some embodiments, before acquiring the target image and the preset feature map, the apparatus is further configured to:
acquiring a plurality of training data sets and a key point detection network, wherein the key point detection network is used for predicting the coordinates of key points in images, each training data set is composed of a plurality of training images, and each training image is annotated with the coordinates of its points to be processed;
training the key point detection network by using a plurality of training data sets until the key point detection network is converged to obtain a trained key point detection network;
performing convolution processing on a target image to obtain an output matrix, wherein the convolution processing comprises the following steps:
and performing convolution processing on the target image by adopting a key point detection network to obtain an output matrix.
In some embodiments, training a keypoint detection network with a plurality of training data sets comprises:
acquiring a preset training characteristic diagram, wherein the preset training characteristic diagram is annotated with real coordinates of key points, and the real coordinates of the key points are coordinates of points to be processed of a training image, which are mapped on the preset training characteristic diagram;
determining a new training feature map corresponding to the preset training feature map based on the real coordinates of the key points of the preset training feature map;
summing the coordinates of all pixels in the new training feature map to obtain the predicted coordinates of the key points corresponding to the new training feature map;
and determining loss parameters of the key point detection network by adopting the real coordinates of the key points of the preset training characteristic diagram and the predicted coordinates of the key points corresponding to the new training characteristic diagram, and training the key point detection network based on the loss parameters.
In some embodiments, determining a new training feature map corresponding to the preset training feature map based on the key points includes:
determining a candidate region corresponding to the key points, wherein the candidate region comprises the key points;
determining the weight of another diagonal vertex in the diagonal vertex set according to the coordinates of the diagonal vertex in the diagonal vertex set and the real coordinates of the key points, wherein the diagonal vertex set comprises the diagonal vertices which are two vertices located on the same diagonal of the candidate area, and the vertices are pixels;
according to the weight, carrying out weighting processing on the coordinates of the pixels in the candidate area to obtain weighted coordinates;
and replacing the coordinates of the pixels in the corresponding candidate region before weighting processing by adopting the weighted coordinates of the pixels in the candidate region to obtain a new training feature map.
In some embodiments, a preset training feature map is obtained, the preset training feature map is annotated with the real coordinates of key points, the real coordinates of the key points being the coordinates of a point to be processed of the training image mapped on the preset training feature map, and the apparatus is further configured to:
acquiring coordinates and image resolution of a point to be processed in a training image;
carrying out normalization processing on the coordinates of the points to be processed according to the image resolution of the target training image to obtain normalized coordinates of the points to be processed;
acquiring the resolution of a preset training characteristic diagram;
and calculating the corresponding coordinates of the points to be processed mapped on the preset training feature map according to the resolution and the normalized coordinates of the preset training feature map, wherein the real coordinates corresponding to the preset training feature map comprise the corresponding coordinates of the points to be processed mapped on the preset training feature map.
In some embodiments, a candidate region corresponding to the keypoint is determined, the candidate region including the keypoint, the apparatus being further configured to:
performing downward rounding on the abscissa and the ordinate of the real coordinate to obtain a rounded abscissa and a rounded ordinate;
expanding the rounding abscissa and the rounding ordinate in a preset unit to obtain an expanded abscissa and an expanded ordinate;
and combining abscissas and ordinates in pairs from the rounded abscissa, the rounded ordinate, the enlarged abscissa and the enlarged ordinate to obtain the pixel coordinates of the candidate region, wherein each abscissa is the rounded abscissa or the enlarged abscissa, and each ordinate is the rounded ordinate or the enlarged ordinate.
In some embodiments, determining a weight of a diagonal vertex in the set of diagonal vertices from coordinates of the diagonal vertex and coordinates of the keypoint comprises:
calculating a diagonal vertex horizontal coordinate difference value, wherein the horizontal coordinate difference value is a difference value between the horizontal coordinate of the diagonal vertex and the horizontal coordinate of the key point;
calculating a vertical coordinate difference value of a diagonal vertex, wherein the vertical coordinate difference value is a difference value between the vertical coordinate of the diagonal vertex and the vertical coordinate of the key point;
and multiplying the horizontal coordinate difference value and the vertical coordinate difference value of the diagonal vertex to obtain the weight of the other diagonal vertex.
(two) an output matrix unit 220;
and an output matrix unit 220, configured to perform convolution processing on the target image to obtain an output matrix, where the output matrix is composed of a plurality of matrix elements, and the matrix elements correspond to feature image pixels in a preset feature map one to one.
In some embodiments, convolving the target image to obtain an output matrix, the output matrix comprising a plurality of matrix elements, includes:
acquiring color parameters corresponding to each image pixel in a target image;
and carrying out convolution processing on the color parameters to obtain matrix elements.
(three) a coordinate weighting unit 230;
and a coordinate weighting unit 230, configured to weight the coordinates of the feature image pixels according to the matrix elements, so as to obtain weighted pixel coordinates of the feature image pixels corresponding to the matrix elements.
(four) a coordinate determination unit 240;
and the coordinate determination unit 240 is configured to sum the weighted pixel coordinates to obtain coordinates of key points in the preset feature map.
(five) a coordinate mapping unit 250;
and a coordinate mapping unit 250, configured to determine the coordinates in the target image onto which the key points in the preset feature map are mapped.
In some embodiments, the coordinates of the key points in the preset feature map mapped in the target image are determined, and the apparatus is further configured to:
acquiring a resolution ratio between a preset feature map and a target image;
and amplifying the coordinates of the key points according to the resolution ratio to obtain the coordinates of the key points mapped in the target image.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in the keypoint detection apparatus of the embodiment, the acquisition unit acquires the target image and the preset feature map; performing convolution processing on the target image by an output matrix unit to obtain an output matrix, wherein the output matrix consists of a plurality of matrix elements, and the matrix elements correspond to characteristic image pixels in a preset characteristic image one by one; weighting the coordinates of the characteristic image pixels corresponding to the matrix elements by a coordinate weighting unit to obtain weighted coordinates of the characteristic image pixels corresponding to the matrix elements; the coordinate determination unit sums the weighted pixel coordinates to obtain coordinates of key points in a preset feature map; and determining the coordinates of the key points in the preset feature map, which are mapped in the target image, by a coordinate mapping unit.
Therefore, the coordinates of the key points of the target image are obtained only through a small amount of calculation, and the efficiency of obtaining the coordinates of the key points of the target image and the accuracy of the coordinates are improved.
Therefore, the efficiency of detecting the key points in the target image is improved.

The embodiment of the application also provides the electronic equipment, which can be equipment such as a terminal or a server. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, or the like.
In some embodiments, the image key point detection apparatus may also be integrated in a plurality of electronic devices, for example, the image key point detection apparatus may be integrated in a plurality of servers, and the key point detection method of the present application is implemented by the plurality of servers.
In this embodiment, a detailed description will be given by taking the electronic device of this embodiment as an example of a mobile terminal, for example, as shown in fig. 3, which shows a schematic structural diagram of the mobile terminal according to the embodiment of the present application, specifically:
the mobile terminal may include components such as a processor 310 of one or more processing cores, memory 320 of one or more computer-readable storage media, a power supply 330, an input module 340, and a communication module 350. Those skilled in the art will appreciate that the configuration shown in fig. 3 does not constitute a limitation of the mobile terminal and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 310 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory 320, thereby performing overall monitoring of the mobile terminal. In some embodiments, processor 310 may include one or more processing cores; in some embodiments, the processor 310 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 310.
The memory 320 may be used to store software programs and modules, and the processor 310 executes various functional applications and data processing by operating the software programs and modules stored in the memory 320. The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the mobile terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 320 may also include a memory controller to provide the processor 310 with access to the memory 320.
The mobile terminal further includes a power supply 330 for supplying power to the various components. In some embodiments, the power supply 330 may be logically connected to the processor 310 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 330 may also include one or more DC or AC power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and other such components.
The mobile terminal may further include an input module 340, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, microphone, optical, or trackball signal inputs related to user settings and function control.
The mobile terminal may further include a communication module 350. In some embodiments, the communication module 350 may include a wireless module, through which the mobile terminal may perform short-range wireless transmission and thereby provide wireless broadband internet access to the user. For example, the communication module 350 may be used to help the user send and receive email, browse web pages, access streaming media, and the like.
Although not shown, the mobile terminal may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 310 in the mobile terminal loads the executable files corresponding to the processes of one or more application programs into the memory 320 according to the following instructions, and the processor 310 runs the application programs stored in the memory 320, thereby implementing the following functions:
acquiring a target image and a preset feature map;
performing convolution processing on the target image to obtain an output matrix, wherein the output matrix consists of a plurality of matrix elements, and the matrix elements correspond one-to-one to feature map pixels in the preset feature map;
weighting the coordinates of the feature map pixels according to the matrix elements to obtain weighted pixel coordinates of the feature map pixels corresponding to the matrix elements;
summing the weighted pixel coordinates to obtain the coordinates of key points in the preset feature map;
and determining the coordinates in the target image to which the key points in the preset feature map are mapped.
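Taken together, the steps above amount to a soft-argmax over the feature map: each pixel's coordinates are weighted by the corresponding output-matrix element, and the weighted coordinates are summed. A minimal NumPy sketch of this idea, assuming the output matrix is first normalized into a weight distribution (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def soft_argmax_keypoint(output_matrix):
    """Estimate a key point as the weighted sum of pixel coordinates,
    where the weights are the (normalized) output-matrix elements."""
    h, w = output_matrix.shape
    weights = output_matrix / output_matrix.sum()  # normalize to sum to 1
    ys, xs = np.mgrid[0:h, 0:w]                    # per-pixel coordinates
    # Weight each pixel coordinate by its matrix element, then sum.
    return float((weights * xs).sum()), float((weights * ys).sum())
```

Because every pixel contributes to the sum, the estimate is differentiable and can fall between integer pixel positions, which is what permits end-to-end training.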
The specific implementation of the above operations is described in the foregoing embodiments and is not repeated herein.
As can be seen from the above, in this embodiment the coordinates of the feature map key points are not obtained in a non-end-to-end manner, as in the Gaussian heat map method, which reduces the errors that can arise when the key point coordinates are computed non-end-to-end. At the same time, the spatial information of the feature map is not discarded when computing the key point coordinates, as in the direct regression method, where the loss of spatial information leads to poor spatial generalization and thus degrades the accuracy of the key point coordinates. With the present method and device, the coordinates of the key points of the target image are obtained with only a small amount of calculation, thereby improving both the efficiency of obtaining the key point coordinates of the target image and their accuracy.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by associated hardware controlled by instructions, and that the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of computer programs are stored, where the computer programs can be loaded by a processor to execute the steps in any one of the image keypoint detection methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
acquiring a target image and a preset feature map;
performing convolution processing on the target image to obtain an output matrix, wherein the output matrix consists of a plurality of matrix elements, and the matrix elements correspond one-to-one to feature map pixels in the preset feature map;
weighting the coordinates of the feature map pixels according to the matrix elements to obtain weighted pixel coordinates of the feature map pixels corresponding to the matrix elements;
summing the weighted pixel coordinates to obtain the coordinates of key points in the preset feature map;
and determining the coordinates in the target image to which the key points in the preset feature map are mapped.
The specific implementation of the above operations is described in the foregoing embodiments and is not repeated herein.
Wherein the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any image key point detection method provided in the embodiments of the present application, beneficial effects that can be achieved by any image key point detection method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The image key point detection method, device, terminal, and storage medium provided by the embodiments of the present application have been introduced in detail above. A specific example is used herein to illustrate the principle and implementation of the present application, and the description of the embodiments is only intended to help in understanding the method and core idea of the present application. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (12)

1. An image key point detection method is characterized by comprising the following steps:
acquiring a target image and a preset feature map;
performing convolution processing on the target image to obtain an output matrix, wherein the output matrix consists of a plurality of matrix elements, and the matrix elements correspond one-to-one to feature map pixels in the preset feature map;
weighting the coordinates of the feature map pixels according to the matrix elements to obtain weighted pixel coordinates of the feature map pixels corresponding to the matrix elements;
summing the weighted pixel coordinates to obtain the coordinates of key points in the preset feature map;
and determining the coordinates in the target image to which the key points in the preset feature map are mapped.
2. The image key point detection method of claim 1, wherein the determining the coordinates in the target image to which the key points in the preset feature map are mapped comprises:
acquiring a resolution ratio between the preset feature map and the target image;
and scaling up the coordinates of the key points according to the resolution ratio to obtain the coordinates of the key points mapped in the target image.
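For instance, with a 64x64 feature map and a 256x256 target image, the resolution ratio is 4 per axis. A tiny sketch of this scaling step (the helper name and its parameters are illustrative, not from the patent):

```python
def map_to_image(kp_x, kp_y, image_size, feature_size):
    """Scale feature-map key point coordinates up to the target image
    by the per-axis resolution ratio (a sketch of this claim)."""
    (img_w, img_h), (feat_w, feat_h) = image_size, feature_size
    return kp_x * (img_w / feat_w), kp_y * (img_h / feat_h)
```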
3. The image keypoint detection method of claim 1, wherein said convolving said target image to obtain an output matrix, said output matrix being composed of a plurality of matrix elements, comprises:
acquiring color parameters corresponding to each image pixel in the target image;
and carrying out convolution processing on the color parameters to obtain matrix elements.
4. The image key point detection method according to claim 1, wherein before the acquiring the target image and the preset feature map, the method comprises:
acquiring a plurality of training data sets and a key point detection network, wherein the key point detection network is used for predicting coordinates of key points in images, the training data sets are composed of a plurality of training images, and the labels of the training images are the coordinates of points to be processed in the training images;
training the key point detection network by using the plurality of training data sets until the key point detection network is converged to obtain the trained key point detection network;
the convolution processing of the target image to obtain an output matrix includes:
and carrying out convolution processing on the target image by adopting a key point detection network to obtain an output matrix.
5. The image keypoint detection method of claim 4, wherein said training the keypoint detection network using the plurality of training data sets comprises:
acquiring a preset training feature map, wherein the preset training feature map is annotated with the real coordinates of key points, and the real coordinates of the key points are the coordinates of the points to be processed of the training image mapped on the preset training feature map;
determining a new training feature map corresponding to the preset training feature map based on the real coordinates of the key points of the preset training feature map;
summing the coordinates of all pixels in the new training feature map to obtain the predicted coordinates of the key points corresponding to the new training feature map;
and determining loss parameters of the key point detection network by adopting the real coordinates of the key points of the preset training characteristic diagram and the predicted coordinates of the key points corresponding to the new training characteristic diagram, and training the key point detection network based on the loss parameters.
6. The image keypoint detection method of claim 5, wherein said determining a new training feature map corresponding to the preset training feature map based on the real coordinates of the keypoints of the preset training feature map comprises:
determining a candidate region corresponding to the key point, wherein the candidate region comprises the key point;
determining the weight of one diagonal vertex in a diagonal vertex set according to the coordinates of the other diagonal vertex in the set and the real coordinates of the key point, wherein the diagonal vertex set comprises the two vertices located on the same diagonal of the candidate region, and each vertex is a pixel;
according to the weight, carrying out weighting processing on the coordinates of the pixels in the candidate area to obtain weighted coordinates;
and replacing the coordinates of the pixels in the corresponding candidate region before weighting processing by adopting the weighted coordinates of the pixels in the candidate region to obtain a new training feature map.
7. The image key point detection method of claim 5, wherein the acquiring a preset training feature map, the preset training feature map being annotated with the real coordinates of key points, the real coordinates of the key points being the coordinates of the points to be processed of the training image mapped on the preset training feature map, comprises:
acquiring coordinates and image resolution of a point to be processed in a training image;
carrying out normalization processing on the coordinates of the point to be processed according to the image resolution of the training image to obtain normalized coordinates of the point to be processed;
acquiring the resolution of a preset training characteristic diagram;
and calculating the corresponding coordinates of the point to be processed mapped on the preset training feature map according to the resolution of the preset training feature map and the normalized coordinates, wherein the real coordinates corresponding to the preset training feature map comprise the corresponding coordinates of the point to be processed mapped on the preset training feature map.
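The mapping in this claim can be sketched as two scalings: normalize the labelled coordinates by the training-image resolution, then multiply by the feature-map resolution (a hypothetical helper for illustration, not the patent's code):

```python
def image_to_feature_coords(px, py, image_size, feature_size):
    """Map a labelled point from the training image onto the preset
    training feature map via normalized coordinates."""
    (img_w, img_h), (feat_w, feat_h) = image_size, feature_size
    nx, ny = px / img_w, py / img_h      # normalize to [0, 1]
    return nx * feat_w, ny * feat_h      # rescale to feature-map units
```

The resulting coordinates are generally fractional, which is why the surrounding claims then distribute the point over the four neighbouring pixels.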
8. The image keypoint detection method of claim 6, wherein said determining a candidate region corresponding to said keypoint, said candidate region comprising said keypoint, comprises:
rounding down the abscissa and the ordinate of the real coordinates to obtain a rounded abscissa and a rounded ordinate;
expanding the rounded abscissa and the rounded ordinate by a preset unit to obtain an expanded abscissa and an expanded ordinate;
and combining the abscissas and the ordinates pairwise, based on the rounded abscissa, the rounded ordinate, the expanded abscissa, and the expanded ordinate, to obtain the pixel coordinates of the candidate region, wherein each abscissa is the rounded abscissa or the expanded abscissa, and each ordinate is the rounded ordinate or the expanded ordinate.
9. The image key point detection method of claim 6, wherein the determining the weight of a diagonal vertex in the diagonal vertex set comprises:
calculating the abscissa difference of the other diagonal vertex, wherein the abscissa difference is the difference between the abscissa of that diagonal vertex and the abscissa of the key point;
calculating the ordinate difference of the other diagonal vertex, wherein the ordinate difference is the difference between the ordinate of that diagonal vertex and the ordinate of the key point;
and multiplying the abscissa difference and the ordinate difference of that diagonal vertex to obtain the weight of the diagonal vertex.
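Read together, claims 8 and 9 describe standard bilinear weights: the fractional key point is enclosed in a 2x2 candidate region, and each vertex's weight is the product of the coordinate differences between the key point and the opposite diagonal vertex. A sketch under that reading (the function name and dictionary layout are illustrative, not from the patent):

```python
import math

def bilinear_vertex_weights(kx, ky):
    """Weights for the four pixels of the candidate region around the
    fractional key point (kx, ky); each vertex's weight is computed from
    the coordinate differences of its opposite diagonal vertex."""
    x0, y0 = math.floor(kx), math.floor(ky)   # rounded-down coordinates
    x1, y1 = x0 + 1, y0 + 1                   # expanded by one unit
    weights = {}
    for vx, vy in [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]:
        ox, oy = x0 + x1 - vx, y0 + y1 - vy   # opposite diagonal vertex
        weights[(vx, vy)] = abs(ox - kx) * abs(oy - ky)
    return weights  # the four weights sum to 1
```

The nearer the key point lies to a vertex, the larger that vertex's weight, and the weighted sum of the four vertex coordinates recovers (kx, ky) exactly, consistent with the summation step of claim 5.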
10. An image key point detection device characterized by comprising:
the acquisition unit is used for acquiring a target image and a preset characteristic diagram;
the output matrix unit is used for performing convolution processing on the target image to obtain an output matrix, the output matrix consisting of a plurality of matrix elements, wherein the matrix elements correspond one-to-one to feature map pixels in the preset feature map;
the coordinate weighting unit is used for weighting the coordinates of the feature map pixels according to the matrix elements to obtain weighted pixel coordinates of the feature map pixels corresponding to the matrix elements;
the coordinate determination unit is used for summing the weighted pixel coordinates to obtain the coordinates of key points in the preset feature map;
and the coordinate mapping unit is used for determining the coordinates in the target image to which the key points in the preset feature map are mapped.
11. A terminal comprising a processor and a memory, said memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the image keypoint detection method according to any of claims 1 to 9.
12. A computer readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the image keypoint detection method according to any one of claims 1 to 9.
CN202111131548.6A 2021-09-26 2021-09-26 Image key point detection method, device, terminal and storage medium Active CN113838134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111131548.6A CN113838134B (en) 2021-09-26 2021-09-26 Image key point detection method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113838134A true CN113838134A (en) 2021-12-24
CN113838134B CN113838134B (en) 2024-03-12

Family

ID=78970252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111131548.6A Active CN113838134B (en) 2021-09-26 2021-09-26 Image key point detection method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113838134B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022480A (en) * 2022-01-06 2022-02-08 杭州健培科技有限公司 Medical image key point detection method and device based on statistics and shape topological graph
CN114333067A (en) * 2021-12-31 2022-04-12 深圳市联洲国际技术有限公司 Behavior activity detection method, behavior activity detection device and computer readable storage medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122705A (en) * 2017-03-17 2017-09-01 中国科学院自动化研究所 Face critical point detection method based on three-dimensional face model
CN109508678A (en) * 2018-11-16 2019-03-22 广州市百果园信息技术有限公司 Training method, the detection method and device of face key point of Face datection model
CN110348412A (en) * 2019-07-16 2019-10-18 广州图普网络科技有限公司 A kind of key independent positioning method, device, electronic equipment and storage medium
CN110909664A (en) * 2019-11-20 2020-03-24 北京奇艺世纪科技有限公司 Human body key point identification method and device and electronic equipment
CN111028212A (en) * 2019-12-02 2020-04-17 上海联影智能医疗科技有限公司 Key point detection method and device, computer equipment and storage medium
CN111862126A (en) * 2020-07-09 2020-10-30 北京航空航天大学 Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN111862047A (en) * 2020-07-22 2020-10-30 杭州健培科技有限公司 Cascaded medical image key point detection method and device
US20210012143A1 (en) * 2018-12-25 2021-01-14 Zhejiang Sensetime Technology Development Co., Ltd. Key Point Detection Method and Apparatus, and Storage Medium
US20210012523A1 (en) * 2018-12-25 2021-01-14 Zhejiang Sensetime Technology Development Co., Ltd. Pose Estimation Method and Device and Storage Medium
CN112329740A (en) * 2020-12-02 2021-02-05 广州博冠信息科技有限公司 Image processing method, image processing apparatus, storage medium, and electronic device
CN112464809A (en) * 2020-11-26 2021-03-09 北京奇艺世纪科技有限公司 Face key point detection method and device, electronic equipment and storage medium
CN112801043A (en) * 2021-03-11 2021-05-14 河北工业大学 Real-time video face key point detection method based on deep learning
US20210213615A1 (en) * 2020-01-10 2021-07-15 Mujin, Inc. Method and system for performing image classification for object recognition
US20210241015A1 (en) * 2020-02-03 2021-08-05 Beijing Sensetime Technology Development Co., Ltd. Image processing method and apparatus, and storage medium
US20210295088A1 (en) * 2020-12-11 2021-09-23 Beijing Baidu Netcom Science & Technology Co., Ltd Image detection method, device, storage medium and computer program product

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WENMING CAO ET AL.: "Supplementary Virtual Keypoints of Weight-Based Correspondences for Occluded Object Tracking", IEEE *
WU CHENG: "Research and Implementation of Human Pose Key Point Detection Technology", China Masters' Theses Full-text Database *
SHEN MAODONG; ZHOU WEI; SONG XIAODONG; PEI JIAN; DENG HAO; MA CHAO; FANG KAI: "Detection of Violation Operations in Power Maintenance Based on Improved Mask RCNN", Computer Systems & Applications, no. 08, 15 August 2020 (2020-08-15) *
CHEN DONGMIN; YAO JIANMIN: "Face Landmark Detection Algorithm Based on Optimized Parallel AlexNet", Information Technology and Network Security, no. 04 *

Also Published As

Publication number Publication date
CN113838134B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
US10803554B2 (en) Image processing method and device
WO2020108358A1 (en) Image inpainting method and apparatus, computer device, and storage medium
CN111401376B (en) Target detection method, target detection device, electronic equipment and storage medium
CN106874937B (en) Text image generation method, text image generation device and terminal
CN111242844B (en) Image processing method, device, server and storage medium
CN112329740B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN113838134B (en) Image key point detection method, device, terminal and storage medium
CN112163577B (en) Character recognition method and device in game picture, electronic equipment and storage medium
CN111652974B (en) Method, device, equipment and storage medium for constructing three-dimensional face model
CN111126140A (en) Text recognition method and device, electronic equipment and storage medium
CN111292262B (en) Image processing method, device, electronic equipment and storage medium
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN115294275A (en) Method and device for reconstructing three-dimensional model and computer readable storage medium
CN109598250A (en) Feature extracting method, device, electronic equipment and computer-readable medium
US20240037898A1 (en) Method for predicting reconstructabilit, computer device and storage medium
CN113516697B (en) Image registration method, device, electronic equipment and computer readable storage medium
CN112070854B (en) Image generation method, device, equipment and storage medium
CN111639523B (en) Target detection method, device, computer equipment and storage medium
CN110717405B (en) Face feature point positioning method, device, medium and electronic equipment
CN111539353A (en) Image scene recognition method and device, computer equipment and storage medium
CN112541902A (en) Similar area searching method, similar area searching device, electronic equipment and medium
CN112257729A (en) Image recognition method, device, equipment and storage medium
CN111898544A (en) Character and image matching method, device and equipment and computer storage medium
CN115933949A (en) Coordinate conversion method and device, electronic equipment and storage medium
CN115112661A (en) Defect detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant