CN114663671A - Target detection method, device, equipment and storage medium

Target detection method, device, equipment and storage medium

Info

Publication number
CN114663671A
CN114663671A
Authority
CN
China
Prior art keywords
image
target
model
output
inputting
Prior art date
Legal status
Granted
Application number
CN202210158387.8A
Other languages
Chinese (zh)
Other versions
CN114663671B (en)
Inventor
Edwin William Trejo Ponte (埃德温·威廉·特雷霍·庞特)
Current Assignee
PCI Technology Group Co Ltd
PCI Technology and Service Co Ltd
Original Assignee
PCI Technology Group Co Ltd
PCI Technology and Service Co Ltd
Priority date
Filing date
Publication date
Application filed by PCI Technology Group Co Ltd and PCI Technology and Service Co Ltd
Priority to CN202210158387.8A
Publication of CN114663671A
Application granted
Publication of CN114663671B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method, device, equipment and storage medium, wherein the method comprises the following steps: performing zero-pixel expansion on an image to be detected to expand it into a square image; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, the convolution network in the size adjustment model being configured with deformable convolution layers and downsampling layers; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result. These technical means solve the prior-art problem of low target detection accuracy caused by loss and deformation of features of the image to be detected, guarantee the integrity and accuracy of the image feature data extracted by the deep learning model, and improve the accuracy of target detection.

Description

Target detection method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computer vision, in particular to a target detection method, a target detection device, target detection equipment and a storage medium.
Background
Target detection techniques are used in computer vision to locate target objects in images, generally by detecting them with a deep learning model. However, the input image of a deep learning model has a fixed size, while images to be detected vary in size, so the image to be detected needs to be preprocessed and converted into an image meeting the size requirement of the deep learning model.
The current preprocessing method in target detection scales the image to be detected by interpolation according to the ratio between the size of the image to be detected and the size of the input image of the deep learning model. However, when the image is scaled, this preprocessing method often loses part of the image features and deforms others, which affects the integrity and accuracy of the image feature data extracted by the deep learning model and reduces the accuracy of target detection.
Disclosure of Invention
The embodiment of the application provides a target detection method, device, equipment and storage medium, which solve the prior-art problem of low target detection accuracy caused by loss and deformation of features of the image to be detected, ensure the integrity and accuracy of the image feature data extracted by the deep learning model, and improve the accuracy of target detection.
In a first aspect, an embodiment of the present application provides a target detection method, including:
performing zero-pixel expansion on an image to be detected to expand the image to be detected into a square image;
inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein a convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining a target object in the image to be detected according to the detection result.
Further, the inputting the square image into a pre-trained resizing model to obtain a target image output by the resizing model includes:
inputting the square image into a first convolution network to obtain a first feature image output by the first convolution network, wherein the first convolution network comprises two convolution layers;
inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and one downsampling layer;
and inputting the square image into a first downsampling layer, and adding the output of the first downsampling layer and the second feature image to obtain the target image.
Further, the inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network includes:
inputting the first feature image into a second downsampling layer to obtain a third feature image output by the second downsampling layer;
inputting the third feature image into a first deformable convolution layer, and adding the input and the output of the first deformable convolution layer to obtain a fourth feature image;
and inputting the fourth feature image into a second deformable convolution layer to obtain a second feature image output by the second deformable convolution layer.
Further, the inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result includes:
inputting the target image into the target detection model to obtain a heat map output by the target detection model, and determining key points in the heat map;
fitting a Gaussian function according to the key points and the corresponding adjacent regions of the key points in the heat map, and determining the positions of the key points in the heat map according to the Gaussian function;
and mapping the key point positions to the image to be detected, determining target key points corresponding to the key point positions in the image to be detected, and determining a target object in the image to be detected according to the target key points.
Further, the target detection model comprises a backbone network and a deconvolution network;
correspondingly, the inputting the target image into the target detection model to obtain the heat map output by the target detection model includes:
inputting the target image into the backbone network to obtain feature data output by the backbone network, wherein the backbone network comprises one convolution layer and three bottleneck layers;
and inputting the feature data into the deconvolution network to obtain a heat map output by the deconvolution network, wherein the deconvolution network comprises one upsampling layer and three convolution layers.
Further, the fitting a gaussian function according to the keypoints and the corresponding neighboring regions of the keypoints in the heatmap, and determining the positions of the keypoints in the heatmap according to the gaussian function includes:
determining the pixel coordinates and the score values of the adjacent regions of the key points according to the pixel coordinates of the key points;
performing Gaussian surface fitting according to the score value and the pixel coordinates of the key point and the pixel coordinates and score values of the adjacent area to obtain the Gaussian function;
and determining the coordinates of the central point of the Gaussian function as the positions of the key points.
Further, the target detection method further includes:
acquiring a plurality of training sample images and the correspondingly marked key point coordinates, and expanding the training sample images into square sample images;
inputting the square sample image into an initial size adjustment model to obtain a target sample image output by the initial size adjustment model, and determining the mapping coordinates of the key point coordinates in the target sample image;
converting the target sample image into a corresponding Gaussian heatmap label according to a preset Gaussian function and the mapping coordinate;
and inputting the target sample image into an initial target detection model, and adjusting parameters of the size adjustment model and the target detection model according to the Gaussian heat map output by the initial target detection model and the Gaussian heat map labels.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
an expansion module configured to perform zero-pixel expansion on an image to be detected so as to expand it into a square image;
the preprocessing module is configured to input the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein a convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
and the target detection module is configured to input the target image into a pre-trained target detection model, obtain a detection result output by the target detection model, and determine a target object in the image to be detected according to the detection result.
Further, the preprocessing module comprises: a first feature extraction module configured to input the square image into a first convolution network to obtain a first feature image output by the first convolution network, where the first convolution network includes two convolution layers; a second feature extraction module configured to input the first feature image into a second convolution network, resulting in a second feature image output by the second convolution network, where the second convolution network includes two deformable convolution layers and one down-sampling layer; and the first residual module is configured to input the square image into a first downsampling layer and add the output of the first downsampling layer and the second characteristic image to obtain the target image.
Further, the second feature extraction module includes: a down-sampling unit configured to input the first feature image into a second down-sampling layer, resulting in a third feature image output by the second down-sampling layer; a second residual unit configured to input the third feature image into a first deformable convolution layer and add an input and an output of the first deformable convolution layer to obtain a fourth feature image; and the feature extraction unit is configured to input the fourth feature image into a second deformable convolution layer to obtain a second feature image output by the second deformable convolution layer.
Further, the object detection module comprises: the heat map acquisition unit is configured to input the target image into the target detection model, obtain a heat map output by the target detection model, and determine key points in the heat map; a key point position determining unit configured to fit a gaussian function according to the key points and corresponding adjacent regions of the key points in the heat map, and determine key point positions in the heat map according to the gaussian function; the mapping unit is configured to map the key point positions to the image to be detected, determine target key points corresponding to the key point positions in the image to be detected, and determine a target object in the image to be detected according to the target key points.
Further, the target detection model comprises a backbone network and a deconvolution network; correspondingly, the heat map acquisition unit comprises: a backbone network subunit, configured to input the target image into the backbone network to obtain feature data output by the backbone network, where the backbone network includes a convolution layer and three bottleneck layers; and the deconvolution subunit is configured to input the feature data into the deconvolution network to obtain a heatmap output by the deconvolution network, and the deconvolution network comprises an upsampling layer and three convolution layers.
Further, the key point position determination unit includes: an adjacent region subunit configured to determine the pixel coordinates and score values of the adjacent region of the key point according to the pixel coordinates of the key point; a Gaussian function subunit configured to perform Gaussian surface fitting according to the score value and pixel coordinates of the key point and the pixel coordinates and score values of the adjacent area to obtain a Gaussian function; and a position determination subunit configured to determine the center point coordinates of the Gaussian function as the key point position.
Further, the target detection device includes: a sample acquisition module configured to acquire a plurality of training sample images and the correspondingly marked key point coordinates, and expand the training sample images into square sample images; a sample preprocessing module configured to input the square sample image into an initial size adjustment model, obtain a target sample image output by the initial size adjustment model, and determine mapping coordinates of the key point coordinates in the target sample image; a sample heat map acquisition module configured to convert the target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates; and a model training module configured to input the target sample image into an initial target detection model, and adjust the parameters of the size adjustment model and the target detection model according to the Gaussian heat map output by the initial target detection model and the Gaussian heat map label.
In a third aspect, an embodiment of the present application provides an object detection apparatus, including:
a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the target detection method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the target detection method according to the first aspect.
The method comprises the steps of performing zero-pixel expansion on an image to be detected to expand it into a square image; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein the convolution network in the size adjustment model is configured with deformable convolution layers and downsampling layers; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result. Through these technical means, the size adjustment model compresses the image to be detected to the target size through the downsampling layer, and adjusts the convolution positions through the offsets in the deformable convolution layers so as to fit the shape of the target object in the image to be detected and extract more comprehensive features. This avoids the distortion and deformation of image features caused by size compression of the image to be detected, improves the integrity and accuracy of the image features in the input image of the target detection model, and improves the accuracy of target detection.
Drawings
FIG. 1 is a flow chart of a method for target detection provided by an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an image to be detected expanded into a square image according to an embodiment of the present application;
fig. 3 is a schematic network structure diagram of a resizing model provided in an embodiment of the present application;
FIG. 4 is a flowchart of resizing a square image through a resizing model provided by an embodiment of the present application;
FIG. 5 is a flowchart of extracting a second feature image through a second convolution network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a square image provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a conventional target image provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a target image provided by an embodiment of the present application;
FIG. 9 is a flowchart for determining a target object in an image to be detected according to an embodiment of the present application;
FIG. 10 is a flow diagram of an object detection model extraction heat map provided by an embodiment of the present application;
FIG. 11 is a flow chart for determining the location of a keypoint based on Gaussian fitting as provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of an output heat map of an object detection model provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of a neighboring area of a keypoint provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of an image including a palm provided by an embodiment of the present application;
FIG. 15 is a flow chart of model training provided by embodiments of the present application;
FIG. 16 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, specific embodiments of the present application are described in detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein merely illustrate the application and do not limit it. It should further be noted that, for convenience of description, only the parts related to the present application, rather than the entire structure, are shown in the drawings. Before the exemplary embodiments are discussed in more detail, it should be noted that some of them are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, and the like.
The terms 'first', 'second' and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It should be appreciated that data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be practiced in orders other than those illustrated or described herein. Moreover, the terms 'first', 'second' and the like are generally used generically and do not limit the number of objects; for example, a first object may be one or more than one. In addition, 'and/or' in the specification and claims denotes at least one of the connected objects, and the character '/' generally indicates an 'or' relationship between the preceding and following objects.
The application provides a target detection method, device, equipment and storage medium, which aim to expand an image to be detected into a square image by performing zero-pixel expansion on it; input the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, the convolution network in the size adjustment model being configured with deformable convolution layers and downsampling layers; and input the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determine the target object in the image to be detected according to the detection result. By contrast, the traditional preprocessing method in target detection scales the image to be detected by interpolation according to the ratio between the size of the image to be detected and the size of the input image of the deep learning model. When the image is scaled, that preprocessing method often loses part of the image features and deforms others, which affects the integrity and accuracy of the image feature data and reduces the accuracy of target detection. Based on this, the target detection method, device, equipment and storage medium provided by the embodiments of the application solve the problem of low accuracy in existing target detection.
Fig. 1 is a flowchart of a target detection method according to an embodiment of the present application. The target detection method provided in this embodiment may be executed by a target detection device, where the target detection device may be implemented in a software and/or hardware manner, and the target detection device may be formed by two or more physical entities or may be formed by one physical entity.
The following description will be given taking an object detection apparatus as an example of a subject that performs the object detection method. Referring to fig. 1, the object detection method includes:
S110, performing zero-pixel expansion on the image to be detected so as to expand the image to be detected into a square image.
The image to be detected is the original image on which target detection needs to be performed; the original image contains the target object to be detected, and this embodiment takes a palm as an example of the target object. The image to be detected may be a grayscale image directly acquired by an infrared camera, or a grayscale image obtained by graying a color image acquired by an RGB camera. Further, since original images collected by different cameras or camera modes differ in size while the input image of the target detection model has a fixed target size, the original image may be scaled to the target size before the target object in it is detected by the target detection model.
In one embodiment, the target size of the input image of the target detection model may be set to 128 × 128, that is, the input image is a square image. If a non-square image to be detected were directly scaled into a square image, features in the image would deform and the accuracy of target detection would suffer. Therefore, before the image to be detected is scaled to the target size, it is placed inside a square image whose side length is the larger of its width and height, and the remaining area of the square image outside the image to be detected is filled with zero pixels, yielding the square image corresponding to the image to be detected. Fig. 2 is a schematic diagram of expanding the image to be detected into a square image according to an embodiment of the present application. As shown in Fig. 2, if the height H of the image to be detected is greater than the width W, a zero-pixel region of size [H, H - W] is appended on the right side of the image to be detected to obtain the corresponding square image; if the height H is smaller than the width W, a zero-pixel region of size [W - H, W] is appended at the lower end. In Fig. 2, X denotes a size value. It should be noted that the image to be detected can be placed at any position in the square image, and is not limited to the left side or the upper end.
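As a concrete illustration of this padding step, the following NumPy sketch performs the zero-pixel expansion for a grayscale image, placing the original at the top-left of the square as in Fig. 2; the function name and the top-left placement are illustrative choices, not mandated by the text.

```python
import numpy as np

def pad_to_square(img):
    """Zero-pixel expansion: embed an (H, W) grayscale image in an (S, S)
    square with S = max(H, W); the remaining area keeps zero pixels."""
    h, w = img.shape
    s = max(h, w)
    square = np.zeros((s, s), dtype=img.dtype)
    square[:h, :w] = img  # original at the top-left; zero padding right/bottom
    return square
```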
S120, inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer.
The size adjustment model is a convolution network used to adjust the size of the square image of the image to be detected to the target size. The target image is the input image of the target detection model, and the target size is the size of that input image. The convolution network is configured with deformable convolution layers and downsampling layers; the downsampling layers reduce the image resolution to bring the image down to the target size, while the deformable convolution layers preserve shape features in the image as it is compressed. Illustratively, a deformable convolution layer is a convolution layer whose convolution positions can be deformed: it learns shapes in the image and adjusts the convolution positions through offsets so as to extract the shape features in the image. Shape features are important to the target detection model when detecting the target object; extracting them through deformable convolution avoids the deformation and distortion that image compression would otherwise inflict on them, which would affect the accuracy of target detection.
In an embodiment, fig. 3 is a schematic network structure diagram of a resizing model provided in the embodiment of the present application. As shown in fig. 3, the resizing model includes a first convolution network, a second convolution network and a first downsampling layer, the input of the resizing model is the input of the first convolution network and the first downsampling layer, the output of the first convolution network is the input of the second convolution network, and the output of the second convolution network and the output of the first downsampling layer are added to be the output of the resizing model. In this embodiment, fig. 4 is a flowchart for resizing a square image through a resizing model according to an embodiment of the present application. As shown in fig. 4, the step of adjusting the size of the square image by the resizing model specifically includes S1201 to S1203:
S1201, inputting the square image into a first convolution network to obtain a first feature image output by the first convolution network, wherein the first convolution network comprises two convolution layers.
Referring to Fig. 3, the first convolution network comprises a first two-dimensional convolution layer and a second two-dimensional convolution layer connected in series. After the square image is input into the first convolution network, depth features are extracted from it successively by the first and second two-dimensional convolution layers, giving the first feature image. In one embodiment, the first two-dimensional convolution layer has 16 channels, a kernel size of 7, a stride of 1 and a padding of 3; the second two-dimensional convolution layer has 16 channels, a kernel size of 3, a stride of 1 and a padding of 1. The first feature image output by the first convolution network has the same size as the square image.
S1202, inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and one downsampling layer.
Referring to Fig. 3, the second convolution network comprises a first deformable convolution layer, a second deformable convolution layer and a second downsampling layer. The input of the second convolution network is the input of the second downsampling layer; the output of the second downsampling layer is the input of the first deformable convolution layer; the output and input of the first deformable convolution layer are added to form the input of the second deformable convolution layer; and the output of the second deformable convolution layer is the output of the second convolution network. In one embodiment, the second downsampling layer resizes the input of the second convolution network to the target size using bilinear interpolation. The first deformable convolution layer has 16 channels, a kernel size of 3, a stride of 1 and a padding of 1; the second deformable convolution layer has 1 channel, a kernel size of 3, a stride of 1 and a padding of 1. It should be noted that in the second convolution network only the second downsampling layer changes the size of its input; the deformable convolution layers do not, i.e. their input and output sizes remain unchanged.
In this embodiment, fig. 5 is a flowchart for extracting a second feature image through a second convolution network according to an embodiment of the present application. As shown in fig. 5, the step of extracting the second feature image through the second convolutional network specifically includes S12021-S12023:
S12021, inputting the first feature image into the second downsampling layer to obtain a third feature image output by the second downsampling layer.
Illustratively, the second downsampling layer compresses the first feature image extracted by the first convolution network to the target size, resulting in the third feature image.
S12022, inputting the third feature image into the first deformable convolution layer, and adding the input and the output of the first deformable convolution layer to obtain a fourth feature image.
Illustratively, the first deformable convolution layer is composed of a conventional two-dimensional convolution layer and an offset convolution layer. The kernel of the conventional two-dimensional convolution layer convolves the pixel values at the corresponding positions of the input image, while the offset convolution layer learns the shapes of features in the input image and adjusts the convolution positions of the conventional layer through offsets, so that the conventional kernel extracts the shape features in the input image and the deformation and distortion caused by size compression are avoided.
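A minimal sketch of this conventional-plus-offset pairing, built on torchvision's DeformConv2d; the class name and wiring are assumptions consistent with the description, not the patent's actual implementation.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvLayer(nn.Module):
    """Deformable convolution: an offset-predicting conv adjusts the sampling
    positions of a conventional conv so the kernel can follow feature shapes."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        # Two offsets (dx, dy) are predicted per kernel position, hence 2*k*k.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, stride, padding)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, stride, padding)

    def forward(self, x):
        offset = self.offset_conv(x)        # learned per-position offsets
        return self.deform_conv(x, offset)  # convolve at the shifted positions
```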
In this embodiment, the already-resized third feature image is input into the first deformable convolution layer, which extracts the shape features in it. To avoid vanishing or exploding gradients as the network depth grows, a residual structure is introduced: the input and output features of the first deformable convolution layer are added, so that the input features are preserved.
S12023, inputting the fourth feature image into the second deformable convolution layer to obtain the second feature image output by the second deformable convolution layer.
Similarly, the second deformable convolution layer is also composed of a conventional two-dimensional convolution layer and an offset convolution layer. The shape features in the fourth feature image are extracted by the second deformable convolution layer, giving the second feature image it outputs.
S1203, inputting the square image into a first downsampling layer, and adding the output of the first downsampling layer and the second characteristic image to obtain a target image.
And the first lower sampling layer adjusts the size of the square image to be the target size by adopting bilinear interpolation. The second characteristic image is a depth characteristic image obtained after depth characteristics are extracted from the square image through the first convolution network and the second convolution network, and a residual error network structure is introduced to avoid gradient dispersion or gradient explosion of the second characteristic image due to excessive depth increase. And the output image of the first down-sampling layer reserves the characteristics of the image to be detected, and the output of the first down-sampling layer is added with the output of the second deformable convolution layer to obtain a target image reserving the characteristics of the image to be detected.
In one embodiment, the convolution layers and deformable convolution layers in the first and second convolution networks are configured with activation functions, which add nonlinearity to the convolution networks so that effective depth features can be extracted from the image. Illustratively, the activation function may be a ReLU function.
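Putting the pieces together, a sketch of the whole resizing model of Fig. 3 for a single-channel (grayscale) input follows. It reuses the DeformableConvLayer sketch above; the layer hyperparameters follow the text, while the exact activation placement around the residual adds and the 128 × 128 target size are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResizeModel(nn.Module):
    """First conv network -> second conv network (with residual adds), plus a
    bilinear shortcut of the square image through the first downsampling layer."""
    def __init__(self, target_size=128):
        super().__init__()
        self.size = (target_size, target_size)
        # First convolution network: 16 ch / k7 / s1 / p3, then 16 ch / k3 / s1 / p1.
        self.conv1 = nn.Conv2d(1, 16, 7, 1, 3)
        self.conv2 = nn.Conv2d(16, 16, 3, 1, 1)
        # Second convolution network: two deformable layers (16 ch, then 1 ch).
        self.deform1 = DeformableConvLayer(16, 16, 3, 1, 1)
        self.deform2 = DeformableConvLayer(16, 1, 3, 1, 1)

    def forward(self, x):                                  # x: (N, 1, S, S)
        f1 = F.relu(self.conv2(F.relu(self.conv1(x))))     # first feature image
        f3 = F.interpolate(f1, self.size, mode='bilinear',
                           align_corners=False)            # second downsampling layer
        f4 = f3 + F.relu(self.deform1(f3))                 # residual add (S12022)
        f2 = self.deform2(f4)                              # second feature image
        shortcut = F.interpolate(x, self.size, mode='bilinear',
                                 align_corners=False)      # first downsampling layer
        return f2 + shortcut                               # target image
```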
In an embodiment, fig. 6 is a schematic diagram of a square image provided in the embodiment of the present application. Fig. 7 is a schematic diagram of a conventional target image provided in an embodiment of the present application. Fig. 8 is a schematic diagram of a target image provided in an embodiment of the present application. As shown in fig. 6 to 8, the size of the square image is 640 × 640, the size of the conventional target image is 128 × 128, and the size of the target image is 128 × 128, where the conventional target image is an input image of the target detection model obtained by compressing the square image by bilinear interpolation in the conventional preprocessing method. Referring to fig. 7, the shape of the target object in the conventional target image obtained by bilinear interpolation may be deformed and distorted, which affects the accuracy of target detection. Referring to fig. 8, the target image in fig. 8 is an input image of the target detection model obtained by compressing a square image by the resizing model. The size adjustment model effectively keeps the shape characteristics of the target object while compressing the image size, and the accuracy of target detection is ensured.
S130, inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result.
In an embodiment, the target detection model is a network model for detecting key points of the target object, the target image is input into the target detection model to obtain a heat map output by the target detection model, the position of the key point is determined according to the heat map, and the target object in the original image is determined based on the position of the key point. In this embodiment, fig. 9 is a flowchart for determining a target object in an original image according to an embodiment of the present application. As shown in fig. 9, the step of determining the target object in the original image specifically includes S1301-S1303:
S1301, inputting the target image into the target detection model to obtain a heat map output by the target detection model, and determining key points in the heat map.
The target detection model is a deep learning model used for extracting a heat map of a target image, the target image is input into the target detection model, and the heat map of the target image is output. And calculating the scores of all pixel points in the output heat map, and determining the pixel point with the highest score as a key point in the output heat map.
In one embodiment, the target detection model is composed of a backbone network and a deconvolution network, the backbone network is used for extracting depth features of a target image, the depth features are used as input of the deconvolution network, and the deconvolution network outputs a heat map according to the depth features. In this embodiment, fig. 10 is a flowchart of extracting a heat map by the target detection model provided in this embodiment of the present application. As shown in fig. 10, the step of extracting the heat map by the target detection model specifically includes steps S13011-13012:
S13011, inputting the target image into the backbone network to obtain the feature data output by the backbone network, wherein the backbone network comprises one convolution layer and three bottleneck layers.
The backbone network in this embodiment adopts the MobileNetV2 model structure, comprising one convolution layer and three bottleneck layers, where the convolution layer is a two-dimensional convolution layer. The backbone network extracts depth feature data from the target image through the convolution layer and the bottleneck layers. In addition, compared with the traditional MobileNetV2 structure, the backbone network in this embodiment removes the repeated upsampling and cropping layers, which greatly reduces inference time without sacrificing model accuracy.
S13012, inputting the feature data into the deconvolution network to obtain a heat map output by the deconvolution network, wherein the deconvolution network comprises one upsampling layer and three convolution layers.
Illustratively, the number of heat maps output by the deconvolution network equals the number of target key points. For example, if the key points are the connection point of the index finger and the middle finger, the connection point of the middle finger and the ring finger, and the connection point of the ring finger and the little finger, the deconvolution network correspondingly outputs three heat maps, each containing one connection point.
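A structural sketch of this detection model is given below. The text fixes only the layer counts (one convolution layer plus three bottleneck layers in the backbone; one upsampling layer plus three convolution layers in the deconvolution head, with one output channel per key point), so the channel widths, strides and expansion factor here are assumptions.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """MobileNetV2-style inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

class DetectionModel(nn.Module):
    """Backbone (one conv + three bottlenecks) and deconvolution head
    (one upsampling layer + three convs); one heat map per key point."""
    def __init__(self, num_keypoints=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(True),
            Bottleneck(16, 32, stride=2),
            Bottleneck(32, 32),
            Bottleneck(32, 64, stride=2))
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 64, 3, 1, 1), nn.ReLU(True),
            nn.Conv2d(64, 32, 3, 1, 1), nn.ReLU(True),
            nn.Conv2d(32, num_keypoints, 1))

    def forward(self, x):                     # x: (N, 1, 128, 128) target image
        return self.head(self.backbone(x))    # (N, num_keypoints, 32, 32) heat maps
```

With the assumed strides, a 128 × 128 target image yields 32 × 32 heat maps, which is consistent with key point coordinates such as (14, 10) in the example below.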
S1302, fitting a Gaussian function according to the key points and the adjacent areas of the key points corresponding to the heat maps, and determining the positions of the key points in the heat maps according to the Gaussian function.
Here the key point position refers to the floating-point position coordinates of the key point in the heat map. Exemplarily, the highest-scoring key point in the heat map is a pixel point whose pixel coordinates are integer values; mapping such integer coordinates into the image to be detected determines a pixel region rather than a single pixel point, so the key point detection incurs an offset error and the detection result is not accurate enough. Therefore, this embodiment proposes to determine the floating-point position coordinates of the key point by Gaussian fitting of the distribution of score values of the key point and its adjacent area in the heat map. In this embodiment, Fig. 11 is a flowchart for determining the positions of the key points according to Gaussian fitting provided in an embodiment of the present application. As shown in Fig. 11, the step of determining the positions of the key points according to Gaussian fitting specifically includes steps S13021-S13023:
S13021, determining the pixel coordinates and the score values of the adjacent areas of the key points according to the pixel coordinates of the key points.
Fig. 12 is a schematic diagram of an output heat map of the target detection model provided by an embodiment of the present application. Fig. 12 shows one of the heat maps output by the target detection model, where the pixel point with the highest brightness is the key point, with pixel coordinates (14, 10). The adjacent area of the key point is the area formed by the pixel points whose distance from the key point is smaller than a certain threshold. Fig. 13 is a schematic diagram of the adjacent area of a key point provided in an embodiment of the present application. As shown in Fig. 13, the adjacent area may be a rectangular region centered on and surrounding the key point, so that the adjacent area and the key point form a 3 × 3 rectangular area. In this embodiment, when the pixel coordinates of the key point are (14, 10), the pixel coordinates of the corresponding adjacent region are (13, 9), (13, 10), (13, 11), (14, 9), (14, 11), (15, 9), (15, 10) and (15, 11). The score value of the pixel point at each of these pixel coordinates is obtained from the corresponding heat map.
S13022, performing Gaussian surface fitting according to the score values and the pixel coordinates of the key points and the pixel coordinates and the score values of the adjacent areas to obtain a Gaussian function.
This embodiment is described taking the case where the adjacent area and the key point form a 3 × 3 rectangular area. From the above, the expression of the Gaussian function is

f(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))

where the standard deviation σ of the Gaussian distribution is the same as that used when training the model, f(x, y) is the score value at pixel coordinates (x, y), and (x0, y0) are the coordinates of the center point of the Gaussian function, which are the unknowns. Substituting the score value and pixel coordinates of the key point into the Gaussian function yields an equation in x0 and y0, and substituting the pixel coordinates and score values of the adjacent area yields further equations in x0 and y0. Solving the combined system of equations for x0 and y0 gives the Gaussian function.
And S13023, determining the coordinates of the central point of the Gaussian function as the position of the key point.
For example, since the function value at the center point of the gaussian function is the maximum value of the gaussian function, the score value at the center point of the gaussian function fitted based on the key point and the neighboring area is the maximum value in the entire rectangular area, and the coordinates (x0, y0) of the center point of the gaussian function are used as the coordinates of the floating point value of the key point.
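Because ln f(x, y) is a paraboloid, the system above has a convenient closed form using opposite neighbors of the peak. The patent does not spell out the solver, so the following log-space refinement is one standard way to solve it, shown as a hedged sketch; it assumes the peak lies at least one pixel from the heat map border and that scores are positive.

```python
import numpy as np

def subpixel_keypoint(heatmap, sigma):
    """Refine the integer argmax of one heat map to floating-point (x0, y0) by
    fitting f(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2 sigma^2)). In log
    space, ln f(x+1, y) - ln f(x-1, y) = -2 (x - x0) / sigma^2, which isolates
    x0 directly (and likewise y0)."""
    hm = np.clip(heatmap, 1e-10, None)   # scores must be positive for the log
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    dx = np.log(hm[y, x + 1]) - np.log(hm[y, x - 1])
    dy = np.log(hm[y + 1, x]) - np.log(hm[y - 1, x])
    return x + 0.5 * sigma ** 2 * dx, y + 0.5 * sigma ** 2 * dy
```

Expanding (x ± 1 - x0)^2 and subtracting cancels the quadratic terms, so any constant amplitude in front of the exponential drops out as well.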
And S1303, mapping the key point positions to an image to be detected, determining target key points corresponding to the key point positions in the image to be detected, and determining a target object in the image to be detected according to the target key points.
Illustratively, after the key point position in the heat map is determined by Gaussian fitting, the corresponding pixel point in the original image is determined according to the key point position and taken as the target key point. In one embodiment, the pixel coordinates to which the key point position maps in the square image are determined, according to the size ratio of the heat map to the square image, as the pixel coordinates of the target key point.
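Tying the steps together, a minimal end-to-end inference sketch is shown below; it assumes trained instances of ResizeModel and DetectionModel, reuses pad_to_square and subpixel_keypoint from the sketches above, and takes sigma as the training-time standard deviation.

```python
import torch

def detect_keypoints(image_gray, resize_model, det_model, sigma):
    """S110 -> S120 -> S130: pad to a square, resize, predict heat maps, refine
    each peak, and map it back to the square image by the size ratio."""
    square = pad_to_square(image_gray)                    # S110
    x = torch.from_numpy(square).float()[None, None]      # (1, 1, S, S)
    with torch.no_grad():
        heatmaps = det_model(resize_model(x))[0].numpy()  # S120 + S130
    scale = square.shape[0] / heatmaps.shape[-1]          # heat map -> square ratio
    return [(px * scale, py * scale)
            for px, py in (subpixel_keypoint(hm, sigma) for hm in heatmaps)]
```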
Further, assume the palm center region is the target object in the original image containing the palm; a certain positional relationship exists between it and the connection points of the palm. After the three heat maps output by the target detection model are obtained, the floating-point coordinate position of the connection point in each heat map is determined and mapped into the original image, determining the three connection points in the palm image. Fig. 14 is a schematic diagram of an image including a palm according to an embodiment of the present application. As shown in Fig. 14, points A, B and C are the three target key points in the palm image, where point A is the connection point of the little finger and the ring finger, point B is the connection point of the middle finger and the ring finger, and point C is the connection point of the middle finger and the index finger. p1, p2, p3 and p4 are the four vertices of the region of interest; a certain positional relationship exists between p1, p2, p3, p4 and points A, B and C, so the pixel coordinates of p1, p2, p3 and p4 can be uniquely determined from the pixel coordinates of points A, B and C, and the target object is then cropped from the original image according to p1, p2, p3 and p4.
In one embodiment, fig. 15 is a flowchart of model training provided by the embodiments of the present application. As shown in fig. 15, the step of model training specifically includes S210-S240:
S210, obtaining a plurality of training sample images and the correspondingly marked key point coordinates, and expanding the training sample images into square sample images.
Illustratively, a training sample image is obtained by graying an image acquired by a camera. The training sample image is expanded into a square sample image according to its length and width, and the pixel coordinates in the square sample image corresponding to the key points marked in the training sample image are determined.
S220, inputting the square sample image into the initial size adjustment model to obtain a target sample image output by the initial size adjustment model, and determining the mapping coordinates of the key point coordinates in the target sample image.
Wherein the initial resizing model is an untrained resizing model. Illustratively, a square sample image is input into the initial resizing model, resulting in a target sample image output by the initial resizing model. And scaling the coordinate of the key point pixel in the square sample image in an equal proportion according to the size proportion of the square sample image and the target sample image to obtain the coordinate of the key point pixel in the target sample image.
And S230, converting the target sample image into a corresponding Gaussian heatmap label according to a preset Gaussian function and the mapping coordinate.
Illustratively, the expression of the Gaussian function is:

f(x, y) = exp(-((x - μx)^2 + (y - μy)^2) / (2σ^2))

where (μx, μy) are the key point pixel coordinates in the target sample image, σ is the standard deviation of the Gaussian distribution and can be set according to actual requirements, and f(x, y) is the score value of the Gaussian heat map at the pixel point with pixel coordinates (x, y). Substituting the pixel coordinates of each pixel point of the target sample image together with the key point coordinates into the Gaussian function gives the score value of each pixel point in the Gaussian heat map, finally yielding the Gaussian heat map label.
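A direct sketch of this label construction, evaluating f(x, y) on the full pixel grid for one key point (the function name and grid convention are illustrative):

```python
import numpy as np

def gaussian_heatmap_label(height, width, mu_x, mu_y, sigma):
    """Gaussian heat map label for one key point at mapped coordinates
    (mu_x, mu_y); the score peaks at 1 at the key point and decays with sigma."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2.0 * sigma ** 2))
```

The label grid should match the resolution of the heat maps the detection model outputs (32 × 32 under the assumptions in the DetectionModel sketch).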
S240, inputting the target sample image into the initial target detection model, and adjusting parameters of the size adjustment model and the target detection model according to the Gaussian heat map and the Gaussian heat map label output by the initial target detection model.
Wherein the initial target detection model is an untrained target detection model. In the training stage, the heat map output by the initial target detection model and a Gaussian heat map label corresponding to the target sample image are substituted into a loss function, and model parameters of the size adjustment model and the target detection model are adjusted according to a loss result output by the loss function until the model converges or the iteration number reaches the upper limit.
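A training-step sketch under stated assumptions follows: the text names neither the loss nor the optimizer, so MSE between predicted and label heat maps and Adam are illustrative choices; ResizeModel and DetectionModel are the sketches above. Forwarding the square sample through both models lets the gradients reach the resizing model, so both sets of parameters are adjusted as described.

```python
import torch
import torch.nn as nn

resize_model, det_model = ResizeModel(), DetectionModel()
params = list(resize_model.parameters()) + list(det_model.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # assumed optimizer and rate
criterion = nn.MSELoss()                        # assumed heat map loss

def train_step(square_samples, heatmap_labels):
    """One S240 iteration: predict Gaussian heat maps for a batch of square
    sample images and update both models against the Gaussian heat map labels."""
    pred = det_model(resize_model(square_samples))
    loss = criterion(pred, heatmap_labels)
    optimizer.zero_grad()
    loss.backward()      # gradients flow through the resizing model as well
    optimizer.step()
    return loss.item()
```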
In conclusion, the size of the image to be detected is compressed to the target size through the down-sampling layer in the size adjustment model, the convolution position is adjusted through the offset in the deformable convolution layer so as to fit the shape of the target object in the image to be detected, more comprehensive characteristics are extracted, the image characteristic distortion and deformation caused by size compression of the image to be detected are avoided, the integrity and the accuracy of the image characteristics in the input image of the target detection model are improved, and the accuracy of target detection is improved. In addition, a heat map of the target image is obtained through the target detection model, the key points are determined, the corresponding Gaussian function is obtained through Gaussian fitting of the key points and Gaussian curved surfaces of adjacent regions, the coordinates of the center point of the Gaussian function are used as the coordinates of the floating point value of the key points, the phenomenon that the key points are mapped to the image to be detected to generate offset errors when the key points are integer values is avoided, and the accuracy of target key point detection is improved. The target object in the image to be detected is rapidly extracted through the position relation between the target key point and the target object, the extraction efficiency and accuracy of the target object are improved, and the accuracy of target detection is further improved.
On the basis of the above embodiments, fig. 16 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. Referring to fig. 16, the target detection apparatus provided in this embodiment specifically includes: an expansion module 31, a preprocessing module 32 and an object detection module 33.
The expansion module is configured to perform zero-pixel expansion on an image to be detected so as to expand it into a square image;
the preprocessing module is configured to input the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; the convolution network in the size adjustment model is configured with a deformable convolution layer and a down-sampling layer;
and the target detection module is configured to input the target image into a pre-trained target detection model, obtain a detection result output by the target detection model, and determine the target object in the image to be detected according to the detection result.
On the basis of the above embodiment, the preprocessing module includes: a first feature extraction module configured to input the square image into a first convolution network to obtain a first feature image output by the first convolution network, the first convolution network comprising two convolution layers; a second feature extraction module configured to input the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, the second convolution network comprising two deformable convolution layers and one downsampling layer; and a first residual module configured to input the square image into the first downsampling layer, and add the output of the first downsampling layer and the second feature image to obtain the target image.
On the basis of the above embodiment, the second feature extraction module includes: a down-sampling unit configured to input the first feature image into a second down-sampling layer, resulting in a third feature image output by the second down-sampling layer; a second residual unit configured to input the third feature image into the first deformable convolution layer and add the input and output of the first deformable convolution layer to obtain a fourth feature image; and the feature extraction unit is configured to input the fourth feature image into the second deformable convolution layer to obtain a second feature image output by the second deformable convolution layer.
On the basis of the above embodiment, the target detection module includes: the heat map acquisition unit is configured to input a target image into the target detection model, obtain a heat map output by the target detection model, and determine key points in the heat map; the key point position determining unit is configured to fit a Gaussian function according to the key points and the corresponding adjacent areas of the key points in the heat map, and determine the key point positions in the heat map according to the Gaussian function; and the mapping unit is configured to map the positions of the key points to the image to be detected, determine target key points corresponding to the positions of the key points in the image to be detected, and determine a target object in the image to be detected according to the target key points.
On the basis of the above embodiment, the target detection model comprises a backbone network and a deconvolution network; accordingly, the heat map acquisition unit includes: a backbone network subunit configured to input the target image into the backbone network to obtain feature data output by the backbone network, the backbone network comprising one convolution layer and three bottleneck layers; and a deconvolution subunit configured to input the feature data into the deconvolution network to obtain the heat map output by the deconvolution network, the deconvolution network comprising one up-sampling layer and three convolution layers.
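A hedged sketch of this detection model: the bottleneck layer is taken to be a ResNet-style bottleneck and the up-sampling layer a bilinear upsample, neither of which the embodiment specifies; channel widths are likewise illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """ResNet-style bottleneck (1x1 -> 3x3 -> 1x1 with a residual);
    the exact bottleneck variant is an assumption."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 4
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(x + self.body(x))

class TargetDetectionModel(nn.Module):
    def __init__(self, in_ch: int = 3, width: int = 64, num_keypoints: int = 1):
        super().__init__()
        # backbone network: one convolution layer and three bottleneck layers
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
            Bottleneck(width), Bottleneck(width), Bottleneck(width),
        )
        # deconvolution network: one up-sampling layer and three convolution
        # layers, ending in one heat-map channel per key point
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, num_keypoints, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))  # heat map
```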
On the basis of the above embodiment, the key point position determination unit includes: an adjacent region subunit configured to determine the pixel coordinates and score values of the adjacent region of a key point according to the pixel coordinates of the key point; a Gaussian function subunit configured to perform Gaussian surface fitting according to the pixel coordinates and score value of the key point and the pixel coordinates and score values of the adjacent region to obtain a Gaussian function; and a position determining subunit configured to determine the coordinates of the center point of the Gaussian function as the key point position.
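The sub-pixel refinement can be sketched as below. Taking the logarithm of the heat-map scores turns an axis-aligned Gaussian into a quadric, so a least-squares fit over the key point and its neighbours recovers the Gaussian's centre in closed form; the 3x3 neighbourhood and the axis-aligned form of the Gaussian are illustrative assumptions:

```python
import numpy as np

def subpixel_keypoint(heatmap: np.ndarray, kx: int, ky: int) -> tuple[float, float]:
    """Fit an axis-aligned 2-D Gaussian around the integer key point (kx, ky)
    and return the Gaussian centre as the floating-point key point position."""
    h, w = heatmap.shape
    xs, ys, zs = [], [], []
    for dy in (-1, 0, 1):                     # 3x3 adjacent region (assumed)
        for dx in (-1, 0, 1):
            x, y = kx + dx, ky + dy
            if 0 <= x < w and 0 <= y < h and heatmap[y, x] > 0:
                xs.append(float(x)); ys.append(float(y))
                zs.append(np.log(heatmap[y, x]))  # log-Gaussian is a quadric
    xs, ys, zs = np.asarray(xs), np.asarray(ys), np.asarray(zs)
    # ln s = a*x^2 + b*y^2 + c*x + d*y + e, solved by least squares
    A = np.stack([xs**2, ys**2, xs, ys, np.ones_like(xs)], axis=1)
    a, b, c, d, _ = np.linalg.lstsq(A, zs, rcond=None)[0]
    # the quadric's stationary point is the centre of the fitted Gaussian
    return -c / (2 * a), -d / (2 * b)
```

Because the centre is a stationary point of the fitted surface, the result is a floating-point position rather than an integer pixel index, which is exactly what the mapping unit needs.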
On the basis of the above embodiment, the target detection apparatus further includes: a sample acquisition module configured to acquire a plurality of training sample images and their correspondingly labeled key point coordinates, and to expand the training sample images into square sample images; a sample preprocessing module configured to input the square sample images into an initial size adjustment model to obtain target sample images output by the initial size adjustment model, and to determine the mapping coordinates of the key point coordinates in the target sample images; a sample heat map acquisition module configured to convert each target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates; and a model training module configured to input the target sample images into an initial target detection model, and to adjust the parameters of the size adjustment model and the target detection model according to the Gaussian heat maps output by the initial target detection model and the Gaussian heat map labels.
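A sketch of the label-construction step for training, assuming the "preset Gaussian function" is an isotropic Gaussian with a fixed, hand-chosen sigma centred on the mapped coordinates:

```python
import numpy as np

def gaussian_heatmap_label(h: int, w: int, cx: float, cy: float,
                           sigma: float = 2.0) -> np.ndarray:
    """Render the Gaussian heat-map label for one mapped key point (cx, cy);
    sigma is a training hyper-parameter assumed here."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

Training would then minimise a pixel-wise loss (for example, mean squared error) between the heat map output by the initial target detection model and this label, with gradients flowing through both models so that the size adjustment model and the target detection model are adjusted jointly.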
In summary, the target detection apparatus provided by the embodiment of the present application compresses the image to be detected to the target size through the down-sampling layers in the size adjustment model, and adjusts the convolution positions through the offsets in the deformable convolution layers so as to fit the shape of the target object in the image to be detected. It thereby extracts more comprehensive features and avoids the loss and deformation of image features that direct size compression would cause, improving the integrity and accuracy of the image features in the input to the target detection model and, in turn, the accuracy of target detection. In addition, the heat map of the target image is obtained through the target detection model and the key points are determined in it; a Gaussian function is obtained by Gaussian surface fitting over each key point and its adjacent region, and the coordinates of the center point of the Gaussian function are taken as the floating-point coordinates of the key point. This avoids the offset error that integer-valued key points would introduce when mapped back to the image to be detected, and improves the accuracy of key point detection. Finally, the target object in the image to be detected is quickly extracted through the positional relation between the target key points and the target object, which improves the extraction efficiency and accuracy of the target object and further improves the accuracy of target detection.
The target detection device provided by the embodiment of the application can be used for executing the target detection method provided by the embodiment, and has corresponding functions and beneficial effects.
An embodiment of the present application provides a target detection device. Referring to fig. 17, the target detection device includes: a processor 41, a memory 42, a communication device 43, an input device 44 and an output device 45. The numbers of processors and of memories in the target detection device may each be one or more. The processor, the memory, the communication device, the input device and the output device of the target detection device may be connected by a bus or in other ways.
The memory 42, as a computer-readable storage medium, may be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the target detection method described in any embodiment of the present application (for example, the expansion module 31, the preprocessing module 32 and the target detection module 33 in the target detection apparatus). The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application program(s) required for at least one function, and the data storage area may store data created according to the use of the device, and the like. Furthermore, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory may further include memories remotely disposed with respect to the processor, and these remote memories may be connected to the device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The communication device 43 is used for data transmission.
The processor 41 executes the software programs, instructions and modules stored in the memory, thereby carrying out the various functional applications and data processing of the device, that is, implementing the target detection method described above.
The input device 44 is operable to receive input numeric or character information and to generate key signal inputs relating to user settings and function controls of the device. The output device 45 may include a display device such as a display screen.
The target detection device provided above can be used to execute the target detection method provided in the above embodiments, and has corresponding functions and beneficial effects.
Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a target detection method, the method comprising: performing zero-pixel expansion on an image to be detected so as to expand the image to be detected into a square image; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein the convolution network in the size adjustment model is configured with a deformable convolution layer and a down-sampling layer; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory, magnetic media (e.g., a hard disk), or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network (such as the internet). The second computer system may provide program instructions to the first computer for execution. The term "storage medium" may include two or more storage media residing in different locations, e.g., in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided in the embodiments of the present application and containing computer-executable instructions is not limited to the method operations described above, and may also perform related operations in the target detection method provided in any embodiment of the present application.
The target detection apparatus, device, and storage medium provided in the above embodiments may perform the target detection method provided in any embodiment of the present application; for technical details not described in detail in the above embodiments, reference may be made to the target detection method provided in any embodiment of the present application.
The foregoing description is merely illustrative of the preferred embodiments of the present application and of the technical principles employed. The present application is not limited to the particular embodiments described herein; various obvious changes, rearrangements and substitutions will occur to those skilled in the art without departing from the scope of protection of the present application. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to the above embodiments and may include other equivalent embodiments without departing from the concept of the present application; its scope is determined by the scope of the appended claims.

Claims (10)

1. A method of target detection, comprising:
performing zero-pixel expansion on an image to be detected to expand the image to be detected into a square image;
inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein a convolutional network in the size adjustment model is configured with a deformable convolution layer and a down-sampling layer;
and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining a target object in the image to be detected according to the detection result.
2. The method of claim 1, wherein the inputting the square image into the pre-trained size adjustment model to obtain the target image output by the size adjustment model comprises:
inputting the square image into a first convolution network to obtain a first characteristic image output by the first convolution network, wherein the first convolution network comprises two convolution layers;
inputting the first characteristic image into a second convolution network to obtain a second characteristic image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and a downsampling layer;
and inputting the square image into a first downsampling layer, and adding the output of the first downsampling layer and the second characteristic image to obtain the target image.
3. The method according to claim 2, wherein the inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network comprises:
inputting the first characteristic image into a second down-sampling layer to obtain a third characteristic image output by the second down-sampling layer;
inputting the third characteristic image into a first deformable convolution layer, and adding the input and the output of the first deformable convolution layer to obtain a fourth characteristic image;
and inputting the fourth characteristic image into a second deformable convolution layer to obtain a second characteristic image output by the second deformable convolution layer.
4. The method according to any one of claims 1 to 3, wherein the inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result comprises:
inputting the target image into the target detection model to obtain a heat map output by the target detection model, and determining key points in the heat map;
fitting a Gaussian function according to the key points and the corresponding adjacent regions of the key points in the heat map, and determining the positions of the key points in the heat map according to the Gaussian function;
and mapping the positions of the key points to the image to be detected, determining target key points corresponding to the positions of the key points in the image to be detected, and determining a target object in the image to be detected according to the target key points.
5. The method of claim 4, wherein the target detection model comprises a backbone network and a deconvolution network;
correspondingly, the inputting the target image into the target detection model to obtain the heat map output by the target detection model includes:
inputting the target image into the backbone network to obtain feature data output by the backbone network, wherein the backbone network comprises one convolution layer and three bottleneck layers;
and inputting the feature data into the deconvolution network to obtain a heat map output by the deconvolution network, wherein the deconvolution network comprises one up-sampling layer and three convolution layers.
6. The method of claim 4, wherein fitting a Gaussian function to the keypoints and corresponding neighboring regions of the keypoints in the heat map, and determining keypoint locations in the heat map according to the Gaussian function comprises:
determining the pixel coordinates and score values of the adjacent areas of the key points according to the pixel coordinates of the key points;
performing Gaussian surface fitting according to the pixel coordinates and score value of the key points and the pixel coordinates and score values of the adjacent areas to obtain the Gaussian function;
and determining the coordinates of the central point of the Gaussian function as the position of the key point.
7. The method of claim 4, wherein the target detection method further comprises:
acquiring a plurality of training sample images and coordinates of key points correspondingly marked, and expanding the training sample images into square sample images;
inputting the square sample image into an initial size adjustment model to obtain a target sample image output by the initial size adjustment model, and determining the mapping coordinates of the key point coordinates in the target sample image;
converting the target sample image into a corresponding Gaussian heatmap label according to a preset Gaussian function and the mapping coordinate;
and inputting the target sample image into an initial target detection model, and adjusting parameters of the size adjustment model and the target detection model according to a Gaussian heat map and Gaussian heat map labels output by the initial target detection model.
8. A target detection apparatus, comprising:
an expansion module configured to perform zero-pixel expansion on an image to be detected, so as to expand the image to be detected into a square image;
a preprocessing module configured to input the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein a convolutional network in the size adjustment model is configured with a deformable convolution layer and a down-sampling layer; and
a target detection module configured to input the target image into a pre-trained target detection model, obtain a detection result output by the target detection model, and determine a target object in the image to be detected according to the detection result.
9. A target detection device, comprising:
a memory and one or more processors;
the memory being configured to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the target detection method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, are configured to perform the target detection method of any one of claims 1-7.
CN202210158387.8A 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium Active CN114663671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210158387.8A CN114663671B (en) 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114663671A true CN114663671A (en) 2022-06-24
CN114663671B CN114663671B (en) 2023-07-18

Family

ID=82028336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210158387.8A Active CN114663671B (en) 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114663671B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147127A1 (en) * 2016-05-12 2019-05-16 Asml Netherlands B.V. Identification of hot spots or defects by machine learning
CN110751214A (en) * 2019-10-21 2020-02-04 山东大学 Target detection method and system based on lightweight deformable convolution
CN111368684A (en) * 2020-02-27 2020-07-03 北华航天工业学院 Winter wheat automatic interpretation method based on deformable full-convolution neural network
CN111339963A (en) * 2020-02-28 2020-06-26 北京百度网讯科技有限公司 Human body image scoring method and device, electronic equipment and storage medium
CN111476159A (en) * 2020-04-07 2020-07-31 哈尔滨工业大学 Method and device for training and detecting detection model based on double-angle regression
CN111709307A (en) * 2020-05-22 2020-09-25 哈尔滨工业大学 Resolution enhancement-based remote sensing image small target detection method
CN111967399A (en) * 2020-08-19 2020-11-20 辽宁科技大学 Improved fast RCNN behavior identification method
CN112016560A (en) * 2020-08-27 2020-12-01 中国平安财产保险股份有限公司 Overlay text recognition method and device, electronic equipment and storage medium
CN112560980A (en) * 2020-12-24 2021-03-26 深圳市优必选科技股份有限公司 Training method and device of target detection model and terminal equipment
CN113160136A (en) * 2021-03-15 2021-07-23 哈尔滨理工大学 Wood defect identification and segmentation method based on improved Mask R-CNN
CN113095336A (en) * 2021-04-22 2021-07-09 北京百度网讯科技有限公司 Method for training key point detection model and method for detecting key points of target object
CN113378813A (en) * 2021-05-28 2021-09-10 陕西大智慧医疗科技股份有限公司 Modeling and target detection method and device based on attention balance feature pyramid
CN114022764A (en) * 2021-11-02 2022-02-08 湘潭大学 Remote sensing image transmission tower detection method based on feature-enhanced convolutional network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANCHAO HUANG et al.: "A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection", pages 1895-1910 *
SHENG KEFENG et al.: "Eyeball ultrasound image segmentation method based on deformable convolution and semantic embedded attention mechanism", pages 342-349 *

Also Published As

Publication number Publication date
CN114663671B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
US10803554B2 (en) Image processing method and device
WO2016062159A1 (en) Image matching method and platform for testing of mobile phone applications
CN110866871A (en) Text image correction method and device, computer equipment and storage medium
CN104200461B (en) The remote sensing image registration method of block and sift features is selected based on mutual information image
CN108986152B (en) Foreign matter detection method and device based on difference image
CN114155546B (en) Image correction method and device, electronic equipment and storage medium
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
WO2021169498A1 (en) Three-dimensional point cloud augmentation method and apparatus, storage medium, and computer device
US20220392201A1 (en) Image feature matching method and related apparatus, device and storage medium
CN116129037B (en) Visual touch sensor, three-dimensional reconstruction method, system, equipment and storage medium thereof
CN110009670A (en) The heterologous method for registering images described based on FAST feature extraction and PIIFD feature
CN116778288A (en) Multi-mode fusion target detection system and method
CN114926826A (en) Scene text detection system
KR100362171B1 (en) Apparatus, method and computer readable medium for computing a transform matrix using image feature point matching technique, and apparatus, method and computer readable medium for generating mosaic image using the transform matrix
CN114463534A (en) Target key point detection method, device, equipment and storage medium
CN114663671B (en) Target detection method, device, equipment and storage medium
CN111709338A (en) Method and device for detecting table and training method of detection model
US11699303B2 (en) System and method of acquiring coordinates of pupil center point
CN113112531B (en) Image matching method and device
CN114463503A (en) Fusion method and device of three-dimensional model and geographic information system
CN114550179A (en) Method, system and equipment for guiding handwriting Chinese character blackboard writing
CN115457120A (en) Absolute position sensing method and system under GPS rejection condition
CN112036398A (en) Text correction method and system
CN112084938A (en) Method and device for improving representation stability of plane target based on graph structure
CN117612017B (en) Environment-adaptive remote sensing image change detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant