CN114663671B - Target detection method, device, equipment and storage medium

Target detection method, device, equipment and storage medium

Info

Publication number
CN114663671B
CN114663671B
Authority
CN
China
Prior art keywords
image
target
model
output
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210158387.8A
Other languages
Chinese (zh)
Other versions
CN114663671A (en)
Inventor
埃德温·威廉·特雷霍·庞特
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PCI Technology Group Co Ltd
PCI Technology and Service Co Ltd
Original Assignee
PCI Technology Group Co Ltd
PCI Technology and Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PCI Technology Group Co Ltd, PCI Technology and Service Co Ltd
Priority to CN202210158387.8A
Publication of CN114663671A
Application granted
Publication of CN114663671B

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method, device, equipment and storage medium, wherein the method comprises the following steps: performing zero pixel expansion on an image to be detected to expand it into a square image; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein the convolutional network in the resizing model is configured with a deformable convolution layer and a downsampling layer; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result. By these technical means, the prior-art problem that feature loss and deformation of the image to be detected lead to low target detection accuracy is solved, the integrity and accuracy of the image feature data extracted by the deep learning model are ensured, and the accuracy of target detection is improved.

Description

Target detection method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computer vision, in particular to a target detection method, a target detection device, target detection equipment and a storage medium.
Background
Object detection is a computer-vision technique for locating a target object in an image, and it generally detects the target object through a deep learning model. However, the input image of a deep learning model has a fixed size while images to be detected vary in size, so the image to be detected needs to be preprocessed and converted into an image that meets the size requirement of the deep learning model.
The current preprocessing method in target detection scales the image to be detected by interpolation according to the ratio between its size and the input-image size of the deep learning model. However, scaling often loses part of the image features and deforms others, which affects the integrity and accuracy of the image feature data extracted by the deep learning model and reduces the accuracy of target detection.
Disclosure of Invention
The embodiment of the application provides a target detection method, device, equipment and storage medium, which solve the prior-art problem of low target detection accuracy caused by feature loss and deformation of the image to be detected, ensure the integrity and accuracy of the image feature data extracted by the deep learning model, and improve the accuracy of target detection.
In a first aspect, an embodiment of the present application provides a target detection method, including:
zero pixel expansion is carried out on the image to be detected so as to expand the image to be detected into a square image;
inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining a target object in the image to be detected according to the detection result.
Further, the inputting the square image into a pre-trained resizing model to obtain a target image output by the resizing model includes:
inputting the square image into a first convolution network to obtain a first characteristic image output by the first convolution network, wherein the first convolution network comprises two convolution layers;
inputting the first characteristic image into a second convolution network to obtain a second characteristic image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and a downsampling layer;
and inputting the square image into a first downsampling layer, and adding the output of the first downsampling layer to the second characteristic image to obtain the target image.
Further, the inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network includes:
inputting the first characteristic image into a second downsampling layer to obtain a third characteristic image output by the second downsampling layer;
inputting the third characteristic image into a first deformable convolution layer, and adding the input and the output of the first deformable convolution layer to obtain a fourth characteristic image;
and inputting the fourth characteristic image into a second deformable convolution layer to obtain a second characteristic image output by the second deformable convolution layer.
Further, the inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result includes:
inputting the target image into the target detection model to obtain a heat map output by the target detection model, and determining key points in the heat map;
fitting a Gaussian function according to the key points and the adjacent areas corresponding to the key points in the heat map, and determining the positions of the key points in the heat map according to the Gaussian function;
mapping the key point position to the image to be detected, determining a target key point corresponding to the key point position in the image to be detected, and determining a target object in the image to be detected according to the target key point.
Further, the target detection model comprises a backbone network and a deconvolution network;
correspondingly, the step of inputting the target image into the target detection model to obtain a heat map output by the target detection model includes:
inputting the target image into the backbone network to obtain characteristic data output by the backbone network, wherein the backbone network comprises a convolution layer and three bottleneck layers;
and inputting the characteristic data into the deconvolution network to obtain a heat map output by the deconvolution network, wherein the deconvolution network comprises an up-sampling layer and three convolution layers.
Further, the fitting a gaussian function according to the keypoints and the adjacent areas corresponding to the keypoints in the heat map, and determining the positions of the keypoints in the heat map according to the gaussian function includes:
determining the pixel coordinates and score values of the adjacent areas of the key points according to the pixel coordinates of the key points;
performing Gaussian surface fitting according to the score value and the pixel coordinate of the key point and the pixel coordinate and the score value of the adjacent area to obtain the Gaussian function;
and determining the coordinates of the central point of the Gaussian function as the position of the key point.
Further, the target detection method further includes:
acquiring a plurality of training sample images and key point coordinates of corresponding marks, and expanding the training sample images into square sample images;
inputting the square sample image into an initial size adjustment model to obtain a target sample image output by the initial size adjustment model, and determining the mapping coordinates of the key point coordinates in the target sample image;
converting the target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates;
and inputting the target sample image into an initial target detection model, and adjusting parameters of the size adjustment model and the target detection model according to a Gaussian heat map and a Gaussian heat map label output by the initial target detection model.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
the expansion module is configured to perform zero pixel expansion on an image to be detected so as to expand the image to be detected into a square image;
the preprocessing module is configured to input the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
the target detection module is configured to input the target image into a pre-trained target detection model, obtain a detection result output by the target detection model, and determine a target object in the image to be detected according to the detection result.
Further, the preprocessing module includes: the first feature extraction module is configured to input the square image into a first convolution network to obtain a first feature image output by the first convolution network, wherein the first convolution network comprises two convolution layers; the second feature extraction module is configured to input the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and a downsampling layer; and the first residual error module is configured to input the square image into a first downsampling layer and add the output of the first downsampling layer with the second characteristic image to obtain the target image.
Further, the second feature extraction module includes: the downsampling unit is configured to input the first characteristic image into a second downsampling layer to obtain a third characteristic image output by the second downsampling layer; a second residual unit configured to input the third feature image into a first deformable convolution layer and add the input and output of the first deformable convolution layer to obtain a fourth feature image; and the feature extraction unit is configured to input the fourth feature image into a second deformable convolution layer to obtain a second feature image output by the second deformable convolution layer.
Further, the object detection module includes: a heat map acquisition unit configured to input the target image into the target detection model, obtain a heat map output by the target detection model, and determine key points in the heat map; a keypoint location determining unit configured to fit a gaussian function according to the keypoints and corresponding neighboring regions of the keypoints in the heat map, and determine the keypoint locations in the heat map according to the gaussian function; the mapping unit is configured to map the key point position into the image to be detected, determine a target key point corresponding to the key point position in the image to be detected, and determine a target object in the image to be detected according to the target key point.
Further, the target detection model comprises a backbone network and a deconvolution network; correspondingly, the heat map acquisition unit includes: the main network subunit is configured to input the target image into the main network to obtain characteristic data output by the main network, and the main network comprises a convolution layer and three bottleneck layers; and the deconvolution subunit is configured to input the characteristic data into the deconvolution network to obtain a heat map output by the deconvolution network, and the deconvolution network comprises an up-sampling layer and three convolution layers.
Further, the keypoint location determining unit includes: a neighboring region subunit configured to determine pixel coordinates and score values of neighboring regions of the key point according to the pixel coordinates of the key point; the Gaussian function subunit is configured to perform Gaussian surface fitting according to the score value and the pixel coordinate of the key point and the pixel coordinate and the score value of the adjacent area to obtain the Gaussian function; a location determination subunit configured to determine a center point coordinate of the gaussian function as the keypoint location.
Further, the object detection device includes: the sample acquisition module is configured to acquire a plurality of training sample images and key point coordinates of corresponding marks, and expand the training sample images into square sample images; the sample preprocessing module is configured to input the square sample image into an initial size adjustment model, obtain a target sample image output by the initial size adjustment model, and determine the mapping coordinates of the key point coordinates in the target sample image; the sample heat map acquisition module is configured to convert the target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates; and the model training module is configured to input the target sample image into an initial target detection model, and adjust parameters of the size adjustment model and the target detection model according to a Gaussian heat map and a Gaussian heat map label output by the initial target detection model.
In a third aspect, an embodiment of the present application provides an object detection apparatus, including:
a memory and one or more processors;
the memory is used for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the target detection method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the object detection method according to the first aspect.
The method comprises the steps of performing zero pixel expansion on an image to be detected to expand it into a square image; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein the convolutional network in the resizing model is configured with a deformable convolution layer and a downsampling layer; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result. Through these technical means, the size adjustment model compresses the image to be detected to the target size through the downsampling layer while the offsets in the deformable convolution layer adjust the convolution positions to fit the shape of the target object, so that more comprehensive features are extracted; distortion and deformation of the image features caused by size compression are avoided, the integrity and accuracy of the image features in the input image of the target detection model are improved, and the accuracy of target detection is improved.
Drawings
FIG. 1 is a flow chart of a method for object detection provided in one embodiment of the present application;
FIG. 2 is a schematic diagram of an image to be detected expanding to a square image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a network structure of a resizing model provided in an embodiment of the present application;
FIG. 4 is a flow chart for resizing a square image by a resizing model provided in embodiments of the present application;
FIG. 5 is a flow chart of extracting a second feature image through a second convolution network provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a square image provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a conventional target image according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a target image according to an embodiment of the present disclosure;
FIG. 9 is a flowchart for determining a target object in an image to be detected according to an embodiment of the present application;
FIG. 10 is a flowchart of an object detection model extraction heat map provided by an embodiment of the present application;
FIG. 11 is a flow chart for determining keypoint locations based on Gaussian fitting provided by an embodiment of the application;
FIG. 12 is a schematic diagram of an output heat map of the object detection model provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of a critical point adjacent area provided in an embodiment of the present application;
FIG. 14 is a schematic view of an image including a palm provided in an embodiment of the present application;
FIG. 15 is a flow chart of model training provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of a target detection apparatus according to an embodiment of the present disclosure;
FIG. 17 is a schematic structural diagram of an object detection device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the following detailed description of specific embodiments thereof is given with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the application and not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the matters related to the present application are shown in the accompanying drawings. Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The target detection method, device, equipment and storage medium aim at expanding an image to be detected into a square image by performing zero pixel expansion on it; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein the convolutional network in the resizing model is configured with a deformable convolution layer and a downsampling layer; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result. By contrast, the traditional preprocessing method in target detection scales the image to be detected by interpolation according to the ratio between its size and the input-image size of the deep learning model; when the image is scaled, this method often loses part of the image features and deforms others, affecting the integrity and accuracy of the image feature data and reducing the accuracy of target detection. The target detection method, device, equipment and storage medium of the embodiments of the application are therefore provided to solve this problem of low accuracy in existing target detection.
Fig. 1 is a flowchart of a target detection method according to an embodiment of the present application. The object detection method provided in this embodiment may be performed by an object detection device, where the object detection device may be implemented by software and/or hardware, and the object detection device may be configured by two or more physical entities or may be configured by one physical entity.
The following description will be made taking an object detection apparatus as an example of a main body that performs an object detection method. Referring to fig. 1, the target detection method includes:
S110, performing zero pixel expansion on the image to be detected to expand the image to be detected into a square image.
The image to be detected is an original image on which target detection needs to be performed, and it contains the target object to be detected. The image to be detected may be a gray image directly collected by an infrared camera, or a gray image obtained by gray processing of a color image collected by an RGB camera. Further, since original images collected by different cameras or camera modes differ in size while the input image of the target detection model has a fixed target size, the original image needs to be scaled to the target size before the target object in it is detected by the target detection model.
In an embodiment, the target size of the input image of the target detection model of the present embodiment may be set to 128×128, that is, the input image is a square image. If the image to be detected is non-square, directly scaling it into a square image would deform the features in the image and affect the accuracy of target detection. Therefore, before the image to be detected is scaled to the target size, it is placed inside a square image whose side length equals the larger of its width and height, and the remaining area of the square image outside the image to be detected is filled with zero pixels, yielding the square image corresponding to the image to be detected. In this embodiment, fig. 2 is a schematic diagram of the image to be detected provided in the embodiment of the present application expanding into a square image. As shown in fig. 2, if the height H of the image to be detected is greater than the width W, a zero-pixel region of size H×(H−W) is appended on the right side of the image to be detected to obtain the corresponding square image. If the height H is smaller than the width W, a zero-pixel region of size (W−H)×W is appended at the lower end of the image to be detected to obtain the corresponding square image, where × denotes the height×width size. It should be noted that the image to be detected may be placed at any position in the square image, and is not limited to the left side or the upper end.
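For illustration only (this sketch is not part of the patent text), the zero-pixel expansion of step S110 might look as follows in Python/NumPy, assuming a single-channel grayscale image placed at the top-left corner (the text notes any placement works):

```python
import numpy as np

def pad_to_square(image: np.ndarray) -> np.ndarray:
    """Expand a grayscale image into a square by zero-pixel padding."""
    h, w = image.shape
    side = max(h, w)
    square = np.zeros((side, side), dtype=image.dtype)
    square[:h, :w] = image  # original content kept at the top-left
    return square
```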
S120, inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer.
The size adjustment model is a convolutional network for adjusting the size of the square image of the image to be detected to the target size. The target image is the input image of the target detection model, and the target size is the size of that input image. A deformable convolution layer and a downsampling layer are configured in the convolutional network; the downsampling layer reduces the image resolution to bring the image down to the target size, while the deformable convolution layer preserves shape features in the image as it is compressed. Illustratively, a deformable convolution layer is a convolution layer whose convolution positions can be deformed: it learns the shapes in an image and adjusts the convolution positions by offsets to extract shape features from the image. Shape features are important when the target detection model detects a target object; extracting them through deformable convolution avoids the deformation and distortion of shape features caused by image compression, which would otherwise affect the accuracy of target detection.
In an embodiment, fig. 3 is a schematic diagram of a network structure of a resizing model according to an embodiment of the present application. As shown in fig. 3, the resizing model includes a first convolution network, a second convolution network, and a first downsampling layer, the input of the resizing model is the input of the first convolution network and the first downsampling layer, the output of the first convolution network is the input of the second convolution network, and the output of the second convolution network and the output of the first downsampling layer are added to the output of the resizing model. In this embodiment, fig. 4 is a flowchart of resizing a square image by a resizing model provided in embodiments of the present application. As shown in fig. 4, the step of resizing the square image by the resizing model specifically includes S1201-S1203:
S1201, inputting the square image into a first convolution network to obtain a first characteristic image output by the first convolution network, wherein the first convolution network comprises two convolution layers.
Referring to fig. 3, the first convolution network includes a first two-dimensional convolution layer and a second two-dimensional convolution layer, the first two-dimensional convolution layer and the second two-dimensional convolution layer being connected in series to form the first convolution network. And after the square image is input into the first convolution network, sequentially extracting depth features in the square image through the first two-dimensional convolution layer and the second two-dimensional convolution layer to obtain a first feature image. In one embodiment, the first two-dimensional convolution layer has a channel number of 16, a convolution kernel size of 7, a step size of 1, and a fill pixel number of 3. The number of channels of the second two-dimensional convolution layer is 16, the convolution kernel size is 3, the step size is 1, and the number of filling pixels is 1. The first characteristic image output by the first convolution network has the same size as the square image.
S1202, inputting the first characteristic image into a second convolution network to obtain a second characteristic image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and a downsampling layer.
Referring to fig. 3, the second convolution network includes a first deformable convolution layer, a second deformable convolution layer, and a second downsampling layer, an input of the second convolution network is an input of the second downsampling layer, an output of the second downsampling layer is an input of the first deformable convolution layer, an output of the first deformable convolution layer is added to the input to the second deformable convolution layer, and an output of the second deformable convolution layer is an output of the second convolution network. In one embodiment, the second downsampling layer uses bilinear interpolation to resize the input image of the second convolutional network to the target size. The first deformable convolution layer has a channel number of 16, a convolution kernel size of 3, a step size of 1, and a number of fill pixels of 1. The second deformable convolution layer has a channel number of 1, a convolution kernel size of 3, a step size of 1, and a number of filler pixels of 1. It should be noted that only the second downsampling layer in the second convolution network adjusts the size of the corresponding input image, while the deformable convolution layer does not, i.e. the input/output size of the deformable convolution layer remains unchanged.
In this embodiment, fig. 5 is a flowchart of extracting a second feature image through a second convolution network provided in an embodiment of the present application. As shown in fig. 5, the step of extracting the second feature image through the second convolution network specifically includes S12021 to S12023:
S12021, inputting the first feature image into the second downsampling layer to obtain a third feature image output by the second downsampling layer.
Illustratively, the second downsampling layer compresses the size of the first feature image extracted by the first convolutional network to a target size, resulting in a third feature image.
S12022, inputting the third feature image into the first deformable convolution layer, and adding the input and output of the first deformable convolution layer to obtain a fourth feature image.
Illustratively, the first deformable convolution layer is composed of a conventional two-dimensional convolution layer and an offset convolution layer. The kernel of the conventional two-dimensional convolution layer convolves with the pixel values at the corresponding positions of the input image, while the kernel of the offset convolution layer produces offsets that learn the feature shapes in the input image. The offsets adjust the convolution positions of the conventional two-dimensional convolution layer so that its kernel extracts the shape features of the input image, avoiding the deformation and distortion of the image caused by size compression.
In this embodiment, the resized third feature image is input into the first deformable convolution layer, which extracts shape features from it. In order to avoid vanishing or exploding gradients caused by excessive network depth, a residual network structure is introduced: the input and output features of the first deformable convolution layer are added, so that the input features are preserved.
S12023, inputting the fourth characteristic image into the second deformable convolution layer to obtain a second characteristic image output by the second deformable convolution layer.
Similarly, the second deformable convolution layer is also comprised of a conventional two-dimensional convolution layer and an offset convolution layer. And extracting shape features in the fourth feature image through the second deformable convolution layer to obtain a second feature image output by the second deformable convolution layer.
S1203, inputting the square image into the first downsampling layer, and adding the output of the first downsampling layer to the second feature image to obtain the target image.
Wherein the first downsampling layer adopts bilinear interpolation to adjust the size of the square image to the target size. The second feature image is the depth feature image obtained after depth features are extracted from the square image through the first and second convolution networks, and a residual network structure is again introduced to avoid vanishing or exploding gradients caused by excessive network depth. The output image of the first downsampling layer retains the characteristics of the image to be detected, and adding it to the output of the second deformable convolution layer yields a target image that retains those characteristics.
In an embodiment, the convolution layers in the first and second convolution networks and the deformable convolution layers are configured with an activation function, which adds nonlinear factors to the convolution networks so that effective depth features can be extracted from the image. Illustratively, the activation function may be a ReLU function.
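For illustration only, the resizing model of fig. 3 might be sketched as follows in PyTorch, assuming a single-channel grayscale input, a 128×128 target size, ReLU activations as noted above, and torchvision's DeformConv2d with a learned offset branch (a standard deformable-convolution construction that the text does not spell out):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Deformable conv: an offset branch predicts 2 offsets per kernel tap,
    shifting the sampling positions of the 3x3 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.deform(x, self.offset(x)))

class ResizeModel(nn.Module):
    def __init__(self, target=128):
        super().__init__()
        self.target = target
        # first convolution network: 16 ch / k7 / s1 / p3, then 16 ch / k3 / s1 / p1
        self.conv1 = nn.Conv2d(1, 16, kernel_size=7, stride=1, padding=3)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1)
        # second convolution network: two deformable conv layers (16 ch, then 1 ch)
        self.dcn1 = DeformBlock(16, 16)
        self.dcn2 = DeformBlock(16, 1)

    def forward(self, x):                       # x: (N, 1, S, S) square image
        size = (self.target, self.target)
        f1 = F.relu(self.conv2(F.relu(self.conv1(x))))      # first feature image
        f3 = F.interpolate(f1, size, mode='bilinear')       # second downsampling layer
        f4 = f3 + self.dcn1(f3)                 # residual add around first deformable conv
        f2 = self.dcn2(f4)                      # second feature image
        shortcut = F.interpolate(x, size, mode='bilinear')  # first downsampling layer
        return shortcut + f2                    # target image
```

Note that only the two interpolation steps change the spatial size, matching the remark above that the deformable convolution layers keep their input/output size unchanged.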
In an embodiment, fig. 6 is a schematic diagram of a square image provided in an embodiment of the present application. Fig. 7 is a schematic diagram of a conventional target image according to an embodiment of the present application. Fig. 8 is a schematic diagram of a target image according to an embodiment of the present application. As shown in fig. 6 to 8, the square image has a size of 640×640, the conventional target image has a size of 128×128, and the conventional target image is an input image of the target detection model obtained by compressing the square image by bilinear interpolation in the conventional preprocessing method. Referring to fig. 7, the shape of a target object in a conventional target image obtained by bilinear interpolation may be deformed and distorted, affecting the accuracy of target detection. Referring to fig. 8, the target image in fig. 8 is an input image of a target detection model obtained by compressing a square image by a resizing model. The size adjustment model effectively reserves the shape characteristics of the target object while compressing the image size, and ensures the accuracy of target detection.
S130, inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining a target object in the image to be detected according to the detection result.
In an embodiment, the target detection model is a network model for detecting a keypoint of the target object, the target image is input into the target detection model to obtain a heat map output by the target detection model, the position of the keypoint is determined according to the heat map, and the target object in the original image is determined based on the position of the keypoint. In this embodiment, fig. 9 is a flowchart of determining a target object in an original image provided in an embodiment of the present application. As shown in fig. 9, the step of determining the target object in the original image specifically includes S1301-S1303:
S1301, inputting the target image into a target detection model, obtaining a heat map output by the target detection model, and determining key points in the heat map.
The target detection model is a deep learning model for extracting a heat map from the target image; its input is the target image and its output is the heat map of the target image. The score of each pixel in the output heat map is calculated, and the pixel with the highest score is determined as the key point in the output heat map.
In an embodiment, the object detection model is composed of a backbone network and a deconvolution network; the backbone network extracts depth features of the target image, the depth features serve as input to the deconvolution network, and the deconvolution network outputs a heat map according to the depth features. In this embodiment, fig. 10 is a flowchart of the object detection model extracting a heat map provided in an embodiment of the present application. As shown in fig. 10, the step of extracting a heat map by the object detection model specifically includes S13011-S13012:
S13011, inputting the target image into a backbone network to obtain feature data output by the backbone network, wherein the backbone network comprises a convolution layer and three bottleneck layers.
The backbone network in this embodiment adopts the model structure of MobileNetV2, which includes one convolution layer and three bottleneck layers; the convolution layer is a two-dimensional convolution layer. The backbone network extracts depth feature data from the target image through the convolution layer and the bottleneck layers. In addition, compared with the traditional MobileNetV2 structure, the backbone network in this embodiment removes a repeated up-sampling layer and a cropping layer, greatly reducing inference time without sacrificing model accuracy.
S13012, inputting the characteristic data into a deconvolution network to obtain a heat map output by the deconvolution network, wherein the deconvolution network comprises an up-sampling layer and three convolution layers.
Illustratively, the number of heat maps output by the deconvolution network is the same as the number of target key points. For example, if the key points are the connection point of the index finger and the middle finger, the connection point of the middle finger and the ring finger, and the connection point of the ring finger and the little finger, then correspondingly the deconvolution network outputs three heat maps, each containing one connection point.
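For illustration only, the detection model described in S13011-S13012 might be sketched as follows; the channel widths, strides, and bottleneck hyper-parameters are assumptions, since the text fixes only the layer counts (one convolution plus three bottlenecks in the backbone, one up-sampling layer plus three convolutions in the deconvolution network) and one heat map per key point:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """MobileNetV2-style inverted residual (simplified stand-in)."""
    def __init__(self, ch, expand=4):
        super().__init__()
        mid = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.ReLU6(inplace=True),                         # expand
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid), nn.ReLU6(inplace=True), # depthwise
            nn.Conv2d(mid, ch, 1),                                                 # project
        )

    def forward(self, x):
        return x + self.block(x)

class DetectionModel(nn.Module):
    def __init__(self, num_keypoints=3, ch=32):
        super().__init__()
        self.backbone = nn.Sequential(          # one conv layer + three bottleneck layers
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            Bottleneck(ch), Bottleneck(ch), Bottleneck(ch),
        )
        self.deconv = nn.Sequential(            # one up-sampling layer + three conv layers
            nn.Upsample(scale_factor=2, mode='bilinear'),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, num_keypoints, 1),    # one heat map per key point
        )

    def forward(self, x):
        return self.deconv(self.backbone(x))
```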
And S1302, fitting a Gaussian function according to the key points and the adjacent areas corresponding to the key points in the heat map, and determining the positions of the key points in the heat map according to the Gaussian function.
The key point position refers to the floating-point position coordinates of the key point in the heat map. Since the key point with the highest score in the heat map is a pixel of the heat map, its pixel coordinates are integer values, and mapping integer-valued pixel coordinates into the image to be detected determines a pixel area rather than a single pixel, so an offset error occurs in key point detection and the detection result is not accurate enough. Therefore, this embodiment determines the floating-point position coordinates of the key point by Gaussian-fitting the score distribution of the key point and its adjacent area in the heat map. In this embodiment, FIG. 11 is a flow chart for determining keypoint locations based on Gaussian fitting as provided by embodiments of the application. As shown in fig. 11, the step of determining the positions of the key points according to Gaussian fitting specifically includes S13021-S13023:
S13021, determining the pixel coordinates and score values of the neighboring areas of the key point according to the pixel coordinates of the key point.
Fig. 12 is a schematic diagram of an output heat map of the object detection model provided in an embodiment of the present application. As shown in fig. 12, fig. 12 is one of the heat maps output by the object detection model; the pixel with the greatest brightness in the heat map is the key point, and its pixel coordinates in the heat map are (14, 10). The adjacent area of the key point refers to the area formed by pixels whose distance from the key point is less than a certain threshold. Fig. 13 is a schematic diagram of the adjacent area of a key point according to an embodiment of the present application. As shown in fig. 13, the adjacent area may be a rectangular region centered on and surrounding the key point, so that the adjacent area and the key point together form a 3×3 rectangular region. In this embodiment, when the pixel coordinates of the key point are (14, 10), the pixel coordinates of the corresponding adjacent area are (13, 9), (13, 10), (13, 11), (14, 9), (14, 11), (15, 9), (15, 10) and (15, 11). The score value of the pixel at each of these coordinates is then read from the corresponding heat map.
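For illustration only, locating the key point and reading out its 3×3 neighborhood might look like this (assuming, as in fig. 13, that the peak does not lie on the heat-map border):

```python
import numpy as np

def keypoint_and_patch(heatmap: np.ndarray):
    """Return the integer (x, y) of the highest-scoring pixel and the
    3x3 patch of score values centered on it."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    patch = heatmap[y - 1:y + 2, x - 1:x + 2]
    return (x, y), patch
```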
S13022, performing Gaussian surface fitting according to the score value and pixel coordinates of the key point and the pixel coordinates and score values of the adjacent region to obtain the Gaussian function.
The present embodiment is described taking the case where the adjacent area and the key point constitute a 3×3 rectangular region. From the above, the expression of the Gaussian function is

$$f(x, y) = \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right)$$

where the standard deviation σ of the Gaussian distribution is the same as the one adopted in training the model, f(x, y) is the score value at pixel coordinates (x, y), and (x0, y0) are the center-point coordinates of the Gaussian function and also its unknowns. Substituting the score value and pixel coordinates of the key point into the Gaussian function yields one equation in x0 and y0, and substituting the pixel coordinates and score values of the adjacent area yields a system of equations in x0 and y0. Solving the equation and the system together gives x0 and y0 and thereby the Gaussian function.
S13023, determining the coordinates of the center point of the Gaussian function as the key point position.
For example, since the function value at the center point of the Gaussian function is its maximum, the score value at the center point of the Gaussian function fitted from the key point and its adjacent area is the maximum over the whole rectangular region, so the center-point coordinates (x0, y0) of the Gaussian function are taken as the floating-point coordinates of the key point.
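For illustration only, one way to solve the equation system of S13022: taking the logarithm of the Gaussian turns it into a quadratic, and differencing opposite neighbors cancels the quadratic terms, giving a closed form for the center. This sketch assumes the 3×3 patch from above and a known σ:

```python
import numpy as np

def subpixel_refine(x, y, patch, sigma):
    """Gaussian-fit refinement of the integer peak (x, y).

    From ln f = -((x - x0)**2 + (y - y0)**2) / (2 * sigma**2):
        ln f(x+1, y) - ln f(x-1, y) = 2 * (x0 - x) / sigma**2
    so  x0 = x + (sigma**2 / 2) * (ln f(x+1, y) - ln f(x-1, y)),
    and likewise for y0.
    """
    p = np.log(np.maximum(patch, 1e-10))   # guard against log(0)
    x0 = x + 0.5 * sigma ** 2 * (p[1, 2] - p[1, 0])
    y0 = y + 0.5 * sigma ** 2 * (p[2, 1] - p[0, 1])
    return x0, y0
```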
S1303, mapping the positions of the key points into the image to be detected, determining the target key points corresponding to the key points in the image to be detected, and determining the target object in the image to be detected according to the target key points.
After determining the position of the key point in the heat map according to Gaussian fitting, the corresponding pixel in the original image is determined from that position and taken as the target key point. In one embodiment, the pixel coordinates to which the key point position maps in the square image are determined as the pixel coordinates of the target key point according to the size ratio of the heat map to the square image.
Further, suppose the original image contains a palm and the palm center is the target object; a definite positional relationship exists between the palm center and the connection points of the fingers. After the three heat maps output by the target detection model are obtained, the floating-point coordinate position of the connection point in each heat map is determined and mapped into the original image, determining the three connection points in the palm image. Fig. 14 is a schematic view of an image including a palm according to an embodiment of the present application. As shown in fig. 14, points A, B and C are the three target key points in the palm image, where point A is the connection point between the little finger and the ring finger, point B is the connection point between the middle finger and the ring finger, and point C is the connection point between the middle finger and the index finger. p1, p2, p3 and p4 are the four vertices of the region of interest; a definite positional relationship exists between p1, p2, p3, p4 and points A, B and C, so the pixel coordinates of p1, p2, p3 and p4 can be uniquely determined from the pixel coordinates of points A, B and C, and the target object is then cropped from the original image according to p1, p2, p3 and p4.
In one embodiment, FIG. 15 is a flow chart of model training provided by embodiments of the present application. As shown in fig. 15, the model training includes steps S210 to S240:
S210, acquiring a plurality of training sample images and key point coordinates of corresponding marks, and expanding the training sample images into square sample images.
The training sample image is an image obtained by gray processing of an image acquired by a camera. The training sample image is expanded into a square sample image according to its length and width, and the pixel coordinates in the square sample image corresponding to the marked key point coordinates in the training sample image are determined.
S220, inputting the square sample image into an initial size adjustment model to obtain a target sample image output by the initial size adjustment model, and determining the mapping coordinates of the key point coordinates in the target sample image.
Wherein the initial resizing model is an untrained resizing model. Illustratively, a square sample image is input into an initial resizing model to obtain a target sample image output by the initial resizing model. And scaling the coordinates of the key point pixels in the square sample image in equal proportion according to the size ratio of the square sample image to the target sample image, so as to obtain the coordinates of the key point pixels in the target sample image.
S230, converting the target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates.
Illustratively, the expression of the Gaussian function is:

$$f(x, y) = \exp\left(-\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2}\right)$$

where (μx, μy) are the pixel coordinates of the key point in the target sample image, σ is the standard deviation of the Gaussian distribution and can be set according to actual requirements, and f(x, y) is the score value of the Gaussian heat map at the pixel with coordinates (x, y). Substituting the pixel coordinates of each pixel of the target sample image together with the key point coordinates into the Gaussian function gives the score value of every pixel in the Gaussian heat map, finally yielding the Gaussian heat map label.
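For illustration only, rendering the Gaussian heat map label for one key point with the expression above might look like:

```python
import numpy as np

def gaussian_label(height, width, mu_x, mu_y, sigma):
    """Score value f(x, y) for every pixel of a (height x width) heat map."""
    xs = np.arange(width)[None, :]    # pixel x coordinates, broadcast over rows
    ys = np.arange(height)[:, None]   # pixel y coordinates, broadcast over columns
    return np.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))
```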
S240, inputting the target sample image into an initial target detection model, and adjusting parameters of a size adjustment model and the target detection model according to a Gaussian heat map and a Gaussian heat map label output by the initial target detection model.
Wherein the initial target detection model is an untrained target detection model. In the training stage, the heat map output by the initial target detection model and the Gaussian heat map label corresponding to the target sample image are substituted into a loss function, and the model parameters of the size adjustment model and the target detection model are adjusted according to the loss output by the loss function until the model converges or the number of iterations reaches the upper limit.
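For illustration only, a joint training step might be sketched as follows; the text leaves the loss function unspecified, so a mean-squared error between the predicted heat maps and the Gaussian heat map labels is assumed here:

```python
import torch
import torch.nn.functional as F

def train_step(resize_model, detect_model, optimizer, square_images, heatmap_labels):
    optimizer.zero_grad()
    target_images = resize_model(square_images)       # (N, 1, 128, 128)
    pred_heatmaps = detect_model(target_images)       # (N, K, h, w)
    loss = F.mse_loss(pred_heatmaps, heatmap_labels)  # assumed loss
    loss.backward()                # gradients flow through both models
    optimizer.step()               # adjusts both models' parameters
    return loss.item()

# one optimizer over both models, so both are adjusted as described:
# optimizer = torch.optim.Adam(
#     list(resize_model.parameters()) + list(detect_model.parameters()))
```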
In summary, the size of the image to be detected is compressed to the target size through the downsampling layer in the size adjustment model, and the convolution positions are adjusted through the offsets in the deformable convolution layer to fit the shape of the target object in the image to be detected, extracting more comprehensive features; distortion and deformation of the image features caused by size compression are thereby avoided, the integrity and accuracy of the image features in the input image of the target detection model are improved, and the accuracy of target detection is improved. In addition, the heat map of the target image is obtained through the target detection model and the key points are determined; a Gaussian function is fitted to the key point and its adjacent area, and the center-point coordinates of the Gaussian function are taken as the floating-point coordinates of the key point, avoiding the offset error caused by mapping integer-valued key points into the image to be detected and improving the accuracy of target key point detection. Finally, the target object in the image to be detected is rapidly extracted through the positional relationship between the target key points and the target object, improving the efficiency and accuracy of target object extraction and further improving the accuracy of target detection.
On the basis of the above embodiments, fig. 16 is a schematic structural diagram of an object detection device according to an embodiment of the present application. Referring to fig. 16, the object detection apparatus provided in this embodiment specifically includes: an expansion module 31, a preprocessing module 32 and a target detection module 33.
The expansion module is configured to perform zero pixel expansion on the image to be detected so as to expand the image to be detected into a square image;
the preprocessing module is configured to input the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
the target detection module is configured to input a target image into a pre-trained target detection model, obtain a detection result output by the target detection model, and determine a target object in the image to be detected according to the detection result.
On the basis of the above embodiment, the preprocessing module includes: the first feature extraction module is configured to input a square image into a first convolution network to obtain a first feature image output by the first convolution network, and the first convolution network comprises two convolution layers; the second feature extraction module is configured to input the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, and the second convolution network comprises two deformable convolution layers and a downsampling layer; and the first residual error module is configured to input the square image into the first downsampling layer and add the output of the first downsampling layer with the second characteristic image to obtain a target image.
On the basis of the above embodiment, the second feature extraction module includes: the downsampling unit is configured to input the first characteristic image into the second downsampling layer to obtain a third characteristic image output by the second downsampling layer; a second residual unit configured to input a third feature image into the first deformable convolution layer and add the input and output of the first deformable convolution layer to obtain a fourth feature image; and the feature extraction unit is configured to input the fourth feature image into the second deformable convolution layer to obtain a second feature image output by the second deformable convolution layer.
On the basis of the above embodiment, the object detection module includes: the heat map acquisition unit is configured to input the target image into the target detection model, obtain a heat map output by the target detection model and determine key points in the heat map; the key point position determining unit is configured to fit a Gaussian function according to the key points and the adjacent areas corresponding to the key points in the heat map, and determine the positions of the key points in the heat map according to the Gaussian function; the mapping unit is configured to map the positions of the key points into the image to be detected, determine target key points corresponding to the key points in the image to be detected, and determine target objects in the image to be detected according to the target key points.
On the basis of the above embodiment, the object detection model includes a backbone network and a deconvolution network; correspondingly, the heat map acquisition unit includes: the main network subunit is configured to input the target image into a main network to obtain characteristic data output by the main network, and the main network comprises a convolution layer and three bottleneck layers; and the deconvolution subunit is configured to input the characteristic data into a deconvolution network to obtain a heat map output by the deconvolution network, and the deconvolution network comprises an up-sampling layer and three convolution layers.
On the basis of the above embodiment, the key point position determining unit includes: a neighboring region subunit configured to determine pixel coordinates and score values of neighboring regions of the key point according to the pixel coordinates of the key point; the Gaussian function subunit is configured to perform Gaussian surface fitting according to the score value and the pixel coordinate of the key point and the pixel coordinate and the score value of the adjacent area to obtain a Gaussian function; and a position determination subunit configured to determine the center point coordinates of the gaussian function as the keypoint positions.
On the basis of the above-described embodiment, the object detection device includes: a sample acquisition module configured to acquire a plurality of training sample images together with their labeled key point coordinates, and expand the training sample images into square sample images; a sample preprocessing module configured to input a square sample image into an initial size adjustment model, obtain a target sample image output by the initial size adjustment model, and determine the mapping coordinates of the key point coordinates in the target sample image; a sample heat map acquisition module configured to convert the target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates; and a model training module configured to input the target sample image into an initial target detection model and adjust the parameters of the size adjustment model and the target detection model according to the Gaussian heat map output by the initial target detection model and the Gaussian heat map label.
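A joint training step might look like the sketch below. Rendering the Gaussian heat map label from the mapped coordinates follows the usual recipe; the MSE loss, the sigma value, and a single shared optimiser over both models are assumptions, as the text says only that the parameters of both models are adjusted from the discrepancy between the output Gaussian heat map and its label.

```python
import torch
import torch.nn.functional as F

def gaussian_label(h, w, cx, cy, sigma=2.0):
    """Render one heat-map channel: a Gaussian centred on the mapped
    key point coordinates (cx, cy); sigma is an assumed hyperparameter."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def train_step(resize_model, detect_model, optimiser, square_sample, labels):
    """One joint update of the size adjustment model and the target
    detection model from a batch of square sample images and their
    stacked Gaussian heat map labels."""
    target_sample = resize_model(square_sample)
    pred = detect_model(target_sample)
    loss = F.mse_loss(pred, labels)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

In this sketch the optimiser would be built over both parameter sets, e.g. torch.optim.Adam(list(resize_model.parameters()) + list(detect_model.parameters())), so that the gradient of the heat-map loss adjusts the size adjustment model and the target detection model together.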
In summary, in the target detection device provided by this embodiment of the application, the downsampling layers in the size adjustment model compress the image to be detected to the target size, while the offsets in the deformable convolution layers shift the convolution sampling positions to fit the shape of the target object in the image to be detected. More comprehensive features are thereby extracted, distortion and deformation of image features caused by size compression are avoided, the integrity and accuracy of the image features in the input to the target detection model are preserved, and the accuracy of target detection is improved. In addition, the heat map of the target image is obtained through the target detection model and key points are determined; a Gaussian surface is fitted to each key point and its adjacent area to obtain the corresponding Gaussian function, and the center point coordinates of that Gaussian function serve as the floating-point coordinates of the key point. This avoids the rounding error that would be introduced if integer-valued key points were mapped back to the image to be detected, and improves the accuracy of target key point detection. Finally, the target object in the image to be detected is rapidly extracted through the positional relationship between the target key points and the target object, which improves the extraction efficiency and accuracy of the target object and thus further improves the accuracy of target detection.
The object detection device provided by the embodiment of the application can be used for executing the object detection method provided by the embodiment, and has corresponding functions and beneficial effects.
An embodiment of the present application provides an object detection apparatus, referring to fig. 17, including: a processor 41, a memory 42, a communication device 43, an input device 44, and an output device 45. The object detection apparatus may include one or more processors and one or more memories. The processor, memory, communication device, input device, and output device of the object detection apparatus may be connected by a bus or in other ways.
The memory 42, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules corresponding to the object detection method described in any embodiment herein (e.g., the expansion module 31, the preprocessing module 32, and the object detection module 33 in the object detection apparatus). The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created through the use of the device. The memory may further include high-speed random access memory, and may also include nonvolatile memory, such as at least one magnetic disk storage device, flash memory device, or other nonvolatile solid-state storage device. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the device via a network; examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The communication device 43 is used for data transmission.
The processor 41 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory, thereby implementing the target detection method described above.
The input device 44 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the apparatus. The output device 45 may include a display device such as a display screen.
The object detection device provided by the above embodiment can be used for executing the object detection method provided by the above embodiment, and has corresponding functions and beneficial effects.
The embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform an object detection method comprising: performing zero pixel expansion on an image to be detected to expand the image to be detected into a square image; inputting the square image into a pre-trained size adjustment model to obtain a target image output by the size adjustment model, wherein the convolutional network in the size adjustment model is configured with a deformable convolution layer and a downsampling layer; and inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result.
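For concreteness, the zero pixel expansion step can be done with a single padding call. Padding on the right and bottom edges is an assumption; the text only requires that the result be square with the original content undistorted.

```python
import torch
import torch.nn.functional as F

def expand_to_square(image):
    """Pad a (C, H, W) image with zero pixels so that H == W,
    leaving the original content (and its aspect ratio) untouched."""
    _, h, w = image.shape
    side = max(h, w)
    # F.pad on the last two dims takes (left, right, top, bottom)
    return F.pad(image, (0, side - w, 0, side - h), value=0.0)

square = expand_to_square(torch.rand(3, 480, 640))  # -> (3, 640, 640)
```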
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; nonvolatile memory such as flash memory, magnetic media (e.g., a hard disk), or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a second, different computer system connected to the first computer system through a network such as the Internet. The second computer system may provide program instructions to the first computer for execution. The term "storage medium" may include two or more storage media residing in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, the storage medium containing computer-executable instructions provided in the embodiments of the present application is not limited to the target detection method described above; the instructions may also perform related operations in the target detection method provided in any embodiment of the present application.
The object detection device, apparatus, and storage medium provided in the foregoing embodiments may perform the object detection method provided in any embodiment of the present application; for technical details not described in detail above, reference may be made to the object detection method provided in any embodiment of the present application.
The foregoing description is only of the preferred embodiments of the present application and the technical principles employed. The present application is not limited to the specific embodiments described herein, but is capable of numerous obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the present application. Therefore, while the present application has been described in connection with the above embodiments, the present application is not limited to the above embodiments, but may include many other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the claims.

Claims (9)

1. A method of detecting an object, comprising:
zero pixel expansion is carried out on the image to be detected so as to expand the image to be detected into a square image;
inputting the square image into a pre-trained resizing model to obtain a target image output by the resizing model, which comprises: inputting the square image into a first convolution network to obtain a first feature image output by the first convolution network, wherein the first convolution network comprises two convolution layers; inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and one downsampling layer; and inputting the square image into a first downsampling layer and adding the output of the first downsampling layer to the second feature image to obtain the target image; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining a target object in the image to be detected according to the detection result.
2. The method of claim 1, wherein inputting the first feature image into a second convolution network to obtain a second feature image output by the second convolution network comprises:
inputting the first feature image into a second downsampling layer to obtain a third feature image output by the second downsampling layer;
inputting the third feature image into a first deformable convolution layer, and adding the input and the output of the first deformable convolution layer to obtain a fourth feature image;
and inputting the fourth feature image into a second deformable convolution layer to obtain the second feature image output by the second deformable convolution layer.
3. The method according to any one of claims 1-2, wherein inputting the target image into a pre-trained target detection model to obtain a detection result output by the target detection model, and determining the target object in the image to be detected according to the detection result, includes:
inputting the target image into the target detection model to obtain a heat map output by the target detection model, and determining key points in the heat map;
fitting a Gaussian function according to the key points and adjacent areas corresponding to the key points in the heat map, and determining the positions of the key points in the heat map according to the Gaussian function;
mapping the key point position to the image to be detected, determining a target key point corresponding to the key point position in the image to be detected, and determining a target object in the image to be detected according to the target key point.
4. The method of claim 3, wherein the object detection model comprises a backbone network and a deconvolution network;
correspondingly, the step of inputting the target image into the target detection model to obtain a heat map output by the target detection model includes:
inputting the target image into the backbone network to obtain feature data output by the backbone network, wherein the backbone network comprises a convolution layer and three bottleneck layers;
and inputting the feature data into the deconvolution network to obtain a heat map output by the deconvolution network, wherein the deconvolution network comprises an up-sampling layer and three convolution layers.
5. The method according to claim 3, wherein fitting a Gaussian function according to the key points and the adjacent areas corresponding to the key points in the heat map, and determining the key point positions in the heat map according to the Gaussian function, comprises:
determining pixel coordinates and score values of the adjacent areas of the key points according to the pixel coordinates of the key points;
performing Gaussian surface fitting according to the score value and the pixel coordinate of the key point and the pixel coordinate and the score value of the adjacent area to obtain the Gaussian function;
and determining the coordinates of the central point of the Gaussian function as the position of the key point.
6. The method according to claim 3, wherein the target detection method further comprises:
acquiring a plurality of training sample images and key point coordinates of corresponding marks, and expanding the training sample images into square sample images;
inputting the square sample image into an initial size adjustment model to obtain a target sample image output by the initial size adjustment model, and determining the mapping coordinates of the key point coordinates in the target sample image;
converting the target sample image into a corresponding Gaussian heat map label according to a preset Gaussian function and the mapping coordinates;
and inputting the target sample image into an initial target detection model, and adjusting parameters of the size adjustment model and the target detection model according to a Gaussian heat map and a Gaussian heat map label output by the initial target detection model.
7. An object detection device, comprising:
the expansion module is configured to perform zero pixel expansion on an image to be detected so as to expand the image to be detected into a square image;
a preprocessing module configured to input the square image into a pre-trained resizing model to obtain a target image output by the resizing model, wherein the preprocessing module comprises: the first feature extraction module is configured to input the square image into a first convolution network to obtain a first feature image output by the first convolution network, wherein the first convolution network comprises two convolution layers; the second feature extraction module is configured to input the first feature image into a second convolution network to obtain a second feature image output by the second convolution network, wherein the second convolution network comprises two deformable convolution layers and a downsampling layer; a first residual module configured to input the square image into a first downsampling layer, and add the output of the first downsampling layer to the second feature image to obtain the target image; wherein the convolutional network in the resizing model is configured with a deformable convolutional layer and a downsampling layer;
The target detection module is configured to input the target image into a pre-trained target detection model, obtain a detection result output by the target detection model, and determine a target object in the image to be detected according to the detection result.
8. An object detection apparatus, characterized by comprising:
a memory and one or more processors;
the memory is used for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the target detection method of any one of claims 1-6.
9. A storage medium containing computer executable instructions which, when executed by a computer processor, are for performing the object detection method according to any of claims 1-6.
CN202210158387.8A 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium Active CN114663671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210158387.8A CN114663671B (en) 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210158387.8A CN114663671B (en) 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114663671A CN114663671A (en) 2022-06-24
CN114663671B true CN114663671B (en) 2023-07-18

Family

ID=82028336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210158387.8A Active CN114663671B (en) 2022-02-21 2022-02-21 Target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114663671B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476159A (en) * 2020-04-07 2020-07-31 哈尔滨工业大学 Method and device for training and detecting detection model based on double-angle regression
CN111709307A (en) * 2020-05-22 2020-09-25 哈尔滨工业大学 Resolution enhancement-based remote sensing image small target detection method
CN112560980A (en) * 2020-12-24 2021-03-26 深圳市优必选科技股份有限公司 Training method and device of target detection model and terminal equipment
CN114022764A (en) * 2021-11-02 2022-02-08 湘潭大学 Remote sensing image transmission tower detection method based on feature-enhanced convolutional network

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102188014B1 (en) * 2016-05-12 2020-12-08 에이에스엠엘 네델란즈 비.브이. Identification of defects or hot spots by machine learning
CN110751214A (en) * 2019-10-21 2020-02-04 山东大学 Target detection method and system based on lightweight deformable convolution
CN111368684A (en) * 2020-02-27 2020-07-03 北华航天工业学院 Winter wheat automatic interpretation method based on deformable full-convolution neural network
CN111339963A (en) * 2020-02-28 2020-06-26 北京百度网讯科技有限公司 Human body image scoring method and device, electronic equipment and storage medium
CN111967399A (en) * 2020-08-19 2020-11-20 辽宁科技大学 Improved fast RCNN behavior identification method
CN112016560A (en) * 2020-08-27 2020-12-01 中国平安财产保险股份有限公司 Overlay text recognition method and device, electronic equipment and storage medium
CN113160136B (en) * 2021-03-15 2024-06-14 哈尔滨理工大学 Wood defect identification and segmentation method based on improved Mask R-CNN
CN113095336B (en) * 2021-04-22 2022-03-11 北京百度网讯科技有限公司 Method for training key point detection model and method for detecting key points of target object
CN113378813A (en) * 2021-05-28 2021-09-10 陕西大智慧医疗科技股份有限公司 Modeling and target detection method and device based on attention balance feature pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant