CN110633595B - Target detection method and device by utilizing bilinear interpolation - Google Patents

Info

Publication number
CN110633595B
Authority
CN
China
Prior art keywords
feature map
detection
feature
detection frame
same size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810641832.XA
Other languages
Chinese (zh)
Other versions
CN110633595A (en)
Inventor
张立成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201810641832.XA
Publication of CN110633595A
Application granted
Publication of CN110633595B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 - Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 - Determining position or orientation of objects or cameras using feature-based methods involving models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and device using bilinear interpolation, and relates to the field of computer technology. One embodiment of the method comprises: extracting features from the collected image to obtain a feature map of the image; generating a plurality of detection frames corresponding to each position of the feature map to obtain the feature map corresponding to each detection frame; down-sampling the feature maps corresponding to the detection frames using bilinear interpolation to obtain feature maps with the same size corresponding to the detection frames; and determining the position information of each detection target according to the feature maps with the same size corresponding to the detection frames. This embodiment makes the obtained position information of the detection target accurate, thereby improving detection accuracy, and reduces the computation required for detection, thereby meeting real-time application requirements.

Description

Target detection method and device by utilizing bilinear interpolation
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for target detection using bilinear interpolation.
Background
The target detection task is to locate each detection target on an image with a rectangular frame so as to obtain the position information of the detection target. Widely used target detection methods currently include RCNN (Region-based Convolutional Neural Network), Fast RCNN, and Faster RCNN.
In conventional methods such as Faster RCNN, after all the layers preceding ROI downsampling have been applied, the two diagonal coordinates of a detection frame (upper-left and lower-right corners) are usually not integers, so there is no feature exactly at those coordinates. ROI downsampling therefore rounds the two diagonal coordinates, performs 7 × 7 downsampling within the rounded coordinate range, and also determines the positions of the region boundary points by rounding when dividing the 7 × 7 regions. Because of this rounding, the detection frame and its regions deviate from the corresponding feature map. As a result, the detected rectangular frame deviates from the true rectangular frame fitted to the boundary of the detection target, the obtained position information of the detection target is inaccurate, and, since the feature map fed to classification and regression carries this deviation, detection accuracy is reduced.
In addition, existing methods use a VGG16 network (a convolutional neural network) to extract features from the image. VGG16 has very many parameters and a very large computation cost, making it difficult to meet real-time application requirements.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the existing method causes the position information of the detected target to be inaccurate, thereby reducing the accuracy of detection.
Disclosure of Invention
In view of this, embodiments of the present invention provide a target detection method and apparatus using bilinear interpolation, which make the obtained position information of the detection target more accurate, thereby improving detection accuracy, and which reduce the computation required for detection, thereby meeting real-time application requirements.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a target detection method using bilinear interpolation.
A target detection method using bilinear interpolation, comprising: extracting features of the collected image to obtain a feature map of the image; generating a plurality of detection frames corresponding to each position of the feature map of the image to obtain the feature map corresponding to each detection frame; utilizing bilinear interpolation to perform down-sampling on the feature maps corresponding to the detection frames to obtain feature maps with the same size corresponding to the detection frames; and determining the position information of each detection target according to the feature maps with the same size corresponding to each detection frame.
Optionally, the step of extracting features from the acquired image to obtain a feature map of the image includes: and extracting features of the acquired image by using a lightweight convolutional neural network to obtain a feature map of the image.
Optionally, the step of extracting features from the acquired image by using a lightweight convolutional neural network to obtain a feature map of the image includes: inputting the acquired image into a network model constructed based on the lightweight convolutional neural network; down-sampling a characteristic diagram output by a first activation layer of the network model to obtain a first characteristic diagram; the characteristic diagram output by a second activation layer of the network model is up-sampled to obtain a second characteristic diagram; and splicing the first feature map, the second feature map and a third feature map output by a third activation layer of the network model into the feature map of the image.
Optionally, the step of performing downsampling on the feature map corresponding to each detection frame by using bilinear interpolation to obtain the feature maps with the same size corresponding to each detection frame includes: averagely dividing the feature map corresponding to each detection frame into M1 × M2 areas, wherein M1 and M2 are used for representing the dimension of the feature map with the same size, and the dimension is M1 × M2; generating the characteristics of the boundary point of each area of each detection frame through bilinear interpolation; and selecting the feature corresponding to the maximum feature value from the features of each region of each detection frame, and obtaining a feature map with the same size corresponding to each detection frame according to the selected features, wherein the features of each region comprise the features of the boundary points of the region and the features of all pixel points in the region.
Optionally, the step of down-sampling the feature maps corresponding to the detection frames by using bilinear interpolation to obtain the feature maps with the same size corresponding to the detection frames includes: performing ROI (region of interest) downsampling on the feature maps corresponding to the detection frames to obtain first feature maps with the same size corresponding to the detection frames; averagely dividing the feature maps corresponding to the detection frames into M1 × M2 areas, wherein M1 and M2 are used for representing the dimension of the feature maps with the same size, and the dimension is M1 × M2; generating the characteristics of the boundary point of each area of each detection frame through bilinear interpolation; selecting the feature corresponding to the maximum feature value from the features of each region of each detection frame, and obtaining a second feature map with the same size corresponding to each detection frame according to the selected features, wherein the features of each region comprise the features of the boundary points of the region and the features of each pixel point in the region; and splicing the first feature map with the same size and the second feature map with the same size corresponding to each detection frame into the feature maps with the same size corresponding to each detection frame, wherein the dimensions of the second feature map with the same size and the first feature map with the same size are both M1 × M2.
Optionally, the step of generating features of boundary points of each region of each detection frame by bilinear interpolation includes: representing the coordinates of a boundary point of a region in a detection frame as (X/N, Y/N), where N is the factor by which the acquired image is larger than the feature map of the image, and generating the feature of the boundary point by bilinear interpolation as follows: finding the maximum integer A1 smaller than X/N and the minimum integer A2 larger than X/N, and the maximum integer B1 smaller than Y/N and the minimum integer B2 larger than Y/N; interpolating a first intermediate value from the feature values of the two points with coordinates (A1, B1) and (A1, B2); interpolating a second intermediate value from the feature values of the two points with coordinates (A2, B1) and (A2, B2); interpolating a third intermediate value from the first intermediate value and the second intermediate value, and taking the third intermediate value as the feature of the boundary point; and generating a feature in this way for each boundary point of each region of each detection frame, thereby generating the features of the boundary points of each region of each detection frame.
According to another aspect of an embodiment of the present invention, there is provided an object detection apparatus using bilinear interpolation.
An object detection apparatus using bilinear interpolation, comprising: the characteristic extraction module is used for extracting characteristics of the acquired image to obtain a characteristic diagram of the image; the detection frame generation module is used for generating a plurality of detection frames corresponding to each position of the feature map of the image so as to obtain the feature map corresponding to each detection frame; the down-sampling module is used for down-sampling the characteristic graphs corresponding to the detection frames by utilizing bilinear interpolation to obtain the characteristic graphs with the same size corresponding to the detection frames; and the detection module is used for determining the position information of each detection target according to the feature maps with the same size corresponding to each detection frame.
Optionally, the feature extraction module is further configured to: and extracting features of the acquired image by using a lightweight convolutional neural network to obtain a feature map of the image.
Optionally, the feature extraction module is further configured to: inputting the acquired image into a network model constructed based on the lightweight convolutional neural network; down-sampling a characteristic diagram output by a first activation layer of the network model to obtain a first characteristic diagram; the characteristic diagram output by a second activation layer of the network model is up-sampled to obtain a second characteristic diagram; and splicing the first feature map, the second feature map and a third feature map output by a third activation layer of the network model into the feature map of the image.
Optionally, the down-sampling module is further configured to: averagely dividing the feature maps corresponding to the detection frames into M1 × M2 areas, wherein M1 and M2 are used for representing the dimension of the feature maps with the same size, and the dimension is M1 × M2; generating the characteristics of the boundary point of each area of each detection frame through bilinear interpolation; and selecting the feature corresponding to the maximum feature value from the features of each region of each detection frame, and obtaining a feature map with the same size corresponding to each detection frame according to the selected features, wherein the features of each region comprise the features of the boundary point of the region and the features of each pixel point in the region.
Optionally, the down-sampling module is further configured to: performing ROI (region of interest) downsampling on the feature maps corresponding to the detection frames to obtain first feature maps with the same size corresponding to the detection frames; averagely dividing the feature map corresponding to each detection frame into M1 × M2 areas, wherein M1 and M2 are used for representing the dimension of the feature map with the same size, and the dimension is M1 × M2; generating the characteristics of the boundary point of each area of each detection frame through bilinear interpolation; selecting the feature corresponding to the maximum feature value from the features of each region of each detection frame, and obtaining a second feature map with the same size corresponding to each detection frame according to the selected features, wherein the features of each region comprise the features of the boundary points of the region and the features of each pixel point in the region; and splicing the first feature map with the same size and the second feature map with the same size corresponding to each detection frame into the feature maps with the same size corresponding to each detection frame, wherein the dimensions of the second feature map with the same size and the first feature map with the same size are both M1 × M2.
Optionally, the down-sampling module includes a boundary point feature generation sub-module, configured to: represent the coordinates of a boundary point of a region in a detection frame as (X/N, Y/N), where N is the factor by which the acquired image is larger than the feature map of the image, and generate the feature of the boundary point by bilinear interpolation as follows: find the maximum integer A1 smaller than X/N and the minimum integer A2 larger than X/N, and the maximum integer B1 smaller than Y/N and the minimum integer B2 larger than Y/N; interpolate a first intermediate value from the feature values of the two points with coordinates (A1, B1) and (A1, B2); interpolate a second intermediate value from the feature values of the two points with coordinates (A2, B1) and (A2, B2); interpolate a third intermediate value from the first intermediate value and the second intermediate value, and take the third intermediate value as the feature of the boundary point; and generate a feature in this way for each boundary point of each region of each detection frame, thereby generating the features of the boundary points of each region of each detection frame.
According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.
An electronic device, comprising: one or more processors; and a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the target detection method using bilinear interpolation provided by the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the object detection method using bilinear interpolation provided by the present invention.
One embodiment of the above invention has the following advantages or benefits: the feature maps corresponding to the detection frames are down-sampled using bilinear interpolation to obtain feature maps with the same size corresponding to the detection frames, and the position information of each detection target is determined according to these feature maps with the same size, so the obtained position information of the detection target is more accurate and detection accuracy is improved. In addition, extracting features from the acquired image with a lightweight convolutional neural network to obtain the feature map of the image reduces the computation required for detection, thereby meeting real-time application requirements.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a target detection method using bilinear interpolation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main flow of a target detection method using bilinear interpolation according to an embodiment of the present invention;
FIG. 3 is a schematic main flow chart of a target detection method using bilinear interpolation according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of the main components of an object detection model according to one embodiment of the invention;
FIG. 5 is a schematic diagram of the main blocks of an object detection apparatus using bilinear interpolation according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of main steps of a target detection method using bilinear interpolation according to an embodiment of the present invention.
As shown in fig. 1, the target detection method using bilinear interpolation according to an embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: features are extracted from the acquired image to obtain a feature map of the image.
Features can be extracted from the acquired image using a lightweight convolutional neural network to obtain the feature map of the image. Alternatively, a non-lightweight convolutional neural network such as VGG16 can be used to extract features from the collected image.
Step S102: a plurality of detection frames are generated corresponding to each position of the feature map of the image, so as to obtain the feature map corresponding to each detection frame.
Step S103: the feature maps corresponding to the detection frames are down-sampled using bilinear interpolation to obtain the feature maps with the same size corresponding to the detection frames.
In the embodiment of the present invention, the feature map corresponding to each detection frame is downsampled by using bilinear interpolation, which may be downsampling based on bilinear interpolation, or downsampling combining ROI downsampling and downsampling based on bilinear interpolation, and the two implementation manners will be described in detail in the specific embodiments in fig. 2 and 3 below.
Step S104: the position information of each detection target is determined according to the feature maps with the same size corresponding to the detection frames.
The embodiment of the invention makes the obtained position information of the detection target more accurate, thereby improving detection accuracy. In addition, if a lightweight convolutional neural network is used to extract features from the acquired image in step S101, the computation required for detection can be reduced, thereby meeting real-time application requirements.
Fig. 2 is a schematic main flow chart of a target detection method using bilinear interpolation according to an embodiment of the present invention.
Step S201: features are extracted from the acquired image using a lightweight convolutional neural network to obtain a feature map of the image.
The lightweight convolutional neural network may be a MobileNet network.
Step S201 specifically comprises: inputting the collected image into a network model constructed based on a lightweight convolutional neural network; down-sampling the feature map output by a first activation layer of the network model to obtain a first feature map; up-sampling the feature map output by a second activation layer of the network model to obtain a second feature map; and stitching the first feature map, the second feature map, and a third feature map output by a third activation layer of the network model into the feature map of the image.
Taking the MobileNet network as an example, the network model can be constructed as follows: first, all layers of MobileNet up to and including relu6/sep are retained; a down-sampling layer is connected after the relu4_1/sep layer to down-sample the feature map output by relu4_1/sep; an up-sampling layer is connected after the relu6/sep layer to up-sample the feature map output by relu6/sep; and the down-sampling layer, the up-sampling layer, and the relu5_5/sep layer of MobileNet are all connected to a connection layer, which stitches the feature maps output by the relu4_1/sep, relu6/sep, and relu5_5/sep layers into the feature map of the acquired image. The relu4_1/sep, relu6/sep, and relu5_5/sep layers are all activation layers: relu4_1/sep is the first activation layer, relu6/sep is the second activation layer, and relu5_5/sep is the third activation layer.
The feature map output by the relu4_1/sep layer is one eighth the size of the original image (i.e., the captured image) and becomes one sixteenth after down-sampling. The feature map output by the relu6/sep layer is one thirty-second the size of the original image and becomes one sixteenth after up-sampling. The feature map output by the relu5_5/sep layer is one sixteenth the size of the original image. Therefore, the stitched feature map of the image is also one sixteenth the size of the original image.
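As a rough illustration, the following PyTorch sketch brings the three activation-layer outputs to a common one-sixteenth scale and stitches them. The concrete layer types (max pooling for the down-sampling layer, a stride-2 transposed convolution for the up-sampling layer) and the channel count are assumptions for illustration; the patent does not fix these choices.

```python
import torch
import torch.nn as nn

class StitchedFeatures(nn.Module):
    """Hypothetical stitching head: c4 = relu4_1/sep output (1/8 scale),
    c5 = relu5_5/sep output (1/16), c6 = relu6/sep output (1/32)."""
    def __init__(self, ch6=1024):
        super().__init__()
        self.down = nn.MaxPool2d(kernel_size=2, stride=2)                 # 1/8 -> 1/16
        self.up = nn.ConvTranspose2d(ch6, ch6, kernel_size=2, stride=2)  # 1/32 -> 1/16

    def forward(self, c4, c5, c6):
        f1 = self.down(c4)                     # first feature map
        f2 = self.up(c6)                       # second feature map
        return torch.cat([f1, c5, f2], dim=1)  # stitched feature map of the image
```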
The embodiment of the invention can also construct the network model based on other lightweight convolutional neural networks, such as a ResNet network (deep residual network).
For example, a network model can be constructed from a ResNet50 network (a ResNet with a depth of 50): all layers up to res5c_relu are retained; a down-sampling layer is connected after the res3d_relu layer to down-sample the feature map output by res3d_relu; an up-sampling layer (i.e., a deconvolution layer) is connected after the res5c_relu layer to up-sample the feature map output by res5c_relu; and the feature maps output by the res3d_relu, res5c_relu, and res4f_relu layers are then stitched into the feature map of the acquired image. Accordingly, res3d_relu is the first activation layer, res5c_relu is the second activation layer, and res4f_relu is the third activation layer.
By adopting the light-weight convolutional neural network to extract the characteristics of the image, the calculation amount of detection can be reduced, and thus the real-time application requirement is met. Moreover, the characteristic diagram of the acquired image obtained by the network model constructed by the embodiment of the invention can be fused with multilayer characteristic information, so that the final detection result is more accurate.
Step S202: a plurality of detection frames are generated corresponding to each position of the feature map of the image, so as to obtain the feature map corresponding to each detection frame.
Here, the process of generating the detection frames is the process of determining their positions and judging whether each detection frame belongs to the foreground or the background. Specifically, features can be extracted from the stitched feature map of the image by a convolution layer, after which a classification layer and a regression layer are connected: the classification layer judges whether a detection frame belongs to the foreground or the background, and the regression layer corrects the position of the detection frame. The feature map within the range of each detection frame is the feature map corresponding to that detection frame.
Nine detection frames can be generated at each position of the feature map of the image: three scales (8, 16, and 32) combined with three aspect ratios (0.5, 1, and 2) give 3 × 3 = 9 detection frames centered on that position.
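A minimal NumPy sketch of this anchor generation follows. The interpretation of "scale" (multiplying a base size of 16, the feature stride) follows the common Faster RCNN convention and is an assumption, not something the text fixes.

```python
import numpy as np

def detection_frames_at(cx, cy, scales=(8, 16, 32), ratios=(0.5, 1, 2), base=16):
    """Return the 9 detection frames (x1, y1, x2, y2) centered at (cx, cy)."""
    frames = []
    for s in scales:
        for r in ratios:                 # r is the aspect ratio h / w
            w = base * s / np.sqrt(r)    # keeps w * h == (base * s) ** 2 for every r
            h = base * s * np.sqrt(r)
            frames.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(frames)              # shape (9, 4): 3 scales x 3 aspect ratios
```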
After a plurality of detection frames are generated, in the embodiment of the present invention, a part of detection frames with high confidence may be selected from the generated detection frames as candidate detection frames (referred to as candidate frames for short), and the feature map corresponding to the candidate frames is used as the feature map corresponding to each detection frame obtained in step S202. The detection frames with high confidence are the detection frames output by the classification layer and having a high probability of belonging to the foreground (for example, greater than a certain threshold).
Step S203: the feature map corresponding to each detection frame is down-sampled based on bilinear interpolation to obtain the feature maps with the same size corresponding to the detection frames.
Step S203 specifically includes: evenly dividing the feature map corresponding to each detection frame into M1 × M2 regions, where M1 and M2 represent the dimensions of the feature maps with the same size, namely M1 × M2; generating the features of the boundary points of each region of each detection frame by bilinear interpolation; and selecting the feature with the maximum feature value from the features of each region of each detection frame, and obtaining the feature map with the same size corresponding to each detection frame according to the selected features.
The characteristics of each region comprise the characteristics of the boundary point of the region and the characteristics of each pixel point in the region. In the embodiment of the present invention, M1 and M2 are both 7, that is, the feature map corresponding to each detection frame is equally divided into 7 × 7 regions, and the regions are distributed according to 7 × 7 (7 rows and 7 columns) in the feature map, so that after one feature is selected from each region of the detection frame, the feature map obtained according to each selected feature is also 7 × 7 in size.
The step of generating the feature of the boundary point of each region of each detection frame through bilinear interpolation may specifically include:
representing the coordinates of a boundary point of a region in a detection frame as (X/N, Y/N), where N is the factor by which the acquired image is larger than the feature map of the image, the feature of the boundary point is generated by bilinear interpolation as follows:
finding out a maximum integer A1 smaller than X/N and a minimum integer A2 larger than X/N, a maximum integer B1 smaller than Y/N and a minimum integer B2 larger than Y/N; according to the characteristic values of two points with coordinates of (A1, B1) and (A1, B2), interpolating to generate a first intermediate value; interpolating to generate a second intermediate value according to the characteristic values of the two points with the coordinates of (A2, B1) and (A2, B2); according to the first intermediate value and the second intermediate value, a third intermediate value is generated through interpolation, and the third intermediate value is used as the characteristic of the boundary point;
the above-described process generates respective features for each boundary point of each region of each detection frame, thereby generating the features of the boundary points of each region of each detection frame.
Since the feature map of the image obtained by stitching in step S201 is one sixteenth the size of the original image, N = 16 is used.
The boundary points of each region may be the four vertices of the region, or may be the four vertices of the region and several points between the two vertices. Since the region boundary can be distinguished by four vertices, the boundary point in the embodiment of the present invention may include only four vertices of the region.
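A minimal NumPy sketch of this bilinear-interpolation-based down-sampling (step S203) follows, assuming the boundary points are just the four region vertices, that coordinates are non-integers, and that they stay inside the feature map; border and exactly-integer cases are omitted for brevity.

```python
import numpy as np

def boundary_point_feature(fmap, x, y):
    """Interpolate a feature at fractional (x, y); fmap is indexed [row, col] = [y, x]."""
    a1, b1 = int(np.floor(x)), int(np.floor(y))   # A1 < x, B1 < y (assumes non-integer)
    a2, b2 = a1 + 1, b1 + 1                       # A2 > x, B2 > y
    ty, tx = y - b1, x - a1
    v1 = (1 - ty) * fmap[b1, a1] + ty * fmap[b2, a1]   # first intermediate value
    v2 = (1 - ty) * fmap[b1, a2] + ty * fmap[b2, a2]   # second intermediate value
    return (1 - tx) * v1 + tx * v2                     # third value = the feature

def bilinear_downsample(fmap, frame, M1=7, M2=7, N=16):
    """M1 x M2 feature map with the same size for one detection frame, no rounding."""
    x1, y1, x2, y2 = (c / N for c in frame)       # frame coords on the feature map
    xs, ys = np.linspace(x1, x2, M2 + 1), np.linspace(y1, y2, M1 + 1)
    out = np.zeros((M1, M2))
    for i in range(M1):
        for j in range(M2):
            feats = [boundary_point_feature(fmap, x, y)     # four region vertices
                     for x in (xs[j], xs[j + 1]) for y in (ys[i], ys[i + 1])]
            rows = np.arange(int(np.ceil(ys[i])), int(np.floor(ys[i + 1])) + 1)
            cols = np.arange(int(np.ceil(xs[j])), int(np.floor(xs[j + 1])) + 1)
            if rows.size and cols.size:                     # interior pixel features
                feats += list(fmap[np.ix_(rows, cols)].ravel())
            out[i, j] = max(feats)                          # keep the maximum feature
    return out
```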
Down-sampling the feature maps corresponding to the detection frames based on bilinear interpolation to obtain feature maps with the same size, and then determining the position information of the detection target from those feature maps, avoids the existing rounding processing. The detected rectangular frame of a target such as a pedestrian or vehicle therefore no longer deviates from the true rectangular frame fitted to its boundary, and the feature map used for final classification and regression carries no deviation either, so the obtained position information of the detection target is accurate and detection accuracy is improved.
Step S204: the position information of each detection target is determined according to the feature maps with the same size corresponding to the detection frames.
Specifically, the features of the feature maps with the same size corresponding to the detection frames can be extracted by two cascaded fully connected layers, with a classification layer and a regression layer connected after the second of the two. The classification layer judges the category corresponding to each detection frame, for example person, car, rider, or background, and the regression layer outputs the precise position of each detection frame.
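A PyTorch sketch of such a head follows, under stated assumptions: 512 nodes per fully connected layer (as in the model description later in this document), K foreground categories, and pooled inputs of the given channel count and size. It is illustrative, not the patent's exact network.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, channels, K=3, M1=7, M2=7):
        super().__init__()
        self.fc = nn.Sequential(                   # two cascaded fully connected layers
            nn.Linear(channels * M1 * M2, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.cls = nn.Linear(512, K + 1)           # K categories + background
        self.reg = nn.Linear(512, 4 * K)           # refined box per category

    def forward(self, pooled):                     # pooled: (boxes, channels, M1, M2)
        h = self.fc(pooled.flatten(1))
        return self.cls(h), self.reg(h)
```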
Fig. 3 is a schematic diagram of a main flowchart of a target detection method using bilinear interpolation according to another embodiment of the present invention.
As shown in fig. 3, the target detection method using bilinear interpolation according to the embodiment of the present invention mainly includes the following steps S301 to S304.
Step S301: features are extracted from the acquired image using a lightweight convolutional neural network to obtain a feature map of the image.
Step S302: a plurality of detection frames are generated corresponding to each position of the feature map of the image, so as to obtain the feature map corresponding to each detection frame.
Step S303: ROI (region of interest) downsampling and bilinear-interpolation-based downsampling are respectively performed on the feature map corresponding to each detection frame to obtain the feature map with the same size corresponding to each detection frame.
Step S303 specifically includes: performing ROI (region of interest) downsampling on the feature map corresponding to each detection frame to obtain a first feature map with the same size corresponding to each detection frame; carrying out bilinear interpolation-based down-sampling on the feature map corresponding to each detection frame to obtain a second feature map with the same size corresponding to each detection frame; and splicing the first characteristic diagram with the same size and the second characteristic diagram with the same size corresponding to each detection frame into the characteristic diagram with the same size corresponding to each detection frame.
The dimensions of both the second feature map with the same size and the first feature map with the same size are M1 × M2, which is also the dimension of the finally stitched feature maps with the same size corresponding to the detection frames. The first and second feature maps with the same size corresponding to each detection frame are stitched along the channel dimension: that is, if the feature map corresponding to each detection frame has M0 channels before down-sampling, the stitched feature map with the same size corresponding to each detection frame has 2M0 channels.
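A short illustration of this channel-wise stitching, with placeholder sizes:

```python
import torch

num_boxes, M0, M1, M2 = 300, 256, 7, 7               # illustrative sizes
first = torch.randn(num_boxes, M0, M1, M2)           # ROI-downsampled maps
second = torch.randn(num_boxes, M0, M1, M2)          # bilinear-downsampled maps
stitched = torch.cat([first, second], dim=1)         # stitch along the channel dimension
assert stitched.shape == (num_boxes, 2 * M0, M1, M2)
```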
Performing bilinear interpolation-based downsampling on the feature map corresponding to each detection frame to obtain a second feature map with the same size corresponding to each detection frame, which may specifically include: averagely dividing the characteristic diagram corresponding to each detection frame into M1 × M2 areas; generating the characteristics of the boundary point of each area of each detection frame through bilinear interpolation; and selecting the feature corresponding to the maximum feature value from the features of each region of each detection frame, and obtaining a second feature map with the same size corresponding to each detection frame according to the selected features.
The method for generating the feature of the boundary point of each region of each detection frame through bilinear interpolation is described in detail in step S203 above, and is not described again.
The following describes a process of performing ROI downsampling on the feature map corresponding to each detection frame to obtain a first feature map with the same size corresponding to each detection frame.
For any detection frame on the original image with upper-left coordinates (X1, Y1) and lower-right coordinates (X2, Y2): the feature map produced by the processing before step S303 is one sixteenth the size of the original image, so the coordinates of the detection frame on the feature map become (X1/16, Y1/16) for the upper-left corner and (X2/16, Y2/16) for the lower-right corner. Since X1, Y1, X2, and Y2 are usually not multiples of 16, these new coordinates are often non-integers, whereas the positions of pixel points on the feature map are all integers. ROI downsampling therefore rounds any non-integer among X1/16, Y1/16, X2/16, and Y2/16, then performs M1 × M2 downsampling within the rounded coordinate range, i.e., divides it evenly into M1 × M2 regions (whose size varies with the size of the detection frame), with the positions of the region boundary points also determined by rounding. Finally, the maximum feature value is selected within each region, giving a new M1 × M2 feature map representing the corresponding detection frame. The features of each region comprise the features of the boundary points of the region and the features of the pixel points within the region.
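For contrast with the bilinear version sketched earlier, here is a minimal NumPy sketch of this rounding-based ROI downsampling; the exact rounding and edge handling of the original method are assumptions.

```python
import numpy as np

def roi_downsample(fmap, frame, M1=7, M2=7, N=16):
    """Classic ROI downsampling: round corners and region bounds, then take maxima."""
    x1, y1, x2, y2 = (int(round(c / N)) for c in frame)           # rounded corner coords
    ys = np.round(np.linspace(y1, y2 + 1, M1 + 1)).astype(int)    # rounded row bounds
    xs = np.round(np.linspace(x1, x2 + 1, M2 + 1)).astype(int)    # rounded column bounds
    out = np.zeros((M1, M2))
    for i in range(M1):
        for j in range(M2):
            r2 = max(ys[i + 1], ys[i] + 1)                        # keep regions non-empty
            c2 = max(xs[j + 1], xs[j] + 1)
            out[i, j] = fmap[ys[i]:r2, xs[j]:c2].max()            # maximum feature value
    return out
```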
The features obtained by ROI downsampling are also helpful for the subsequent classification step. Performing both ROI downsampling and bilinear-interpolation-based downsampling on the feature map corresponding to each detection frame to obtain the feature maps with the same size, and determining the position information of the detection target according to those feature maps, therefore reduces the influence of the existing ROI downsampling on detection accuracy and makes the information in the feature maps richer, thereby improving the accuracy of target detection.
Step S304: the position information of each detection target is determined according to the feature maps with the same size corresponding to the detection frames.
The specific contents of step S301, step S302 and step S304 refer to the descriptions of step S201, step S202 and step S204.
In another implementation of the present invention, the down-sampling in step S203 and step S303 of the embodiments shown in fig. 2 and fig. 3 may be replaced by performing only ROI downsampling on the feature maps corresponding to the detection frames to obtain the feature maps with the same size; this still achieves the above technical effects of reducing the computation of detection and meeting real-time application requirements.
The present invention further provides a target detection model, and the target detection method using bilinear interpolation according to the embodiment of the present invention may be implemented in the target detection model to implement target detection.
Fig. 4 is a main configuration diagram of an object detection model according to an embodiment of the present invention.
The object detection model shown in fig. 4 is built on the Faster RCNN framework. Although the existing Faster RCNN framework performs well, it uses a VGG16 network for feature extraction, and VGG16 has very many parameters and a large computation cost, making it difficult to meet real-time application requirements. MobileNet is a convolutional neural network designed by Google researchers with a very small computation cost; it has a high frame rate, i.e., it can process many frames of images per second, and on the same platform it can process even more frames per second than SqueezeNet. The embodiment of the invention combines the very lightweight MobileNet with the Faster RCNN framework to construct a new target detection model that can be used for real-time detection of pedestrians, vehicles, and other targets.
As shown in fig. 4, the target detection model according to an embodiment of the present invention comprises a network model based on the MobileNet structure, a convolution layer, a first classification layer, a first regression layer, an ROI down-sampling layer, a bilinear-interpolation-based down-sampling layer, a connection layer, fully connected layers, a second classification layer, and a second regression layer, where the convolution layer, the first classification layer, and the first regression layer are layers of a region generation network. The detection targets of this embodiment can be persons, cars, riders, and the like.
After MobileNet's multiple down-sampling layers, the feature map output by relu6/sep is one thirty-second the size of the original image, whereas in the Faster RCNN framework the feature map input to the region generation network (comprising the convolution layer, the first classification layer, and the first regression layer) and to the ROI down-sampling layer is one sixteenth the size of the original image (i.e., the collected image). The embodiment of the present invention therefore modifies the structure of MobileNet to construct a new network model; since the method for constructing a network model based on the MobileNet network has been introduced above, it is not repeated here.
The network model based on the MobileNet structure not only meets the requirements of the feature extraction part of the Faster RCNN framework but also obtains rich feature information, which can improve the subsequent classification and regression effects, thereby improving the accuracy of the final detection result.
A region generation network (corresponding to the dashed box shown in fig. 4) is connected after the network model constructed based on MobileNet. The region generation network comprises a convolution layer, a first classification layer, and a first regression layer, with the convolution layer connected to both the first classification layer and the first regression layer. The convolution layer extracts features from the stitched feature map of the image, the first classification layer judges whether a detection frame belongs to the foreground or the background, and the first regression layer corrects the position of the detection frame.
The detection frames output by the region generation network and the feature map of the collected image output by the MobileNet-based network model are input into the ROI down-sampling layer and the bilinear-interpolation-based down-sampling layer, respectively, for ROI downsampling and bilinear-interpolation-based downsampling. Finally, the feature maps output by the two down-sampling layers (i.e., the first feature map with the same size and the second feature map with the same size) are stitched by the connection layer to obtain the feature maps with the same size corresponding to the detection frames.
The cascaded fully connected layers are formed by connecting two fully connected layers, with the number of nodes in each reduced from the original 4096 to 512. Because the number of nodes in the fully connected layers is cut, the computation is reduced, so the target detection model can be used for real-time target detection.
The second classification layer judges the category corresponding to each detection frame, i.e., person, car, rider, or background, and the second regression layer determines the precise position of each detection frame.
Before detecting targets with the target detection model of the embodiment of the present invention, the model needs to be trained. In the training stage, the back-propagation algorithm is used for model learning, and stochastic gradient descent is used to learn the model parameters. Specifically, the ground-truth labels of the training samples are annotated before training. At each training step, for the first classification layer and the first regression layer, the classification cost and regression cost are computed from the annotated ground truth (foreground or background, and the position information of the detection frame) and the outputs of those layers; for the second classification layer and the second regression layer, the classification cost and regression cost are computed from the annotated ground truth (the specific category of each detection target, such as person, car, or rider, or background, and the specific position of each detection target) and the outputs of those layers. The total loss (the total cost, comprising the classification costs and regression costs) is continually reduced, finally yielding more accurate outputs from the classification layers (the first and second classification layers) and the regression layers (the first and second regression layers). Gradient descent reduces the loss by repeatedly moving the parameters in the direction opposite to the gradient at the current point; with stochastic gradient descent, only one training sample (or a small batch) is used to compute each gradient step.
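As an illustration only, one stochastic-gradient-descent update on the total cost could look like the following PyTorch sketch; the loss terms, model, and learning rate are placeholders, not the patent's actual training configuration.

```python
import torch

def sgd_step(model, cls_cost, reg_cost, lr=1e-3):
    """One SGD step on the total cost computed from a single sample (or small batch)."""
    total_loss = cls_cost + reg_cost        # total cost = classification + regression cost
    model.zero_grad()
    total_loss.backward()                   # back propagation
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad            # move against the gradient to reduce the loss
    return float(total_loss)
```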
It should be noted that the target detection model according to the embodiment of the present invention may also be constructed on the basis of the Fast RCNN framework.
Fig. 5 is a schematic diagram of the main blocks of an object detection apparatus using bilinear interpolation according to an embodiment of the present invention.
As shown in fig. 5, the target detection apparatus 500 using bilinear interpolation according to the embodiment of the present invention mainly includes: the device comprises a feature extraction module 501, a detection frame generation module 502, a down-sampling module 503 and a detection module 504.
The feature extraction module 501 is configured to extract features from the acquired image to obtain a feature map of the image.
In one embodiment, the feature extraction module 501 may be specifically configured to: and (3) extracting features of the acquired image by using a lightweight convolutional neural network to obtain a feature map of the image.
Specifically, inputting an acquired image into a network model constructed based on a lightweight convolutional neural network;
down-sampling a feature map output by a first activation layer of the network model to obtain a first feature map;
the characteristic diagram output by a second activation layer of the network model is up-sampled to obtain a second characteristic diagram;
and splicing the first feature map, the second feature map and a third feature map output by a third activation layer of the network model into a feature map of the image.
The detection frame generating module 502 is configured to generate a plurality of detection frames corresponding to each position of the feature map of the image, so as to obtain a feature map corresponding to each detection frame.
The down-sampling module 503 is configured to down-sample the feature maps corresponding to the detection frames by using bilinear interpolation to obtain feature maps with the same size corresponding to the detection frames.
In one embodiment, the down-sampling module 503 may be specifically configured to:
evenly dividing the feature maps corresponding to the detection frames into M1 × M2 regions, where M1 and M2 represent the dimensions of the feature maps with the same size, namely M1 × M2; generating the features of the boundary points of each region of each detection frame by bilinear interpolation; and selecting the feature with the maximum feature value from the features of each region of each detection frame, and obtaining the feature maps with the same size corresponding to the detection frames according to the selected features.
The characteristics of each region comprise the characteristics of boundary points of the region and the characteristics of all pixel points in the region.
In another embodiment, the down-sampling module 503 may be specifically configured to:
performing ROI (region of interest) downsampling on the feature maps corresponding to the detection frames to obtain first feature maps with the same size corresponding to the detection frames;
evenly dividing the feature map corresponding to each detection frame into M1 × M2 regions, where M1 and M2 represent the dimensions of the feature map with the same size corresponding to each detection frame, namely M1 × M2; generating the features of the boundary points of each region of each detection frame by bilinear interpolation; selecting the feature with the maximum feature value from the features of each region of each detection frame, and obtaining a second feature map with the same size corresponding to each detection frame according to the selected features, where the features of each region comprise the features of the boundary points of the region and the features of the pixel points within the region;
and stitching the first feature maps with the same size and the second feature maps with the same size corresponding to the detection frames into the feature maps with the same size corresponding to the detection frames, where the dimensions of both the second and first feature maps with the same size are M1 × M2.
The downsampling module 503 may include a boundary point feature generation submodule for:
representing the coordinates of a boundary point of a region in a detection frame as (X/N, Y/N), where N is the factor by which the acquired image is larger than the feature map of the image, and generating the feature of the boundary point by bilinear interpolation as follows:
finding out a maximum integer A1 smaller than X/N and a minimum integer A2 larger than X/N, a maximum integer B1 smaller than Y/N and a minimum integer B2 larger than Y/N;
according to the characteristic values of two points with coordinates of (A1, B1) and (A1, B2), interpolating to generate a first intermediate value;
interpolating to generate a second intermediate value according to the characteristic values of the two points with the coordinates of (A2, B1) and (A2, B2);
according to the first intermediate value and the second intermediate value, a third intermediate value is generated through interpolation, and the third intermediate value is used as the characteristic of the boundary point;
the above process generates features for each boundary point of each region of each detection frame, thereby generating features of the boundary points of each region of each detection frame.
The detection module 504 is configured to determine the position information of each detection target according to the feature maps with the same size corresponding to each detection frame.
In addition, the specific implementation contents of the object detection device using bilinear interpolation in the embodiment of the present invention have been described in detail in the above object detection method using bilinear interpolation, and therefore, the repeated contents will not be described here.
Fig. 6 illustrates an exemplary system architecture 600 of a target detection method using bilinear interpolation or a target detection apparatus using bilinear interpolation to which embodiments of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves as a medium for providing communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides various services, for example, a backend management server that supports shopping websites browsed by users of the terminal devices 601, 602, and 603. The backend management server may analyze and otherwise process received data such as an information query request, and feed back a processing result (e.g., pushed information) to the terminal device.
It should be noted that the target detection method using bilinear interpolation provided in the embodiment of the present invention may be executed by the server 605 or the terminal devices 601, 602, and 603, and accordingly, a target detection apparatus using bilinear interpolation may be disposed in the server 605 or the terminal devices 601, 602, and 603.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing a terminal device or server of an embodiment of the present application. The terminal device or the server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that the computer program read out therefrom is mounted in the storage section 708 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When executed by the Central Processing Unit (CPU) 701, the computer program performs the above-described functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a feature extraction module 501, a detection box generation module 502, a downsampling module 503, a detection module 504. The names of these modules do not constitute a limitation to the module itself in some cases, for example, the feature extraction module 501 may also be described as a "module for extracting features from an acquired image to obtain a feature map of the image".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the device, cause the device to: extract features from the acquired image to obtain a feature map of the image; generate a plurality of detection frames corresponding to each position of the feature map of the image to obtain the feature map corresponding to each detection frame; downsample the feature maps corresponding to the detection frames by bilinear interpolation to obtain feature maps of the same size corresponding to the detection frames; and determine the position information of each detection target according to the feature maps of the same size corresponding to the detection frames.
According to the technical scheme of the embodiment of the invention, the feature maps corresponding to the detection frames are down-sampled by bilinear interpolation to obtain feature maps of the same size corresponding to the detection frames, and the position information of each detection target is determined according to these same-size feature maps. The position information of the detection targets obtained in this way is more accurate, so the detection accuracy is improved. In addition, extracting features from the acquired image with a lightweight convolutional neural network to obtain its feature map reduces the computational load of detection, thereby meeting real-time application requirements.
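Claims 3 and 7 below detail the multi-scale side of this feature extraction: one activation layer's output is down-sampled, another is up-sampled, and both are spliced with a third. The following is a minimal single-channel NumPy sketch under assumed 2x scale factors between the layers; the function name and the max/nearest-neighbour resampling choices are assumptions, not the patent's prescription.

    import numpy as np

    def fuse_activation_maps(f1, f2, f3):
        # f1: first activation layer output, assumed twice the spatial
        #     size of f3, reduced here by 2x2 max down-sampling.
        # f2: second activation layer output, assumed half the spatial
        #     size of f3, enlarged here by 2x nearest-neighbour
        #     up-sampling.
        # f3: third activation layer output, used as-is.
        h, w = f1.shape
        down = f1.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
        up = np.repeat(np.repeat(f2, 2, axis=0), 2, axis=1)
        # Splice the three same-size maps along the channel axis.
        return np.stack([down, up, f3])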
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A target detection method using bilinear interpolation, comprising:
extracting features from the acquired image to obtain a feature map of the image;
generating a plurality of detection frames corresponding to each position of the feature map of the image to obtain the feature map corresponding to each detection frame;
performing ROI (region of interest) downsampling on the feature maps corresponding to the detection frames to obtain first feature maps of the same size corresponding to the detection frames; evenly dividing the feature map corresponding to each detection frame into M1 × M2 regions, wherein M1 and M2 denote the dimensions of the feature maps of the same size, the dimensions being M1 × M2; generating the features of the boundary points of each region of each detection frame through bilinear interpolation; selecting the feature with the maximum feature value from the features of each region of each detection frame, and obtaining a second feature map of the same size corresponding to each detection frame from the selected features, wherein the features of each region comprise the features of the boundary points of the region and the features of each pixel point within the region; and splicing the first feature map of the same size and the second feature map of the same size corresponding to each detection frame into the feature map of the same size corresponding to that detection frame, wherein the dimensions of both the second feature map of the same size and the first feature map of the same size are M1 × M2;
and determining the position information of each detection target according to the feature maps with the same size corresponding to each detection frame.
2. The method of claim 1, wherein the step of extracting features from the acquired image to obtain a feature map of the image comprises:
extracting features from the acquired image by using a lightweight convolutional neural network to obtain the feature map of the image.
3. The method of claim 2, wherein the step of extracting features from the acquired image by using a lightweight convolutional neural network to obtain a feature map of the image comprises:
inputting the acquired image into a network model constructed based on the lightweight convolutional neural network;
down-sampling the feature map output by a first activation layer of the network model to obtain a first feature map;
up-sampling the feature map output by a second activation layer of the network model to obtain a second feature map;
and splicing the first feature map, the second feature map and a third feature map output by a third activation layer of the network model into the feature map of the image.
4. The method according to claim 1, wherein the step of generating the feature of the boundary point of each region of each detection frame by bilinear interpolation comprises:
representing the coordinates of a boundary point of a given region in a given detection frame as (X/N, Y/N), wherein N denotes the factor by which the acquired image is larger than the feature map of the image, and generating the feature of the boundary point by bilinear interpolation as follows:
finding the largest integer A1 smaller than X/N and the smallest integer A2 larger than X/N, and the largest integer B1 smaller than Y/N and the smallest integer B2 larger than Y/N;
interpolating a first intermediate value from the feature values of the two points with coordinates (A1, B1) and (A1, B2);
interpolating a second intermediate value from the feature values of the two points with coordinates (A2, B1) and (A2, B2);
interpolating a third intermediate value from the first intermediate value and the second intermediate value, and taking the third intermediate value as the feature of the boundary point;
and repeating the above process for every boundary point of every region of every detection frame, thereby generating the features of the boundary points of each region of each detection frame.
5. An object detection apparatus using bilinear interpolation, comprising:
the feature extraction module is used for extracting features from the acquired image to obtain a feature map of the image;
the detection frame generation module is used for generating a plurality of detection frames corresponding to each position of the feature map of the image so as to obtain the feature map corresponding to each detection frame;
the down-sampling module is used for performing ROI down-sampling on the feature maps corresponding to the detection frames to obtain first feature maps of the same size corresponding to the detection frames; evenly dividing the feature map corresponding to each detection frame into M1 × M2 regions, wherein M1 and M2 denote the dimensions of the feature maps of the same size, the dimensions being M1 × M2; generating the features of the boundary points of each region of each detection frame through bilinear interpolation; selecting the feature with the maximum feature value from the features of each region of each detection frame, and obtaining a second feature map of the same size corresponding to each detection frame from the selected features, wherein the features of each region comprise the features of the boundary points of the region and the features of each pixel point within the region; and splicing the first feature map of the same size and the second feature map of the same size corresponding to each detection frame into the feature map of the same size corresponding to that detection frame, wherein the dimensions of both the second feature map of the same size and the first feature map of the same size are M1 × M2;
and the detection module is used for determining the position information of each detection target according to the feature maps with the same size corresponding to each detection frame.
6. The apparatus of claim 5, wherein the feature extraction module is further configured to:
extracting features from the acquired image by using a lightweight convolutional neural network to obtain the feature map of the image.
7. The apparatus of claim 6, wherein the feature extraction module is further configured to:
inputting the acquired image into a network model constructed based on the lightweight convolutional neural network;
down-sampling the feature map output by a first activation layer of the network model to obtain a first feature map;
up-sampling the feature map output by a second activation layer of the network model to obtain a second feature map;
and splicing the first feature map, the second feature map and a third feature map output by a third activation layer of the network model into the feature map of the image.
8. The apparatus of claim 5, wherein the downsampling module comprises a boundary point feature generation submodule configured to:
representing the coordinates of a boundary point of a given region in a given detection frame as (X/N, Y/N), wherein N denotes the factor by which the acquired image is larger than the feature map of the image, and generating the feature of the boundary point by bilinear interpolation as follows:
finding the largest integer A1 smaller than X/N and the smallest integer A2 larger than X/N, and the largest integer B1 smaller than Y/N and the smallest integer B2 larger than Y/N;
interpolating a first intermediate value from the feature values of the two points with coordinates (A1, B1) and (A1, B2);
interpolating a second intermediate value from the feature values of the two points with coordinates (A2, B1) and (A2, B2);
interpolating a third intermediate value from the first intermediate value and the second intermediate value, and taking the third intermediate value as the feature of the boundary point;
and repeating the above process for every boundary point of every region of every detection frame, thereby generating the features of the boundary points of each region of each detection frame.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-4.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN201810641832.XA 2018-06-21 2018-06-21 Target detection method and device by utilizing bilinear interpolation Active CN110633595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810641832.XA CN110633595B (en) 2018-06-21 2018-06-21 Target detection method and device by utilizing bilinear interpolation

Publications (2)

Publication Number Publication Date
CN110633595A CN110633595A (en) 2019-12-31
CN110633595B true CN110633595B (en) 2022-12-02

Family

ID=68966845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810641832.XA Active CN110633595B (en) 2018-06-21 2018-06-21 Target detection method and device by utilizing bilinear interpolation

Country Status (1)

Country Link
CN (1) CN110633595B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569097B (en) * 2021-07-23 2022-11-11 北京百度网讯科技有限公司 Structured information extraction method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361340A (en) * 2014-11-04 2015-02-18 西安电子科技大学 SAR image target fast detecting method based on significance detecting and clustering
CN105184779A (en) * 2015-08-26 2015-12-23 电子科技大学 Rapid-feature-pyramid-based multi-dimensioned tracking method of vehicle
CN105844653A (en) * 2016-04-18 2016-08-10 深圳先进技术研究院 Multilayer convolution neural network optimization system and method
CN106326858A (en) * 2016-08-23 2017-01-11 北京航空航天大学 Road traffic sign automatic identification and management system based on deep learning
CN106447658A (en) * 2016-09-26 2017-02-22 西北工业大学 Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network)
CN106886755A (en) * 2017-01-19 2017-06-23 北京航空航天大学 A kind of intersection vehicles system for detecting regulation violation based on Traffic Sign Recognition
CN107154024A (en) * 2017-05-19 2017-09-12 南京理工大学 Dimension self-adaption method for tracking target based on depth characteristic core correlation filter
CN107358623A (en) * 2017-07-12 2017-11-17 武汉大学 A kind of correlation filtering track algorithm based on conspicuousness detection and robustness size estimation
CN107844795A (en) * 2017-11-18 2018-03-27 中国人民解放军陆军工程大学 Convolutional neural networks feature extracting method based on principal component analysis
CN107944443A (en) * 2017-11-16 2018-04-20 深圳市唯特视科技有限公司 One kind carries out object consistency detection method based on end-to-end deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202144B2 (en) * 2013-10-30 2015-12-01 Nec Laboratories America, Inc. Regionlets with shift invariant neural patterns for object detection

Also Published As

Publication number Publication date
CN110633595A (en) 2019-12-31

Similar Documents

Publication Publication Date Title
CN108446698B (en) Method, device, medium and electronic equipment for detecting text in image
CN112184738B (en) Image segmentation method, device, equipment and storage medium
KR20200087808A (en) Method and apparatus for partitioning instances, electronic devices, programs and media
CN109118456B (en) Image processing method and device
CN110298851B (en) Training method and device for human body segmentation neural network
CN110633717A (en) Training method and device for target detection model
CN110427915B (en) Method and apparatus for outputting information
CN111783777B (en) Image processing method, apparatus, electronic device, and computer readable medium
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN110633716A (en) Target object detection method and device
CN110310293B (en) Human body image segmentation method and device
CN111461968A (en) Picture processing method and device, electronic equipment and computer readable medium
CN115170815A (en) Method, device and medium for processing visual task and training model
CN108596120B (en) Target detection method and device based on deep learning
CN110633597B (en) Drivable region detection method and device
CN114202648A (en) Text image correction method, training method, device, electronic device and medium
CN111461965B (en) Picture processing method and device, electronic equipment and computer readable medium
CN110633595B (en) Target detection method and device by utilizing bilinear interpolation
CN110895699B (en) Method and apparatus for processing feature points of image
CN111461969B (en) Method, device, electronic equipment and computer readable medium for processing picture
CN110634155A (en) Target detection method and device based on deep learning
CN114898190A (en) Image processing method and device
CN109657523B (en) Driving region detection method and device
CN114419298A (en) Virtual object generation method, device, equipment and storage medium
CN113610856A (en) Method and device for training image segmentation model and image segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant