CN111898539A - Multi-target detection method, device, system, equipment and readable storage medium - Google Patents

Multi-target detection method, device, system, equipment and readable storage medium

Info

Publication number
CN111898539A
Authority
CN
China
Prior art keywords
feature
target
fusion
prediction
map
Prior art date
Legal status
Pending
Application number
CN202010754181.2A
Other languages
Chinese (zh)
Inventor
赵盼
李军
林昱
张庆
温悦
Current Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Original Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Priority to CN202010754181.2A
Publication of CN111898539A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target detection method, device, system, equipment and readable storage medium. The multi-target detection method includes: acquiring feature maps of an input image at multiple scales; inputting the multi-scale feature maps into a multi-scale feature fusion layer for fusion to obtain the predicted positions and predicted categories of the targets in the input image at each scale; and merging the predicted positions and predicted categories of the targets to obtain the target positions and target categories. By implementing the method and device, multi-target detection capability is improved when the scale range is large, no complex feature extraction operation is needed, and the real-time performance of target detection is ensured.

Description

Multi-target detection method, device, system, equipment and readable storage medium
Technical Field
The invention relates to the technical field of automatic driving, and in particular to a multi-target detection method, device, system, equipment and readable storage medium.
Background
Object detection is an important task in the field of automatic driving. The detected targets mainly fall into two categories: stationary objects and moving objects. Stationary objects include traffic lights, traffic signs and the like; moving objects include vehicles, pedestrians, non-motorized vehicles and the like. Target detection places high demands on both accuracy and speed: low detection accuracy or excessive computation delay can cause fatal danger.
Real-time detection algorithms based on the YOLO series offer good real-time performance and accuracy, and have therefore been widely adopted and improved. Such a network is an end-to-end fully convolutional structure that uses a large number of skip connections to realize a deeper network hierarchy. It has achieved certain results in real-time target detection, but its detection precision is insufficient; although detection accuracy can be improved by increasing the resolution of the input image, the amount of computation also increases significantly, and real-time requirements cannot be met. In practical applications, multiple types of targets need to be detected simultaneously, and the pixel scales of the targets differ greatly across types. For complex scenes with a large pixel-scale range, existing detection algorithms cannot satisfy the requirements of detection precision and real-time performance at the same time.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the defect that detection algorithms in the prior art cannot achieve both detection accuracy and real-time performance, thereby providing a multi-target detection method, device, system, equipment and readable storage medium.
According to a first aspect, an embodiment of the present invention provides a multi-target detection method, including: acquiring feature maps of an input image under multiple scales; inputting the feature maps of multiple scales into a multi-scale feature fusion layer for fusion to respectively obtain the prediction positions and prediction categories corresponding to the targets of the input image under multiple scales; and merging the predicted positions and the predicted types corresponding to the targets to obtain the target positions and the target types.
With reference to the first aspect, in a first implementation manner of the first aspect, the acquiring feature maps of the input image at multiple scales includes: inputting the input image into a convolutional neural network, extracting shallow image feature data of the input image, and embedding the shallow image feature data into a high-dimensional feature space to obtain a corresponding first feature; and inputting the first feature into a residual neural network, and performing downsampling processing and deep embedding processing on the first feature different numbers of times to obtain a plurality of feature maps of different scales.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the inputting the first feature into a residual neural network and performing downsampling processing and deep embedding processing on the first feature different numbers of times to obtain a plurality of feature maps of different scales includes: inputting the first feature into a first residual neural network, and performing downsampling processing and deep nonlinear embedding processing on the first feature a first preset number of times to obtain a first feature map; inputting the first feature map into a second residual neural network, and performing downsampling processing and deep embedding processing on the first feature map a second preset number of times to obtain a second feature map, wherein the scale of the second feature map is larger than that of the first feature map; and inputting the second feature map into a third residual neural network, and performing downsampling processing and deep embedding processing on the second feature map a third preset number of times to obtain a third feature map, wherein the scale of the third feature map is larger than that of the second feature map.
With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the inputting the feature maps of multiple scales into a multi-scale feature fusion layer for fusion to obtain the prediction positions and prediction categories corresponding to the targets of the input image under multiple scales includes: inputting the first feature map, the second feature map and the third feature map into a first multi-scale feature fusion layer, a second multi-scale feature fusion layer and a third multi-scale feature fusion layer respectively for multi-scale feature fusion to obtain a first fusion map, a second fusion map and a third fusion map correspondingly; and inputting the first fusion map, the second fusion map and the third fusion map into a first convolution combination neural network, a second convolution combination neural network and a third convolution combination neural network respectively for high-dimensional nonlinear embedding to obtain a first prediction position and a first prediction type corresponding to the first fusion map, a second prediction position and a second prediction type corresponding to the second fusion map and a third prediction position and a third prediction type corresponding to the third fusion map.
With reference to the third embodiment of the first aspect, in a fourth embodiment of the first aspect, before inputting the second feature map into the second multi-scale feature fusion layer for fusion, the method further includes: performing first splicing on the third fusion map and the second feature map output by the second residual neural network to generate a first spliced map, and using the first spliced map as the input of the second multi-scale feature fusion layer.
With reference to the fourth embodiment of the first aspect, in a fifth embodiment of the first aspect, before inputting the first feature map into the first multi-scale feature fusion layer for fusion, the method further includes: performing second splicing on the second fusion map and the first feature map output by the first residual neural network to generate a second spliced map, and using the second spliced map as the input of the first multi-scale feature fusion layer.
With reference to the third implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the merging the predicted positions and the predicted categories corresponding to the targets to obtain the target positions and the target categories includes: converting the first prediction position and the first prediction category into a first matrix through a matrix transformation algorithm; converting the second prediction position and the second prediction category into a second matrix through the matrix transformation algorithm; converting the third predicted position and the third predicted category into a third matrix through the matrix transformation algorithm; merging the first matrix, the second matrix, and the third matrix into a target matrix; and obtaining the target position and the target type according to the target matrix.
With reference to the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the obtaining a target position and a target category according to the target matrix includes: screening the prediction positions and prediction types corresponding to the targets contained in the target matrix through a preset algorithm; and obtaining the target position and the target category according to the screened target matrix.
According to a second aspect, an embodiment of the present invention provides a multi-target detection apparatus, including: the acquisition module is used for acquiring feature maps of the input image under multiple scales; the fusion module is used for inputting the feature maps of the multiple scales into a multi-scale feature fusion layer for fusion to respectively obtain the prediction positions and the prediction categories corresponding to the targets of the input image under the multiple scales; and the merging module is used for merging the predicted positions and the predicted types corresponding to the targets to obtain the target positions and the target types.
According to a third aspect, an embodiment of the present invention provides a multi-target detection system, including: the image data acquisition module is used for acquiring input images, generating an input image sequence according to acquisition time, creating an input image queue and inputting the acquired input images into the input image queue frame by frame; an input preprocessing module, configured to obtain the input image from the input image queue, perform preprocessing on the input image to obtain a preprocessed image, create a model image queue, and input the preprocessed image into the model image queue; the multi-target detection device is used for acquiring the preprocessed image from the model image queue, extracting the features of the preprocessed image to obtain feature maps of the preprocessed image in multiple scales, inputting the feature maps of the multiple scales into a multi-scale feature fusion layer for fusion to obtain a predicted position and a predicted category corresponding to each target of the preprocessed image in the multiple scales, merging the predicted position and the predicted category corresponding to each target, creating an output queue, inputting the predicted position and the predicted category into the output queue, and screening the predicted position and the predicted category to obtain the target position and the target category; the output module is used for creating a detection result output queue and inputting the target output result into the detection result output queue; and the format output interface module is used for carrying out format adjustment on the target position and the target type in the detection result output queue and outputting target format data.
According to a fourth aspect, an embodiment of the present invention provides a computer apparatus, including: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, and the processor executing the computer instructions to perform the multi-target detection method according to the first aspect or any embodiment of the first aspect.
According to a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the multi-target detection method described in the first aspect or any implementation manner of the first aspect.
The technical scheme of the invention has the following advantages:
1. according to the multi-target detection method and device, the characteristic graphs of the input image under multiple scales are obtained, the characteristic graphs of the multiple scales are input into the multi-scale characteristic fusion layer to be fused, the prediction positions and the prediction types corresponding to the targets of the input image under the multiple scales are obtained respectively, and the prediction positions and the prediction types corresponding to the targets are combined to obtain the target positions and the target types. By acquiring the feature maps under multiple scales, the feature maps of multiple scales are ensured to be acquired at the same time, the problem that the target detection precision is insufficient due to large pixel scale difference when the feature maps of multiple scales are formed by zooming the input image is avoided, and the multi-target detection capability is improved when the scale range is large; meanwhile, the feature maps of multiple scales are input into the multi-scale feature fusion layer to be fused, the prediction positions and the prediction types corresponding to the targets of the input images under the multiple scales are obtained respectively, complex feature extraction operation is not needed, and the real-time performance of target detection is guaranteed.
2. The multi-target detection system provided by the invention comprises an image data acquisition module, an input preprocessing module, a multi-target detection device, an output module and a format output interface module, wherein the modules are separated, and are linked by using a data queue, so that the modules with different functions are decoupled, and troubleshooting is easy to perform; meanwhile, the image data input into the preprocessing module, the multi-target detection device and the output module are subjected to parallel computation, and the real-time performance of system operation is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a multi-target detection method in an embodiment of the invention;
FIG. 2 is a multi-scale feature fusion layer constructed based on pooling operations of different kernel sizes in an embodiment of the present invention;
FIG. 3 is a multi-scale feature fusion layer constructed by general convolution operations based on different kernel sizes according to an embodiment of the present invention;
FIG. 4 is a multi-scale feature fusion layer formed by dilated (hole) convolution operations with different kernel sizes according to an embodiment of the present invention;
FIG. 5 is a multi-scale feature fusion layer formed by a plurality of feature extraction operations based on different kernel sizes according to an embodiment of the present invention;
FIG. 6 is another flow chart of a multi-target detection method in an embodiment of the present invention;
FIG. 7 is another flow chart of a multi-target detection method in an embodiment of the present invention;
FIG. 8 is another flow chart of a multi-target detection method in accordance with an embodiment of the present invention;
FIG. 9 is a functional block diagram of a multi-target detection arrangement in an embodiment of the present invention;
FIG. 10 is a functional block diagram of a multi-target detection system in an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a computer device in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; the two elements may be directly connected or indirectly connected through an intermediate medium, or may be communicated with each other inside the two elements, or may be wirelessly connected or wired connected. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
In view of the insufficient detection precision in the prior art when the target pixel-scale range in a complex scene is large, this embodiment provides a multi-target detection method, applied to multi-target detection in automatic driving scenarios. As shown in fig. 1, the multi-target detection method comprises the following steps:
and S11, acquiring feature maps of the input image at multiple scales.
Illustratively, the input image is an image which is acquired by the camera equipment and contains an object to be detected. The camera device may be a USB camera or other form of imaging device. The scale is the image size or pixel scale of the input image and can be divided into a larger pixel scale, a medium pixel scale and a smaller pixel scale. And inputting the acquired input image into a feature extraction network for feature extraction to obtain feature maps under multiple scales.
And S12, inputting the feature maps of multiple scales into the multi-scale feature fusion layer for fusion, and respectively obtaining the prediction positions and the prediction types corresponding to the targets of the input image under multiple scales.
Illustratively, the multi-scale fusion layer is used to perform feature fusion on the corresponding feature maps of the input image. The multi-scale feature fusion layer may be built from pooling operations with different kernel sizes: a skip connection plus pooling operations of several different kernel sizes, after which the feature layers are concatenated along the channel dimension and the number of channels is restored by convolution to the single-scale channel count before multi-scale fusion, as shown in fig. 2. It may likewise be built from standard convolution operations with different kernel sizes (fig. 3), from dilated (hole) convolution operations with different kernel sizes (fig. 4), or from a plurality of feature extraction operations with different kernel sizes (fig. 5); each of these variants also includes a skip connection, concatenates the feature layers along the channel dimension, and restores the channel count to the single-scale value before fusion by convolution. The choice is not limited in this application and can be determined by those skilled in the art according to actual needs.
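By way of illustration only, the pooling-based variant (fig. 2) might be sketched roughly as follows in PyTorch; the module name, kernel sizes and channel counts here are assumptions made for this sketch, not values taken from the patent:

```python
import torch
import torch.nn as nn

class PoolingFusionLayer(nn.Module):
    """Multi-scale feature fusion: a skip connection plus pooling at several
    kernel sizes, concatenated along channels and reduced back to the
    single-scale channel count by a 1x1 convolution (illustrative sketch)."""
    def __init__(self, channels, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # Stride-1 max pooling with "same" padding keeps the spatial size unchanged.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        # 1x1 convolution restores the channel count after concatenation.
        self.reduce = nn.Conv2d(channels * (len(kernel_sizes) + 1), channels, kernel_size=1)

    def forward(self, x):
        branches = [x] + [pool(x) for pool in self.pools]   # skip connection + pooled branches
        return self.reduce(torch.cat(branches, dim=1))      # concat on channel dim, then reduce

# e.g. a (1, 1024, 13, 13) feature map keeps its size after fusion
fused = PoolingFusionLayer(1024)(torch.randn(1, 1024, 13, 13))
```

The convolution-based and dilated-convolution-based variants would replace the pooling branches with convolutions of the corresponding kernel sizes while keeping the same skip-connection, concatenation and channel-reduction structure.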
The predicted position is the position of the target contained in the input image; the prediction category is a category to which the target included in the input image belongs, such as a location of a traffic light. And respectively inputting the feature maps under multiple scales to the corresponding multi-scale feature fusion layers for multi-scale fusion, inputting the fused feature maps to a convolutional neural network, and acquiring the prediction positions and prediction types of all targets contained in the feature maps corresponding to the scales.
And S13, merging the predicted positions and the predicted types corresponding to the targets to obtain the target positions and the target types.
Illustratively, the prediction positions and the prediction categories of the targets contained in the obtained feature maps of the scales are spliced and merged, each row of vectors after being merged is used as a group of target positions and target categories, and the target positions and the target categories of the targets contained in the feature maps of the scales are further determined according to the spliced and merged prediction positions and prediction categories.
In the multi-target detection method provided by this embodiment, feature maps of multiple scales are input into a multi-scale feature fusion layer for fusion by obtaining feature maps of an input image under multiple scales, prediction positions and prediction categories corresponding to targets of the input image under multiple scales are obtained respectively, and the prediction positions and prediction categories corresponding to the targets are combined to obtain the target positions and the target categories. By acquiring the feature maps under multiple scales, the feature maps of multiple scales are ensured to be acquired at the same time, the problem that the target detection precision is insufficient due to large pixel scale difference when the feature maps of multiple scales are formed by zooming the input image is avoided, and the multi-target detection capability is improved when the scale range is large; meanwhile, the feature maps of multiple scales are input into the multi-scale feature fusion layer to be fused, the prediction positions and the prediction types corresponding to the targets of the input images under the multiple scales are obtained respectively, complex feature extraction operation is not needed, and the real-time performance of target detection is guaranteed.
As an alternative implementation, as shown in fig. 6, the step S11 includes:
and S111, inputting the input neural network into the convolutional neural network, extracting image shallow feature data of the input image, and embedding the image shallow feature data into a high-dimensional feature space to obtain a corresponding first feature.
Illustratively, the convolutional neural network consists of convolution (3 × 3 kernel), batch normalization (BN) and nonlinear activation (e.g., LeakyReLU), and is used for high-dimensional feature extraction on the input image. Shallow feature extraction is performed on the acquired input image through the convolution combination, and the input data is embedded into a high-dimensional feature space through channel expansion to obtain the first feature corresponding to the input image. For example, if the input image size is (416, 416, 3), the output size is (416, 416, 32), i.e. the first feature corresponding to the input image has size (416, 416, 32).
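A minimal PyTorch-style sketch of such a convolution + BN + LeakyReLU stem is shown below; the helper name conv_bn_leaky, the LeakyReLU slope and the bias choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size=3, stride=1):
    """Convolution + batch normalization + LeakyReLU, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, bias=False),   # BN makes the conv bias redundant
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

stem = conv_bn_leaky(3, 32)              # embeds the image into a 32-channel feature space
x = torch.randn(1, 3, 416, 416)          # NCHW layout: corresponds to a (416, 416, 3) image
print(stem(x).shape)                     # torch.Size([1, 32, 416, 416])
```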
And S112, inputting the first feature into a residual error neural network, and performing downsampling processing and deep embedding processing on the first feature for different times to obtain a plurality of feature maps with different scales.
Illustratively, the residual neural network comprises a downsampling module and a residual module. The downsampling module can be implemented by pooling, standard convolution or dilated convolution; considering implementation difficulty and the information that pooling discards, the downsampling module can be implemented with standard convolution, with the convolution stride set to 2. The residual module can adopt the standard residual form, which avoids vanishing gradients in the network. The first feature output by the convolutional neural network is input into the residual neural network and subjected to downsampling processing and deep embedding processing different numbers of times to obtain a plurality of feature maps at different scales, where different numbers of downsampling operations correspond to feature maps of different scales.
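As a hedged sketch of the two building blocks just described (reusing the hypothetical conv_bn_leaky helper from the stem sketch above; the internal channel split and block layout are assumptions, not the patented design):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual form: two convolutions plus an identity shortcut,
    which helps avoid vanishing gradients (illustrative layout)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, kernel_size=1),
            conv_bn_leaky(channels // 2, channels, kernel_size=3),
        )

    def forward(self, x):
        return x + self.body(x)

def downsample_stage(in_ch, out_ch, num_blocks):
    """Stride-2 convolution for downsampling, followed by residual blocks
    that perform the deep (nonlinear) embedding."""
    layers = [conv_bn_leaky(in_ch, out_ch, kernel_size=3, stride=2)]
    layers += [ResidualBlock(out_ch) for _ in range(num_blocks)]
    return nn.Sequential(*layers)
```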
Further, the step of inputting the first feature into a residual error neural network, and performing downsampling processing and deep embedding processing on the first feature for different times to obtain a plurality of feature maps with different scales includes:
first, a first feature is input into a first residual error neural network, and downsampling processing and deep nonlinear embedding processing are performed on the first feature for a first preset number of times to obtain a first feature map.
Illustratively, the first feature is input into the first residual neural network and subjected to downsampling processing and deep nonlinear embedding processing 3 times to obtain the first feature map. The first feature map is a high-dimensional feature representation with a small number of prediction grids. For example, if the first feature input to the first residual neural network has size (416, 416, 32), the sizes obtained after the successive downsampling stages of the first residual neural network are (208, 208, 64), (104, 104, 128) and (52, 52, 256), respectively. The first feature map may be used for detecting smaller-scale objects.
And secondly, inputting the first feature map into a second residual error neural network, and performing downsampling processing and deep embedding processing on the first feature map for a second preset number of times to obtain a second feature map, wherein the scale of the second feature map is larger than that of the first feature map.
Illustratively, the first feature map output by the first residual neural network, represented by high-dimensional features, is input into the second residual neural network and subjected to downsampling processing and deep feature embedding processing 1 time to obtain the second feature map. The scale of the second feature map is larger than that of the first feature map, and the second feature map is suitable for detecting medium-scale targets. For example, if the first feature map input to the second residual neural network has size (52, 52, 256), the second feature map output after the second residual neural network has size (26, 26, 512).
And thirdly, inputting the second feature map into a third residual error neural network, and performing downsampling processing and deep embedding processing on the second feature map for a third preset number of times to obtain a third feature map, wherein the scale of the third feature map is larger than that of the second feature map.
Exemplarily, the second feature map output by the second residual neural network, represented by high-dimensional features, is input into the third residual neural network and subjected to downsampling processing and deep feature embedding processing 1 time to obtain the third feature map. The scale of the third feature map is larger than that of the second feature map, and the third feature map is suitable for detecting larger-scale targets. For example, if the second feature map input to the third residual neural network has size (26, 26, 512), the third feature map output after the third residual neural network has size (13, 13, 1024).
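Continuing the same hypothetical sketch, the three residual stages can be chained so that their outputs match the example sizes quoted above; the residual-block counts below are assumptions chosen only to make the sketch concrete (they reuse downsample_stage and conv_bn_leaky from the earlier sketches):

```python
import torch
import torch.nn as nn

stage1 = nn.Sequential(                        # three downsampling steps: 416 -> 208 -> 104 -> 52
    downsample_stage(32, 64, num_blocks=1),
    downsample_stage(64, 128, num_blocks=2),
    downsample_stage(128, 256, num_blocks=8),
)
stage2 = downsample_stage(256, 512, num_blocks=8)    # one downsampling step: 52 -> 26
stage3 = downsample_stage(512, 1024, num_blocks=4)   # one downsampling step: 26 -> 13

x = torch.randn(1, 32, 416, 416)   # first feature from the stem
f1 = stage1(x)                     # (1, 256, 52, 52)   -> smaller-scale targets
f2 = stage2(f1)                    # (1, 512, 26, 26)   -> medium-scale targets
f3 = stage3(f2)                    # (1, 1024, 13, 13)  -> larger-scale targets
```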
As an alternative implementation, as shown in fig. 7, the step S12 includes:
and S121, inputting the first feature map, the second feature map and the third feature map into the first multi-scale feature fusion layer, the second multi-scale feature fusion layer and the third multi-scale feature fusion layer respectively for multi-scale feature fusion, and correspondingly obtaining a first fusion map, a second fusion map and a third fusion map.
For an exemplary description of the multi-scale feature fusion layer, reference is made to the detailed description in the corresponding embodiment above, and details are not repeated here. Taking the third feature map as an example, the third feature map is input into the third multi-scale feature fusion layer for multi-scale feature fusion to obtain the third fusion map, which makes the third fusion map more adaptive to changes of the target scale; the third fusion map output by the third multi-scale feature fusion layer does not change the feature size of the input third feature map. If the third feature map input to the third multi-scale feature fusion layer has size (13, 13, 1024), the third fusion map output by the third multi-scale feature fusion layer has size (13, 13, 1024). The first fusion map and the second fusion map are generated in the same way as the third fusion map, which is not repeated here.
And S122, inputting the first fusion map, the second fusion map and the third fusion map into the first convolution combination neural network, the second convolution combination neural network and the third convolution combination neural network respectively for high-dimensional nonlinear embedding, and obtaining a first prediction position and a first prediction type corresponding to the first fusion map, a second prediction position and a second prediction type corresponding to the second fusion map and a third prediction position and a third prediction type corresponding to the third fusion map.
Illustratively, the first predicted position and first prediction category corresponding to the first fusion map, the second predicted position and second prediction category corresponding to the second fusion map, and the third predicted position and third prediction category corresponding to the third fusion map are obtained in similar ways; the third predicted position and third prediction category are taken as the example here. The third fusion map is input into the third convolution combination neural network for high-dimensional nonlinear embedding. Although the output size could be changed during the high-dimensional nonlinear embedding, in this embodiment the output size of the third fusion map is kept unchanged, i.e. the third fusion map input to the third convolution combination neural network has size (13, 13, 1024), and the output size after the high-dimensional nonlinear embedding is (13, 13, 1024). After the high-dimensional nonlinear embedding of the third convolution combination neural network, the number of channels required for target detection on the third fusion map can be obtained, described here with 3 × 15 as an example. Here, 3 is the number of target scales, representing the 3 scales of each output fusion map, and 15 is the length of each prediction vector, comprising 1 probability value of whether a target exists, 4 bounding-box prediction values (center-point coordinates x and y, width prediction w and height prediction h), and 10 category probability values. The categories may include, but are not limited to, automobiles, trucks, buses, motorcycles, bicycles, trains, people, riders, traffic signs, and traffic lights. If the input third fusion map size is (13, 13, 1024), the output size should be (13, 13, 45), containing 13 × 13 prediction grids.
Similarly, the second fusion map is input into the second convolution combination neural network for high-dimensional nonlinear embedding, after which the number of channels required for target detection on the second fusion map can be obtained. If the second fusion map input to the second convolution combination neural network has size (26, 26, 512), the output size after the high-dimensional nonlinear embedding is (26, 26, 512), containing 26 × 26 prediction grids.
Similarly, the first fusion map is input into the first convolution combination neural network for high-dimensional nonlinear embedding, after which the number of channels required for target detection on the first fusion map can be obtained. If the first fusion map input to the first convolution combination neural network has size (52, 52, 256), the output size after the high-dimensional nonlinear embedding is (52, 52, 256), containing 52 × 52 prediction grids.
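A hedged sketch of how one output branch could map a fused feature map to the 3 × 15 = 45 prediction channels described above (1 objectness value + 4 box values + 10 class probabilities per anchor scale); the exact convolution combination and the helper name detection_head are assumptions:

```python
import torch
import torch.nn as nn

NUM_ANCHORS = 3                  # 3 target scales per output branch
NUM_CLASSES = 10                 # cars, trucks, buses, ..., traffic lights
PRED_LEN = 1 + 4 + NUM_CLASSES   # objectness + (x, y, w, h) + class probabilities = 15

def detection_head(in_ch):
    """High-dimensional nonlinear embedding followed by a 1x1 projection
    to NUM_ANCHORS * PRED_LEN = 45 output channels (illustrative layout)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch // 2, kernel_size=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(in_ch // 2, in_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(in_ch, NUM_ANCHORS * PRED_LEN, kernel_size=1),
    )

head_large = detection_head(1024)
out = head_large(torch.randn(1, 1024, 13, 13))
print(out.shape)                 # torch.Size([1, 45, 13, 13]): 13 x 13 prediction grids
```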
As an optional implementation, before inputting the second feature map into the second multi-scale feature fusion layer for fusion, the method further includes: performing first splicing on the third fusion map and the second feature map output by the second residual neural network to generate a first spliced map, and using the first spliced map as the input of the second multi-scale feature fusion layer.
Illustratively, the third fusion map output by the third convolution combination neural network is upsampled by a factor of 2, then spliced with the second feature map output by the second residual neural network, and a 1 × 1 convolution converts the result to the same size as the second feature map, yielding the first spliced map. For example, if the input third fusion map size is (13, 13, 1024) and the input second feature map size is (26, 26, 512), the output first spliced map size is (26, 26, 512).
As an optional implementation, before inputting the first feature map into the first multi-scale feature fusion layer for fusion, the method further includes: performing second splicing on the second fusion map and the first feature map output by the first residual neural network to generate a second spliced map, and using the second spliced map as the input of the first multi-scale feature fusion layer.
Illustratively, the second fusion map output by the second convolution combination neural network is upsampled by a factor of 2, then spliced with the first feature map output by the first residual neural network, and a 1 × 1 convolution converts the result to the same size as the first feature map, yielding the second spliced map. For example, if the input second fusion map size is (26, 26, 512) and the input first feature map size is (52, 52, 256), the output second spliced map size is (52, 52, 256).
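The splicing step above can be sketched as follows; this is an illustrative PyTorch-style sketch that assumes nearest-neighbour 2× upsampling, and the class name StitchBlock is hypothetical:

```python
import torch
import torch.nn as nn

class StitchBlock(nn.Module):
    """2x upsample the deeper fusion map, concatenate it with the shallower
    feature map along channels, then restore the channel count of the
    shallower map with a 1x1 convolution."""
    def __init__(self, deep_channels, shallow_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.reduce = nn.Conv2d(deep_channels + shallow_channels, shallow_channels, kernel_size=1)

    def forward(self, deep_map, shallow_map):
        return self.reduce(torch.cat([self.up(deep_map), shallow_map], dim=1))

first_spliced = StitchBlock(1024, 512)(torch.randn(1, 1024, 13, 13),   # third fusion map
                                       torch.randn(1, 512, 26, 26))    # second feature map
print(first_spliced.shape)   # torch.Size([1, 512, 26, 26])
```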
As an alternative implementation, as shown in fig. 8, the step S13 includes:
s131, converting the first prediction position and the first prediction category into a first matrix through a matrix transformation algorithm.
Illustratively, the first predicted position and the first prediction category may be represented by a first prediction matrix. The matrix transformation algorithm may be a reshape() function, by which the number of rows, columns and dimensions of the first prediction matrix can be re-adjusted to convert the first prediction matrix into the first matrix of shape (13 × 13 × 3, 15). The matrix transformation algorithm is not limited in the present application and can be determined by those skilled in the art according to actual needs.
And S132, converting the second prediction position and the second prediction type into a second matrix through a matrix transformation algorithm.
Illustratively, the second predicted position and the second prediction category may be represented by a second prediction matrix. For a specific description of the matrix transformation algorithm, reference is made to the corresponding description of the above embodiments, which is not repeated here. The second prediction matrix may be transformed into the second matrix of shape (26 × 26 × 3, 15) by the matrix transformation algorithm.
And S133, converting the third prediction position and the third prediction category into a third matrix through a matrix transformation algorithm.
Illustratively, the third predicted position and the third prediction category may be represented by a third prediction matrix. For a specific description of the matrix transformation algorithm, reference is made to the corresponding description of the above embodiments, which is not repeated here. The third prediction matrix may be transformed into the third matrix of shape (52 × 52 × 3, 15) by the matrix transformation algorithm.
And S134, combining the first matrix, the second matrix and the third matrix into a target matrix.
Illustratively, the first matrix (13 × 13 × 3, 15), the second matrix (26 × 26 × 3, 15) and the third matrix (52 × 52 × 3, 15) obtained by the matrix transformation algorithm are merged to generate a target matrix of the form ((13 × 13 + 26 × 26 + 52 × 52) × 3, 15).
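A minimal sketch of the matrix transformation and merge, using reshape and concatenation in PyTorch (equivalent in spirit to the reshape() function mentioned above; the batch dimension, channel layout and helper name flatten_branch are assumptions):

```python
import torch

def flatten_branch(pred):
    """Reshape one branch output (N, 3*15, H, W) into (N, H*W*3, 15),
    one 15-dimensional prediction vector per row."""
    n, c, h, w = pred.shape
    pred = pred.view(n, 3, 15, h, w)       # split channels into (anchor, vector)
    pred = pred.permute(0, 3, 4, 1, 2)     # (N, H, W, anchor, 15)
    return pred.reshape(n, h * w * 3, 15)

small = flatten_branch(torch.randn(1, 45, 13, 13))    # (1, 13*13*3, 15)
medium = flatten_branch(torch.randn(1, 45, 26, 26))   # (1, 26*26*3, 15)
large = flatten_branch(torch.randn(1, 45, 52, 52))    # (1, 52*52*3, 15)

target = torch.cat([small, medium, large], dim=1)
print(target.shape)   # (1, (13*13 + 26*26 + 52*52) * 3, 15)
```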
And S135, obtaining the target position and the target type according to the target matrix.
Exemplarily, each row vector in the target matrix is used as a group of target prediction items, and according to the target prediction items in the target matrix, the target position and the target category of each target included in each scale feature map are further determined.
As an optional implementation manner, the step S135 includes:
firstly, a preset algorithm is used for screening the prediction position and the prediction type corresponding to each target in a target matrix.
Illustratively, the preset algorithm combines a target-existence probability threshold with bounding-box non-maximum suppression. Specifically, the prediction matrix corresponding to a given input image is obtained, and threshold elimination is performed on the target-existence probability values in the prediction matrix: for example, with a target-existence probability threshold of 0.3, prediction items whose target-existence probability is smaller than 0.3 are eliminated, leaving the prediction items with higher target-existence probability. The remaining prediction items are then further pruned by bounding-box non-maximum suppression based on the predicted center-point position, width and height, for example with a bounding-box intersection-over-union (IoU) threshold of 0.6.
And secondly, obtaining the target position and the target category according to the screened target matrix.
Illustratively, the target-existence probability threshold and bounding-box non-maximum suppression are used to screen the prediction items in the target matrix, and the prediction items retained after screening are the final prediction results, i.e. the target positions and target categories corresponding to the targets in the input image.
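The screening step might look roughly like the sketch below; the column ordering of the 15-dimensional vector, the use of torchvision's NMS routine, and the class-agnostic scoring are all assumptions made for illustration:

```python
import torch
from torchvision.ops import nms   # any equivalent NMS routine would also work

def filter_predictions(target_matrix, obj_thresh=0.3, iou_thresh=0.6):
    """Screen the merged (rows, 15) target matrix: drop rows whose objectness
    is below the threshold, then apply bounding-box non-maximum suppression
    on the remaining (x, y, w, h) predictions."""
    obj = target_matrix[:, 0]                     # assumed column 0: objectness probability
    keep = target_matrix[obj > obj_thresh]
    x, y, w, h = keep[:, 1], keep[:, 2], keep[:, 3], keep[:, 4]
    boxes = torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], dim=1)
    scores = keep[:, 0] * keep[:, 5:].max(dim=1).values   # objectness x best class probability
    kept_idx = nms(boxes, scores, iou_thresh)
    return keep[kept_idx]

rows = filter_predictions(torch.rand((13 * 13 + 26 * 26 + 52 * 52) * 3, 15))
classes = rows[:, 5:].argmax(dim=1)   # predicted category index for each remaining target
```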
Example 2
The present embodiment provides a multi-target detection apparatus, which can be used as a general module for multi-target detection in an automatic driving scenario, as shown in fig. 9, the multi-target detection apparatus includes:
and the obtaining module 21 is configured to obtain feature maps of the input image at multiple scales. For details, refer to the related description of step S11 corresponding to the above method embodiment, and are not repeated herein.
And the fusion module 22 is configured to input the feature maps of multiple scales into the multi-scale feature fusion layer for fusion, so as to obtain the prediction positions and prediction categories corresponding to the targets of the input image under multiple scales, respectively. For details, refer to the related description of step S12 corresponding to the above method embodiment, and are not repeated herein.
And the merging module 23 is configured to merge the predicted positions and the predicted categories corresponding to the targets to obtain target positions and target categories. For details, refer to the related description of step S13 corresponding to the above method embodiment, and are not repeated herein.
The multi-target detection device provided by this embodiment obtains feature maps of an input image in multiple scales, inputs the feature maps in the multiple scales into a multi-scale feature fusion layer for fusion, respectively obtains predicted positions and predicted categories corresponding to targets of the input image in the multiple scales, and combines the predicted positions and predicted categories corresponding to the targets to obtain the target positions and target categories. By acquiring the feature maps under multiple scales, the feature maps of multiple scales are ensured to be acquired at the same time, the problem that the target detection precision is insufficient due to large pixel scale difference when the feature maps of multiple scales are formed by zooming the input image is avoided, and the multi-target detection capability is improved when the scale range is large; meanwhile, the feature maps of multiple scales are input into the multi-scale feature fusion layer to be fused, the prediction positions and the prediction types corresponding to the targets of the input images under the multiple scales are obtained respectively, complex feature extraction operation is not needed, and the real-time performance of target detection is guaranteed.
As an optional implementation manner, the obtaining module 21 includes:
and the extraction submodule is used for inputting the input image into the convolutional neural network, extracting the image shallow feature data of the input image, and embedding the image shallow feature data into the high-dimensional feature space to obtain the corresponding first feature. For details, refer to the related description of step S111 corresponding to the above method embodiment, and are not repeated herein.
And the first processing submodule is used for inputting the first characteristic into the residual error neural network, and performing downsampling processing and deep embedding processing on the first characteristic for different times to obtain a plurality of characteristic graphs with different scales. For details, refer to the related description of step S112 corresponding to the above method embodiment, and are not repeated herein.
As an optional implementation, the first processing sub-module includes:
and the first processing subunit is used for inputting the first feature into the first residual error neural network, and performing downsampling processing and deep nonlinear embedding processing on the first feature for a first preset number of times to obtain a first feature map. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the second processing subunit is used for inputting the first feature map into a second residual error neural network, and performing downsampling processing and deep embedding processing on the first feature map for a second preset number of times to obtain a second feature map, wherein the scale of the second feature map is larger than that of the first feature map. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the third processing subunit is used for inputting the second feature map into a third residual neural network, and performing downsampling processing and deep embedding processing on the second feature map for a third preset number of times to obtain a third feature map, wherein the scale of the third feature map is larger than that of the second feature map. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
As an optional embodiment, the fusion module 22 includes:
and the fusion submodule is used for inputting the first feature map, the second feature map and the third feature map into the first multi-scale feature fusion layer, the second multi-scale feature fusion layer and the third multi-scale feature fusion layer respectively to perform multi-scale feature fusion, and correspondingly obtaining the first fusion map, the second fusion map and the third fusion map. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the convolution sub-module is used for inputting the first fusion graph, the second fusion graph and the third fusion graph into the first convolution combination neural network, the second convolution combination neural network and the third convolution combination neural network respectively for high-dimensional nonlinear embedding to obtain a first prediction position and a first prediction category corresponding to the first fusion graph, a second prediction position and a second prediction category corresponding to the second fusion graph and a third prediction position and a third prediction category corresponding to the third fusion graph. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
As an optional implementation, the multi-target detection apparatus further includes:
and the first splicing module is used for performing first splicing on the third fusion graph and a second feature graph output by the second residual error neural network to generate a first splicing graph, and taking the first splicing graph as the input of the second multi-scale feature fusion layer. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
As an optional implementation, the multi-target detection apparatus further includes:
and the second splicing module is used for performing second splicing on the second fusion graph and the first feature graph output by the first residual error neural network to generate a second splicing graph, and taking the second splicing graph as the input of the first multi-scale feature fusion layer. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
As an optional implementation, the merging module 23 includes:
and the first conversion submodule is used for converting the first prediction position and the first prediction category into a first matrix through a matrix conversion algorithm. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the second conversion sub-module is used for converting the second prediction position and the second prediction category into a second matrix through a matrix transformation algorithm. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the third conversion sub-module is used for converting the third prediction position and the third prediction category into a third matrix through a matrix transformation algorithm. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the merging submodule is used for merging the first matrix, the second matrix and the third matrix into a target matrix. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the determining submodule is used for obtaining the target position and the target category according to the target matrix. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
As an optional implementation, the determining sub-module includes:
and the screening subunit is used for screening the prediction position and the prediction category corresponding to each target in the target matrix through a preset algorithm. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
And the determining subunit is used for obtaining the target position and the target type according to the screened target matrix. For details, refer to the corresponding related description of the above method embodiments, and are not repeated herein.
Example 3
The present embodiment provides a multi-target detection system, which can be applied to an autonomous driving vehicle, and can detect a target in a forward road in real time to improve safety of autonomous driving, and the multi-target detection system is shown in fig. 10, and includes: an image data acquisition module 31, an input preprocessing module 32, a multi-object detection device 33, an output module 34, and a format output interface module 35.
The image data acquiring module 31 is configured to acquire an input image, generate an input image sequence according to acquisition time, create an input image queue, and input the acquired input image into the input image queue frame by frame.
For example, the image data acquisition module may employ an external camera device on the autonomous-vehicle host, such as a USB camera or another form of imaging device, and open the camera to capture input images using an OpenCV tool or another scripting tool. Taking an input image size of (1280, 720) as a reference, an input image sequence is generated according to the acquisition time of the external camera device, an input image queue is created, the acquired input images are sent into input image queue 1 frame by frame, and they wait to be retrieved by the input preprocessing module 32.
The input preprocessing module 32 is configured to obtain an input image from the input image queue, preprocess the input image to obtain a preprocessed image, create a model image queue, and input the preprocessed image into the model image queue.
Illustratively, the input preprocessing module 32 may run on the autonomous-vehicle host CPU and processes the raw captured input images into the image format required by the multi-target detection device, including image scaling, edge filling and so on. The original input image is first acquired from the input image queue of the image data acquisition module 31; the acquired input image is then scaled by the ratio of its longer side (for example, 1280) to the image size accepted by the multi-target detection device (416 as a reference), and the gap along the shorter side up to the model input size is filled with a pixel value of 0 or 128 or the dataset mean, giving a preprocessed image of size (416, 416). A model image input queue 2 is created, the preprocessed image data is put into the model image input queue, and it waits to be retrieved by the multi-target detection device 33.
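For illustration, a minimal OpenCV/NumPy sketch of this scaling-and-padding step; centering the padded image and the function name preprocess are assumptions (the description only requires filling the vacancy along the shorter side):

```python
import cv2
import numpy as np

def preprocess(frame, model_size=416, pad_value=128):
    """Scale the captured frame by the ratio of its longer side to the model
    input size, then pad the shorter side to a square (416, 416) image.
    pad_value may be 0, 128, or the dataset mean, as mentioned above."""
    h, w = frame.shape[:2]                       # e.g. (720, 1280)
    scale = model_size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(frame, (new_w, new_h))
    canvas = np.full((model_size, model_size, 3), pad_value, dtype=np.uint8)
    top = (model_size - new_h) // 2
    left = (model_size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

padded = preprocess(np.zeros((720, 1280, 3), dtype=np.uint8))
print(padded.shape)   # (416, 416, 3)
```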
The multi-target detection device 33 is configured to obtain a preprocessed image from the model image queue, perform feature extraction on the preprocessed image to obtain feature maps of the preprocessed image at multiple scales, and input the feature maps of the multiple scales into the multi-scale feature fusion layer for fusion to obtain the predicted position and predicted category corresponding to each target of the preprocessed image at the multiple scales. The device then merges the predicted positions and predicted categories corresponding to the targets, creates an output queue, inputs the predicted positions and predicted categories into the output queue, and performs screening processing on the predicted positions and predicted categories in the output queue to obtain the target positions and target categories.
Illustratively, the preprocessed image is obtained from the model image input queue of the input preprocessing module 32 and fed into the multi-target detection device, where forward calculation of the feature extraction network yields a feature layer of size (13, 13, 1024); the large, medium, and small output branches are then calculated in sequence to obtain output layers of sizes (13, 13, 45), (26, 26, 45), and (52, 52, 45), respectively. Each 45-dimensional vector consists of 15-dimensional vectors for 3 scales, and each 15-dimensional vector includes 4 bounding box center point and size prediction values (i.e., the x value, the y value, the width prediction value w, and the height prediction value h), 1 target existence probability value, and 10 category probability prediction values; since each output branch contains 3 scales, the large, medium, and small branches together yield predicted output values at 9 scales. The prediction matrices corresponding to the three output branches are converted into matrices of corresponding dimensions and merged to obtain a target matrix of shape ((13 × 13 + 26 × 26 + 52 × 52) × 3, 15). An output queue 3 is created, and the predicted values in the target matrix are added to the output queue, waiting for the output module 34 to obtain them. The multi-target detection device 33 may run on a GPU, an Xavier or Jetson series device, or an FPGA, which is not limited in this application.
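The conversion and merging of the three output branches into the target matrix could be sketched as follows; the reshape assumes the 45 channels of each branch are laid out as 3 × 15 contiguous values per grid cell, and the function and variable names are illustrative, not taken from the patent.

import numpy as np

def merge_branches(out_large, out_medium, out_small):
    # out_* have shapes (13, 13, 45), (26, 26, 45) and (52, 52, 45); each
    # 45-dimensional vector holds 3 x 15 values (x, y, w, h, objectness, 10 classes).
    rows = []
    for out in (out_large, out_medium, out_small):
        s = out.shape[0]
        rows.append(out.reshape(s * s * 3, 15))   # flatten grid cells and the 3 scales per cell
    # Target matrix of shape ((13*13 + 26*26 + 52*52) * 3, 15)
    return np.concatenate(rows, axis=0)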
A target matrix of a given image is acquired from the output queue, and threshold elimination is first performed on the target existence probability values (for example, prediction items with a probability value smaller than the threshold of 0.3 are eliminated), leaving the prediction items with higher probability. Then, for the remaining prediction items, partial elimination is performed with a non-maximum suppression function based on the predicted center point position, width, and height (for example, a prediction item is eliminated when its bounding-box IoU with a higher-confidence prediction exceeds the threshold of 0.6), and the remaining prediction items are the final target positions and target categories.
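A plain greedy non-maximum suppression under the thresholds of the example (objectness 0.3, IoU 0.6) is sketched below; the patent does not fix a particular NMS variant, so this is only one possible, class-agnostic form, and the returned indices refer to the thresholded subset.

import numpy as np

def iou(box, boxes):
    # box and boxes are given as (x1, y1, x2, y2) corner coordinates.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def screen(target_matrix, score_thr=0.3, iou_thr=0.6):
    # Columns 0-3: center x, center y, width, height; column 4: objectness.
    boxes = target_matrix[:, :4]
    scores = target_matrix[:, 4]
    keep = scores > score_thr                      # threshold elimination
    boxes, scores = boxes[keep], scores[keep]
    xyxy = np.column_stack([boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
                            boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2])
    order = scores.argsort()[::-1]                 # highest confidence first
    kept = []
    while order.size:
        i = order[0]
        kept.append(int(i))
        rest = order[1:]
        order = rest[iou(xyxy[i], xyxy[rest]) <= iou_thr]   # drop highly overlapping items
    return kept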
The output module 34 is used for creating a detection result output queue and inputting the target output result into the detection result output queue.
For example, the output module 34 may run on the host CPU of the autonomous vehicle, create a detection result output queue 4, place the final target position and target category into the detection result output queue, and wait for the external device of the format output interface module 35 to obtain them.
The format output interface module 35 is configured to perform format adjustment on the target position and the target category in the detection result output queue and output target format data.
For example, since different external devices require detection results in different formats, the format output interface module 35 adjusts the format of the detection results to meet the requirements of the different external devices. After the detection result is obtained from the detection result output queue of the output module 34, it is put into an interface queue in, for example, a json file, txt file, or binary stream format, waiting for the external device to obtain it.
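By way of illustration only, formatting a detection result as a json string before placing it into the interface queue might look like the following sketch; the field names are assumptions, not a format prescribed by the patent.

import json

def to_json(detections):
    # detections: list of (x, y, w, h, score, class_id) tuples for one image.
    payload = [
        {"bbox": [float(x), float(y), float(w), float(h)],
         "score": float(score),
         "class_id": int(class_id)}
        for x, y, w, h, score, class_id in detections
    ]
    return json.dumps(payload)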
The multi-target detection system provided by this embodiment includes an image data acquisition module, an input preprocessing module, a multi-target detection device, an output module, and a format output interface module. The modules are separated and linked by data queues, so that modules with different functions are decoupled and troubleshooting is easy; meanwhile, the input preprocessing module, the multi-target detection device, and the output module compute in parallel, which guarantees the real-time performance of system operation.
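A minimal sketch of this queue-linked, parallel arrangement is given below, assuming Python threads and identity placeholder functions standing in for the modules of fig. 10; it only illustrates the wiring between stages, not the modules themselves.

import queue
import threading

# One queue between each pair of neighbouring modules (queues 1-4 above).
input_q, model_q, output_q, result_q = (queue.Queue(maxsize=32) for _ in range(4))

def run_stage(get_q, put_q, work):
    # Generic stage: take an item from the upstream queue, process it, pass it on.
    while True:
        put_q.put(work(get_q.get()))

stages = [
    threading.Thread(target=run_stage, args=(input_q, model_q, lambda im: im), daemon=True),   # preprocessing placeholder
    threading.Thread(target=run_stage, args=(model_q, output_q, lambda im: im), daemon=True),  # detection placeholder
    threading.Thread(target=run_stage, args=(output_q, result_q, lambda d: d), daemon=True),   # output placeholder
]
for t in stages:
    t.start()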
Example 4
This embodiment provides a computer device applied to multi-target detection in an automatic driving scene. As shown in fig. 11, the device includes a processor 41 and a memory 42, where the processor 41 and the memory 42 may be connected through a bus or in other ways; fig. 11 takes the bus connection as an example.
The processor 41 may be a Central Processing Unit (CPU). The processor 41 may also be another general-purpose processor, a Digital Signal Processor (DSP), a Graphics Processing Unit (GPU), an embedded Neural Network Processor (NPU) or other dedicated deep learning coprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or any combination thereof.
The memory 42, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the multi-target detection method in the embodiment of the present invention (the acquisition module 21, the fusion module 22, and the merging module 23 shown in fig. 9). The processor 41 executes the non-transitory software programs, instructions, and modules stored in the memory 42 to perform various functional applications and data processing, that is, to implement the multi-target detection method in the above method embodiments.
The memory 42 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 41, and the like. Further, the memory 42 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 42 may optionally include memory located remotely from processor 41, which may be connected to processor 41 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 42 and, when executed by the processor 41, perform the multi-target detection method in the embodiments shown in fig. 1 to fig. 8.
In the method, feature maps of an input image at multiple scales are acquired, the feature maps of the multiple scales are input into a multi-scale feature fusion layer for fusion to obtain the prediction positions and prediction categories corresponding to the targets of the input image at the multiple scales respectively, and the prediction positions and prediction categories corresponding to the targets are merged to obtain the target positions and target categories. By acquiring the feature maps at multiple scales directly, feature maps of multiple scales are guaranteed to be obtained at the same time, which avoids the problem of insufficient target detection precision caused by large pixel-scale differences when feature maps of multiple scales are formed by zooming the input image, and improves the multi-target detection capability when the scale range is large; meanwhile, the feature maps of the multiple scales are input into the multi-scale feature fusion layer for fusion to obtain the prediction positions and prediction categories corresponding to the targets of the input image at the multiple scales respectively, without complex feature extraction operations, which guarantees the real-time performance of target detection.
The details of the computer device can be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 10, and are not described herein again.
Embodiments of the present invention further provide a non-transitory computer storage medium, where computer-executable instructions are stored, and the computer-executable instructions may execute the multi-target detection method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard disk (Hard disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom are within the scope of the invention.

Claims (12)

1. A multi-target detection method, comprising:
acquiring feature maps of an input image under multiple scales;
inputting the feature maps of multiple scales into a multi-scale feature fusion layer for fusion to respectively obtain the prediction positions and prediction categories corresponding to the targets of the input image under multiple scales;
and merging the prediction positions and the prediction categories corresponding to the targets to obtain target positions and target categories.
2. The method of claim 1, wherein the acquiring feature maps of an input image under multiple scales comprises:
inputting the input image into a convolutional neural network, extracting image shallow feature data of the input image, and embedding the image shallow feature data into a high-dimensional feature space to obtain a corresponding first feature;
and inputting the first feature into a residual error neural network, and performing downsampling processing and deep embedding processing on the first feature for different times to obtain a plurality of feature maps with different scales.
3. The method of claim 2, wherein the inputting the first feature into a residual error neural network, and performing downsampling processing and deep embedding processing on the first feature for different times to obtain a plurality of feature maps with different scales comprises:
inputting the first feature into a first residual error neural network, and performing down-sampling processing and deep nonlinear embedding processing on the first feature for a first preset number of times to obtain a first feature map;
inputting the first feature map into a second residual error neural network, and performing downsampling processing and deep embedding processing on the first feature map for a second preset number of times to obtain a second feature map, wherein the scale of the second feature map is larger than that of the first feature map;
and inputting the second feature map into a third residual error neural network, and performing downsampling processing and deep embedding processing on the second feature map for a third preset number of times to obtain a third feature map, wherein the scale of the third feature map is larger than that of the second feature map.
4. The method according to claim 3, wherein the inputting the feature maps of the multiple scales into a multi-scale feature fusion layer for fusion to obtain the prediction positions and the prediction categories corresponding to the targets of the input image under the multiple scales respectively comprises:
inputting the first feature map, the second feature map and the third feature map into a first multi-scale feature fusion layer, a second multi-scale feature fusion layer and a third multi-scale feature fusion layer respectively for multi-scale feature fusion to obtain a first fusion map, a second fusion map and a third fusion map correspondingly;
and inputting the first fusion map, the second fusion map and the third fusion map into a first convolution combination neural network, a second convolution combination neural network and a third convolution combination neural network respectively for high-dimensional nonlinear embedding to obtain a first prediction position and a first prediction category corresponding to the first fusion map, a second prediction position and a second prediction category corresponding to the second fusion map, and a third prediction position and a third prediction category corresponding to the third fusion map.
5. The method of claim 4, further comprising, prior to inputting the second feature map into a second multi-scale feature fusion layer for fusion:
and performing first splicing on the third fusion graph and the second feature graph output by the second residual error neural network to generate a first spliced graph, and taking the first spliced graph as the input of a second multi-scale feature fusion layer.
6. The method of claim 5, further comprising, prior to inputting the first feature map into a first multi-scale feature fusion layer for fusion:
and performing second splicing on the second fusion graph and the first feature graph output by the first residual error neural network to generate a second spliced graph, and taking the second spliced graph as the input of the first multi-scale feature fusion layer.
7. The method according to claim 4, wherein the merging the prediction positions and the prediction categories corresponding to the targets to obtain the target positions and the target categories comprises:
converting the first prediction position and the first prediction category into a first matrix through a matrix transformation algorithm;
converting the second prediction position and the second prediction category into a second matrix through the matrix transformation algorithm;
converting the third prediction position and the third prediction category into a third matrix through the matrix transformation algorithm;
merging the first matrix, the second matrix, and the third matrix into a target matrix;
and obtaining the target position and the target category according to the target matrix.
8. The method of claim 7, wherein the obtaining the target position and the target category according to the target matrix comprises:
screening the prediction positions and prediction types corresponding to the targets contained in the target matrix through a preset algorithm;
and obtaining the target position and the target category according to the screened target matrix.
9. A multi-target detection apparatus, comprising:
the acquisition module is used for acquiring feature maps of the input image under multiple scales;
the fusion module is used for inputting the feature maps of the multiple scales into a multi-scale feature fusion layer for fusion to respectively obtain the prediction positions and the prediction categories corresponding to the targets of the input image under the multiple scales;
and the merging module is used for merging the prediction positions and the prediction categories corresponding to the targets to obtain the target positions and the target categories.
10. A multi-target detection system, comprising:
the image data acquisition module is used for acquiring input images, generating an input image sequence according to acquisition time, creating an input image queue and inputting the acquired input images into the input image queue frame by frame;
an input preprocessing module, configured to obtain the input image from the input image queue, perform preprocessing on the input image to obtain a preprocessed image, create a model image queue, and input the preprocessed image into the model image queue;
the multi-target detection device is used for acquiring the preprocessed image from the model image queue, extracting the features of the preprocessed image to obtain feature maps of the preprocessed image in multiple scales, inputting the feature maps of the multiple scales into a multi-scale feature fusion layer for fusion to obtain a predicted position and a predicted category corresponding to each target of the preprocessed image in the multiple scales, merging the predicted position and the predicted category corresponding to each target, creating an output queue, inputting the predicted position and the predicted category into the output queue, and screening the predicted position and the predicted category in the output queue to obtain a target position and a target category;
the output module is used for creating a detection result output queue and inputting the target output result into the detection result output queue;
and the format output interface module is used for carrying out format adjustment on the target position and the target type in the detection result output queue and outputting target format data.
11. A computer device, comprising: a memory and a processor communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the multi-target detection method of any one of claims 1-8.
12. A computer-readable storage medium storing computer instructions for causing a computer to perform the multi-target detection method of any one of claims 1-8.
CN202010754181.2A 2020-07-30 2020-07-30 Multi-target detection method, device, system, equipment and readable storage medium Pending CN111898539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010754181.2A CN111898539A (en) 2020-07-30 2020-07-30 Multi-target detection method, device, system, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111898539A (en) 2020-11-06

Family

ID=73183459

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination