CN111461221A - Multi-source sensor fusion target detection method and system for automatic driving - Google Patents

Multi-source sensor fusion target detection method and system for automatic driving

Info

Publication number
CN111461221A
CN111461221A (application CN202010250235.1A)
Authority
CN
China
Prior art keywords
information
network
target
point cloud
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010250235.1A
Other languages
Chinese (zh)
Other versions
CN111461221B (en)
Inventor
汪洋
丁丽琴
孙晨阳
张珊
张泽环
曾奕欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202010250235.1A priority Critical patent/CN111461221B/en
Publication of CN111461221A publication Critical patent/CN111461221A/en
Application granted granted Critical
Publication of CN111461221B publication Critical patent/CN111461221B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention relates to the technical field of target detection, and in particular to a multi-source sensor fusion target detection method and system for automatic driving. It is mainly applied to three-dimensional target detection for automatic driving and improves the precision of three-dimensional target detection while adding as little as possible to the time and space complexity of the algorithm. The target detection method comprises the following steps: image information and point cloud information to be detected are acquired by the sensors; the information is extracted and fused by the feature extraction network provided by the invention to determine the possible positions of targets; the targets and the background are then classified by a DRPN (serial regional nomination network) to improve classification precision as much as possible; and a classification regression network makes a further judgment to obtain the category information of each target. Through the feature extraction network, the regional nomination network and the classification regression network, the target category can be accurately identified and the target detection accuracy is improved.

Description

Multi-source sensor fusion target detection method and system for automatic driving
Technical Field
The invention relates to the technical field of target detection, and in particular to a multi-source sensor fusion target detection method and system for automatic driving, and to computer equipment.
Background
With the rapid development of the national economy, most cities have embarked on building new, modern urban road networks. The defining characteristics of a modern city are spacious roads, rational overall urban planning and a well-developed traffic management system. Traffic vehicle detection has become a key building block of intelligent traffic systems; it has broad application prospects in traffic dispersion, driver-assistance systems, road monitoring and related fields, and can provide important clues and evidence for public security cases and traffic accident investigations. Traffic vehicle detection is a form of target detection. Target detection uses technologies such as image processing, pattern recognition and artificial intelligence to extract different kinds of feature information about targets in a scene and detect the corresponding targets; it is an important branch of computer vision and image processing and is widely applied in fields such as unmanned driving. However, the detection process is subject to various interferences and to natural and human factors, such as occlusion, light intensity and shooting angle, which can distort the target and directly affect detection accuracy.
In recent years, with the development of deep neural networks, research on image classification, target detection, semantic segmentation and related tasks has advanced remarkably. In the field of target detection in particular, two-dimensional detection algorithms based on deep learning have improved greatly in accuracy and real-time performance over conventional feature-based machine learning methods, and have achieved notable results on public datasets such as KITTI and COCO.
Researchers have therefore proposed three-dimensional target detection methods, which aim to acquire geometric information such as the position, size and attitude of a target in three-dimensional space. Existing three-dimensional detection algorithms can be roughly divided into vision-based, laser-point-cloud-based and multi-modal-fusion approaches according to the sensors used. Vision-based methods are widely used in target detection because of their low cost and rich texture features, and can be further divided into monocular and binocular/depth vision according to the camera type. The key problem of the former is that depth information cannot be acquired directly, so the positioning error of a target in three-dimensional space is large; the latter provides rich texture information and accurate depth information, but binocular/depth vision is more sensitive to factors such as illumination conditions, which easily introduces deviations into the depth calculation. Laser point cloud data, by contrast, carries accurate depth information and distinct three-dimensional spatial features and is also widely applied to three-dimensional target detection. However, the detection precision of existing laser point cloud three-dimensional target detection is still lower than that of visual three-dimensional target detection and cannot meet the requirements of actual scenes, so a high-precision multi-source sensor target detection system needs to be designed.
Disclosure of Invention
The invention mainly solves the technical problem of the low accuracy of existing target detection, and, directed at this problem, provides a multi-source sensor fusion target detection method and system for automatic driving, as well as computer equipment.
An automatic driving-oriented multi-source sensor fusion target detection method comprises the following steps:
acquiring image information and point cloud information to be detected;
inputting the image information and the point cloud information into a feature extraction network for feature extraction to respectively obtain image feature information and point cloud feature information;
inputting the image characteristic information and the point cloud characteristic information into a depth fusion network for characteristic fusion to obtain fused characteristic information;
inputting the fused feature information into a regional nomination network to classify the target and the background;
and inputting the target into a classification regression network for further classification to obtain the class information of the target.
In one embodiment, the feature extraction network includes five convolutional layers, one max-pooling layer disposed between the first and second convolutional layers, and one average-pooling layer disposed after the fifth convolutional layer.
In another embodiment, the inputting the image information and the point cloud information into a feature extraction network for feature extraction to obtain image feature information and point cloud feature information respectively includes:
providing a feature extraction network formed by a PLNet (Parallel Net) network;
inputting the image information and the point cloud information into the feature extraction network to obtain image feature information and point cloud feature information;
in one embodiment, the PLNet network comprises no less than 10 residual modules, each residual module comprising 10-16 parallel branches, each branch comprising a plurality of convolutional layers;
preferably, the PLNet network includes 16 residual modules, each residual module includes 13 branches, and each branch includes three convolutional layers, wherein the length and width of the first-layer filters are both 1 and the number of filters is set to 9, the length and width of the second-layer filters are both 3 and the number of filters is set to 9, and the length and width of the third-layer filters are both 1 and the number of filters is set to 256.
In another embodiment, the inputting the image feature information and the point cloud feature information into a depth fusion network for feature fusion, and obtaining the fused feature information includes:
and fusing the image feature information and the point cloud feature information at each layer while feature extraction is performed on the image information and the point cloud information, so as to obtain the fused feature information.
In one embodiment, the regional nomination network is a serial regional nomination network comprising a first classification module and a first regressor, and a second classification module and a second regressor;
inputting the fused feature information into a regional nomination network to classify targets and backgrounds comprises:
setting the IOU (Intersection over Union, a simple metric that measures the overlap between a predicted bounding box and a real box) value of the first classification module to any value in [0.45, 0.55], preliminarily screening the scanning frames, preliminarily separating them into target and background, and preliminarily adjusting the scanning frames judged to be targets with the first regressor; preferably, the IOU value of the first classification module is set to 0.5, which gives the best effect;
and setting the IOU value of the second classification module to any value in (0.55, 0.65), screening the preliminarily adjusted scanning frames again, outputting the nomination frames judged to be targets, and adjusting the scanning frames judged to be targets again with the second regressor; preferably, the IOU value of the second classification module is set to 0.6, which gives the best effect.
In another embodiment, inputting the target into a classification regression network for further classification to obtain the class information of the target comprises:
inputting the re-adjusted nomination frame into a classification regression network to obtain the probability that each target in the nomination frame is in each category, and selecting the category with the maximum probability as the category of the target.
An autopilot-oriented multi-source sensor fusion target detection system comprising:
the information acquisition module is used for acquiring image information to be detected and point cloud information;
the feature extraction module is used for inputting the image information and the point cloud information into a feature extraction network for feature extraction to respectively obtain image feature information and point cloud feature information;
the feature fusion module is used for inputting the image feature information and the point cloud feature information into a depth fusion network for feature fusion to obtain fused feature information;
the regional nomination module is used for inputting the fused characteristic information into a regional nomination network so as to classify the target and the background;
and the classification regression module is used for inputting the target into a classification regression network for further classification so as to obtain the class information of the target.
A computer device, comprising:
a memory for storing a program;
a processor for implementing the object detection method as described above by executing the program stored in the memory.
A computer-readable storage medium characterized by comprising a program executable by a processor to implement the object detection method as described above.
According to the multi-source sensor fusion target detection method and system for automatic driving, image feature information and point cloud feature information are first obtained from the image information and point cloud data by feature extraction; after the two are deeply fused, the target category can be accurately identified by the regional nomination network and classification regression network designed in this application, improving target detection accuracy.
Drawings
FIG. 1 is a flowchart of a target detection method according to an embodiment of the present application;
FIG. 2 is a block diagram of a conventional three-dimensional object detection model;
FIG. 3 is a block diagram of a conventional residual block;
FIG. 4 is a block diagram of a network topology optimization module according to an embodiment of the present application;
fig. 5 is a schematic diagram of an optimized feature extraction network structure according to an embodiment of the present application;
FIG. 6 is a graph illustrating the effect of the IOU configuration on regression performance in accordance with an embodiment of the present application;
fig. 7 is a schematic structural diagram of a multi-source information deep fusion network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a network framework for deep fusion of image data and point cloud data according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a detection system according to a detection method of the present application;
fig. 10 is a schematic structural diagram of a detection system according to an embodiment of the present application.
Detailed Description
The present invention will be described in further detail below with reference to the detailed description and the accompanying drawings, wherein like elements in different embodiments are given like reference numerals. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials or methods, in different cases. In some cases, certain operations related to the present application are not shown or described in the specification in order to avoid burying the core of the present application in excessive description; a detailed description of these operations is not necessary for those skilled in the art, who can fully understand them from the description in the specification and from general technical knowledge in the art.
Furthermore, the features, operations or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Likewise, the steps or actions in the described methods may be reordered or transposed in ways that are apparent to those skilled in the art. Therefore, the various orders given in the specification and drawings serve only to describe particular embodiments clearly and do not imply a required order, unless it is otherwise stated that a certain order must be followed.
In the embodiments of the invention, target detection precision is improved by optimizing the structure of the feature extraction network and the regional nomination network in an existing three-dimensional target detection framework and combining them with a multi-source data deep fusion network. The detection method and system of the present application are not limited to three-dimensional target detection and may also be applied to two-dimensional target detection, such as detection from image information alone.
In current intelligent traffic systems, information is commonly acquired with traffic sensors centered on cameras. However, the detection area of such equipment is small; when the light changes suddenly, cameras suffer from under-exposure and over-exposure, so it is difficult to provide high-precision detection results with a camera-only perception algorithm, and cameras are not easy to relocate or maintain once installed. In recent years, lidar has become one of the most important sensors in automatic driving and is often applied to target detection. Lidar is unaffected by illumination and directly acquires accurate three-dimensional information, but the generated data only covers the obstacle-free region between each point and the laser origin and the surface of each obstacle; whether obstacles exist in the remaining regions is undetermined, which can be compensated by the information provided by a camera. The target detection method and system of the present application are described in detail below, taking a vehicle as the target.
Embodiment one:
referring to fig. 1, the present embodiment provides an automatic driving-oriented multi-source sensor fusion target detection method, including:
step 101: and acquiring image information and point cloud information to be detected.
In this embodiment, RGB image information is acquired by a camera to obtain image data; the raw image data is a 700 × 800 × 3 matrix whose values are the pixel intensity values, such as the image data input in Figs. 7 and 9. Point cloud information is acquired by a lidar to obtain point cloud data; the raw point cloud data is a 700 × 800 × 5 matrix, such as the point cloud data input in Figs. 7 and 9. The KITTI dataset, in which the camera and lidar data are already jointly registered, is used. (KITTI is a dataset for evaluating computer vision technologies such as stereo imaging, optical flow, visual odometry, 3D object detection and 3D tracking in a vehicle-mounted environment; it consists of 389 pairs of stereo and optical flow images, 39.2 km of visual odometry sequences, and images annotated with more than 200,000 3D objects.)
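As an illustrative sketch only (the patent does not specify an implementation), the two raw inputs could be prepared as follows, assuming the image and the projected point cloud are stored as NumPy arrays with the shapes given above; the file paths and loading format are hypothetical.

```python
# Minimal sketch: loading one camera frame and the corresponding projected
# point-cloud map, and shaping them into the 700 x 800 x 3 and 700 x 800 x 5
# matrices described above. File names are hypothetical.
import numpy as np
import torch

def load_sample(image_path: str, cloud_path: str):
    image = np.load(image_path)        # expected shape (700, 800, 3), RGB values
    point_cloud = np.load(cloud_path)  # expected shape (700, 800, 5), projected lidar channels
    assert image.shape == (700, 800, 3) and point_cloud.shape == (700, 800, 5)

    # Convert to channel-first float tensors with a batch dimension, since a
    # convolutional feature extraction network consumes (N, C, H, W) input.
    image_t = torch.from_numpy(image).float().permute(2, 0, 1).unsqueeze(0)
    cloud_t = torch.from_numpy(point_cloud).float().permute(2, 0, 1).unsqueeze(0)
    return image_t, cloud_t
```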
Step 102: and inputting the image information and the point cloud information into a feature extraction network for feature extraction to respectively obtain image feature information and point cloud feature information.
The model of the feature extraction network of this embodiment is shown in Fig. 5. It includes five convolutional layers, a max-pooling layer disposed between the first and second convolutional layers, and an average-pooling layer disposed after the fifth convolutional layer. Fig. 3 is a block diagram of a conventional residual module, and Fig. 4 shows the structure of the first convolutional layer in this embodiment; the structure in parentheses in each of the remaining convolutional (Conv) layers is similar to that shown for the first convolutional layer (Fig. 3). The 13 following the bracket indicates 13 branches, which can also be understood as the number of groups in that convolutional layer, and the value after the 13 indicates the number of layers in the convolutional layer; for example, the first convolutional layer includes three layers.
The main function of the feature extraction network is to complete feature extraction from the raw data. Its input is a matrix of raw data, e.g. image data, and its output is feature data (in matrix form), e.g. a matrix of image feature information. The matrix produced from the input data by operations (matrix operations) such as the pooling and convolution shown in Fig. 5 is called feature data.
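For illustration, a minimal PyTorch sketch of such a backbone layout — five convolutional stages, a max-pooling layer between the first and second stages, and an average-pooling layer after the fifth — might look as follows; the channel widths and strides are assumptions, and plain convolutions stand in for the residual modules described later.

```python
# Minimal sketch of the backbone layout in Fig. 5 (channel widths/strides are
# illustrative, not taken from the patent).
import torch
import torch.nn as nn

class FeatureExtractionBackbone(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels, 64, 7, stride=2, padding=3),
                                   nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # conv2..conv5 would be stacks of the 13-branch residual modules described
        # later; plain strided convolutions stand in for them in this sketch.
        widths = [64, 128, 256, 512, 1024]
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(widths[i], widths[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(widths[i + 1]), nn.ReLU(inplace=True))
            for i in range(4)
        ])
        self.avgpool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.maxpool(self.conv1(x))
        feature_maps = []
        for stage in self.stages:
            x = stage(x)
            feature_maps.append(x)   # per-layer maps are what the fusion network consumes
        return feature_maps, self.avgpool(x)
```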
The method first establishes a feature extraction network formed by a PLNet network, and then inputs the image information and the point cloud information into the feature extraction network to obtain the image feature information and the point cloud feature information.
In general, the feature extraction network includes a plurality of residual modules, for example 16 residual modules; each residual module includes a plurality of branches, for example 10 to 16 branches, and each branch includes a plurality of convolutional layers. Different branch network structures and numbers of branches affect the feature extraction.
The PLNet network of the feature extraction network comprises 16 residual modules. Each residual module comprises 13 parallel branches, and each branch comprises three convolutional layers: the length and width of the convolution kernels of the first convolutional layer are both 1 and the number of convolution kernels is set to 9; the length and width of the convolution kernels of the second convolutional layer are both 3 and the number of convolution kernels is set to 9; the length and width of the convolution kernels of the third convolutional layer are both 1 and the number of convolution kernels is set to 256. The PLNet network is designed by optimizing the model topology of the feature extraction network: on the basis of the ResNet residual module, the single path is changed into a weighted stack of thirteen identical modules, so that the feature extraction network achieves higher detection accuracy with comparable computational complexity and parameters. The parameter design mainly follows two principles: (1) when the outputs have the same spatial size, the hyper-parameters of the modules are kept consistent; (2) each time the spatial size is reduced by a factor of 2, the width of the modules is doubled. A conventional residual module computes its output as:
F(x)=x+f(x)
where F(x) is the feature vector output after passing through the residual network, x is the input feature vector, and f(x) is the output feature vector obtained by applying the residual-network transformation function to the input feature vector.
The structure of the network topology optimization module is shown in Fig. 4, and its output is computed as

F(x) = x + Σ_{i=1}^{C} T_i(x)
where F(x) is the feature vector output after passing through the residual network; x is the input feature vector; f(x) is the output feature vector obtained through the residual-network transformation function; C is the cardinality of the topology, i.e. the number of branches; and T_i(x) is the transformation function of each branch, T_i(x) = w_i · x_i (i = 1 … 13), where w_i is the weight of each branch and x_i is the i-th group of low-dimensional features obtained by splitting the input high-dimensional feature vector into C groups.
That is, the input feature vector is split into low-dimensional features x_i; after feature information is extracted by the convolutional transformation, an aggregation operation is performed and a weight w_i is assigned to each branch. The neurons are therefore no longer a linear superposition of the previous layer's input features, but a weighted sum of features transformed by the same structure T, finally combined with the shortcut connection.
A comparison of network performance is shown in Table 1 (performance is measured in FLOPs, floating-point operations; C is the cardinality of the topology, i.e. the number of branches, and d is the dimension of the feature vector). As Table 1 shows, training was performed with the cardinality C in the ResNet-50 network structure set to 1, 2, 4, 8, 13 and 32, with the structural design keeping the computational complexity and model parameters comparable. From the training results, the validation error drops by 0.7% when C = 13 and rises again when the cardinality is increased to 32; when the network depth is increased to 101 layers, the computational complexity doubles and the validation error is higher than that of the network structures with cardinality 4, 8, 13 or 32. From Table 1 it can be concluded that, in a design that increases the cardinality while keeping the computational complexity and model parameters from jumping greatly — i.e. the complexity stays around 3.8 × 10^9 FLOPs and the parameter count around 23.4 × 10^6 — increasing the cardinality is more effective than increasing network depth or width: the computational complexity and number of network parameters are greatly reduced, and the cache generated during training also drops notably. The backbone of the feature extraction module is therefore chosen to be the PLNet network, improved on the basis of ResNet-50, with a topology cardinality of 13.
In the three-layer convolution design of each branch of a residual module, the convolution kernels of the first layer have length and width 1 and the number of kernels is set to 9; the kernels of the second layer have length and width 3 and the number of kernels is set to 9; and the kernels of the third layer have length and width 1 and the number of kernels is set to 256. The 13 branches together form a 13-path stacked residual module (see Fig. 4), and the feature extraction module is formed by the 50 convolutional layers of the PLNet network; a structural block diagram is shown in Fig. 5.
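A minimal sketch of one such 13-branch module, assuming a PyTorch implementation and reading the weighted aggregation formula above literally (each branch's first 1 × 1 convolution plays the role of projecting the shared input onto a low-dimensional group x_i), could be:

```python
# Minimal sketch of a PLNet residual module with cardinality C = 13.
# Each branch is a 1x1(9) -> 3x3(9) -> 1x1(256) stack; branch outputs are
# weighted, summed, and added to the shortcut: F(x) = x + sum_i w_i * T_i(x).
import torch
import torch.nn as nn

class PLNetResidualModule(nn.Module):
    def __init__(self, in_channels: int = 256, cardinality: int = 13):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, 9, kernel_size=1), nn.BatchNorm2d(9), nn.ReLU(inplace=True),
                nn.Conv2d(9, 9, kernel_size=3, padding=1), nn.BatchNorm2d(9), nn.ReLU(inplace=True),
                nn.Conv2d(9, 256, kernel_size=1), nn.BatchNorm2d(256),
            )
            for _ in range(cardinality)
        ])
        # Learnable per-branch weights w_i from the aggregation formula.
        self.branch_weights = nn.Parameter(torch.ones(cardinality))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Shortcut addition assumes in_channels == 256 (otherwise a projection is needed).
        aggregated = sum(w * branch(x) for w, branch in zip(self.branch_weights, self.branches))
        return self.relu(x + aggregated)
```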
TABLE 1 network Performance comparison
[Table 1 is reproduced as an image in the original publication; it compares FLOPs, parameter counts and validation errors of the ResNet-50-based structures for cardinalities C = 1, 2, 4, 8, 13 and 32.]
Step 103: and inputting the image characteristic information and the point cloud characteristic information into a depth fusion network for characteristic fusion to obtain fused characteristic information.
Fusion techniques are divided into early fusion and late fusion according to the fusion stage. In early fusion, the multi-source data are fused directly after preprocessing and features are then extracted by the feature extraction network. In late fusion, features are first extracted from the preprocessed multi-source data and the output feature data are then fused. Early fusion suffers from a large amount of irrelevant noise in the data, while late fusion loses a large amount of feature information. Considering that neither the raw data nor the feature data can be ignored in the fusion process, a multi-source data deep fusion framework is designed by combining the two data fusion strategies, as shown in Fig. 7, where the image data is the image information and the point cloud data is the point cloud information. On this basis, a deep fusion network framework based on image and point cloud data is designed in combination with the feature extraction network of the three-dimensional target detection network, as shown in Fig. 8: the input image information (the image data in the figure) and point cloud information (the point cloud data in the figure) are fused once at each layer during feature extraction.
The deep data fusion operations are performed synchronously with the feature extraction process and after feature extraction is completed, as shown in Fig. 8. The main function of the deep fusion network is to complete the fusion of the two data streams (image data and point cloud data). Because the fusion operations are performed simultaneously with feature extraction, with one fusion at each layer of the feature extraction, the inputs are the same as those of the feature extraction module, and the output is a single stream of feature data (one matrix), which serves as the input feature data for the subsequent operations.
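A minimal sketch of the per-layer fusion, under the assumption that the fusion operator is an element-wise average of the image and point cloud feature maps (the exact operator is shown only in Figs. 7 and 8), could be:

```python
# Minimal sketch of per-layer deep fusion; assumes the two backbones produce
# same-shaped feature maps at each layer and that fusion is an element-wise mean.
import torch

def fuse_per_layer(image_maps, cloud_maps):
    """image_maps / cloud_maps: lists of same-shaped feature tensors, one per layer."""
    fused = []
    for img_f, pc_f in zip(image_maps, cloud_maps):
        fused.append(0.5 * (img_f + pc_f))  # element-wise mean of the two modalities
    return fused                             # single stream of fused feature data
```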
Step 104: inputting the fused feature information into the regional nomination network to classify target and background.
The fused feature data are input into the regional nomination network for binary classification (foreground, i.e. a target is in the frame, versus background, i.e. no target is in the frame) and for regression of the nomination frames. The regional nomination network outputs the coordinates of the foreground frames and the probability that each is foreground (the output is in matrix form). The regression operation adjusts the coordinates of the preset frames so that they fit the real frames more closely (the real frames are labelled under supervised training and represented by coordinates), and the classification operation outputs the probability that a frame belongs to the foreground (screened by a preset threshold). The main function of the regional nomination network is to perform a first screening of the preset frames to obtain nomination frames that contain a target and better match the real frames.
The ROI operation in Fig. 9 pools nomination frames of different sizes into the same size (i.e. matrices of the same dimensions) before they are input into the classification regression network.
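For illustration, the ROI operation could be sketched with torchvision's roi_align, pooling every nomination frame to a fixed 7 × 7 grid; the box format and spatial scale below are assumptions.

```python
# Minimal sketch of the ROI operation: proposals of different sizes are pooled
# to a fixed grid so they can be fed to the classification regression network.
import torch
from torchvision.ops import roi_align

def pool_proposals(fused_feature_map, boxes, output_size=7, spatial_scale=1.0 / 16):
    # boxes: (K, 5) tensor [batch_index, x1, y1, x2, y2] in input-image coordinates
    return roi_align(fused_feature_map, boxes, output_size=output_size,
                     spatial_scale=spatial_scale, sampling_ratio=2)
```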
Specifically, when the preset anchor frames are classified, an IOU threshold needs to be set in advance. This threshold is used for a first screening of the preset frames: when the IOU between a preset frame and a real frame is greater than the threshold, the frame is kept as a nomination frame, regressed once and output; when the IOU is smaller than the threshold, the frame is deleted as background. The setting of this threshold therefore has an important influence on the performance of the regional nomination network. In this embodiment the influence of the threshold setting on the regression task was evaluated experimentally, with the results shown in Fig. 6. When the IOU threshold is set to 0.5, the curve shows the best regression effect for input IOU in the range 0.5-0.55; similarly, when the threshold is set to 0.6, the curve performs best for input IOU in the range 0.6-0.7; and when the threshold is set to 0.7, the curve performs best for input IOU in the range 0.65-0.75. It follows that a single threshold cannot satisfy the regression operation for all pre-selected boxes: the regression works best only when the IOU between the preset frame and the real frame is near the threshold.
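A minimal sketch of this IOU-based screening step, assuming axis-aligned boxes in (x1, y1, x2, y2) form and using torchvision's box_iou, could be:

```python
# Minimal sketch of IOU screening: keep preset boxes whose best IOU with any
# ground-truth box exceeds the threshold, and discard the rest as background.
import torch
from torchvision.ops import box_iou

def screen_boxes(preset_boxes, gt_boxes, iou_threshold=0.5):
    iou = box_iou(preset_boxes, gt_boxes)   # (num_preset, num_gt) pairwise IOU
    best_iou, _ = iou.max(dim=1)            # best overlap of each preset box
    keep = best_iou > iou_threshold         # foreground mask
    return preset_boxes[keep], keep
```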
In this embodiment, based on the idea of ensemble algorithms in machine learning, a DRPN (serial regional nomination network) is designed: two layers of detection networks are integrated in one network, the data obtained by fusing the point cloud data and image data are used as the input of the DRPN, and nomination frames are output after two rounds of screening and adjustment. The nomination network of this embodiment is a serial regional nomination network comprising a first classification module, a first regressor, a second classification module and a second regressor, with the first and second classification modules set to different values. Through repeated trials and analysis, the IOU value of the first classification module is set to any value in [0.45, 0.55] to preliminarily screen the scanning frames and separate them into target and background, and the first regressor preliminarily adjusts the frames judged to be targets; the IOU value of the second classification module is set to any value in (0.55, 0.65) to screen the preliminarily adjusted frames again, the nomination frames judged to be targets are output, and the second regressor adjusts them once more. This already produces acceptable results, but the classification and regression work best only when the IOU value of the first classification module is set to 0.5 and that of the second is set to 0.6. That is, in this embodiment the fused data first enter the first classification module, whose threshold is set to 0.5, to screen the scanning frames into foreground (overlap with a real target greater than 0.5) and background (overlap with every real target frame no greater than 0.1); the first regressor preliminarily adjusts the positions of the frames judged to be foreground so that they fit the real frames more closely. The IOU threshold of the second classification module is set to 0.6; foreground and background are judged in the same way, and the frames judged to be foreground are adjusted a second time and output as nomination frames so that they fit the real frames still better. Finally, the nomination frames are input into the second-stage detection module to generate prediction frames. Experimental results show that the DRPN can effectively improve the detection accuracy of the model.
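A minimal sketch of the cascading control flow of the DRPN — first-stage screening at threshold 0.5 with a first regression adjustment, then second-stage screening at 0.6 with a second adjustment — could look as follows; the classifier and regressor modules are placeholders and the box adjustment is simplified.

```python
# Minimal sketch of the DRPN cascade logic; only the two-stage control flow
# follows the description above, the stage modules themselves are placeholders.
import torch
from torchvision.ops import box_iou

def drpn_cascade(boxes, gt_boxes, stage1, stage2, thr1=0.5, thr2=0.6):
    """stage1/stage2: callables returning (foreground_scores, box_deltas) for given boxes."""
    # Stage 1: preliminary screening and adjustment.
    keep1 = box_iou(boxes, gt_boxes).max(dim=1).values > thr1
    boxes = boxes[keep1]
    _, deltas1 = stage1(boxes)
    boxes = boxes + deltas1        # first regression adjustment (simplified additive form)

    # Stage 2: re-screening of the adjusted boxes and a second adjustment.
    keep2 = box_iou(boxes, gt_boxes).max(dim=1).values > thr2
    boxes = boxes[keep2]
    _, deltas2 = stage2(boxes)
    return boxes + deltas2         # nomination frames passed to the detection head
```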
Step 105: and inputting the target into the classification regression network for further classification to obtain the class information of the target.
The main functions of the classification regression network are to further classify the nomination frames that contain targets into multiple classes, obtaining the specific target class for each nomination frame, and to perform a further regression on the nomination frames so that they fit the real frames. It outputs the final detection result (detection frame coordinates and probabilities), where the probability matrix holds the probability of belonging to each class; the class with the highest probability is finally taken as the class of the target in the detection frame (in matrix form). For example, if the target in a detection frame is scored 0.9 for car, 0.8 for truck and 0.7 for bicycle, the target in that frame is determined to be a car. The nomination frames are input into the classification regression network to generate prediction frames, and the class and probability of each nomination frame are output; for instance, if the class of a nomination frame is identified as car with probability 0.8, the probability that the target is a car is 0.8.
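A minimal sketch of this final category selection, with a hypothetical label set, could be:

```python
# Minimal sketch: pick the class with the highest score for each proposal.
import torch

CLASSES = ["car", "truck", "bicycle"]   # hypothetical label set

def pick_categories(class_scores: torch.Tensor):
    """class_scores: (num_proposals, num_classes) per-class probability scores."""
    scores, labels = class_scores.max(dim=1)
    return [(CLASSES[i], float(s)) for i, s in zip(labels.tolist(), scores.tolist())]

# Example matching the text: scores (0.9, 0.8, 0.7) for car/truck/bicycle -> "car".
print(pick_categories(torch.tensor([[0.9, 0.8, 0.7]])))
```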
In this embodiment, the raw data matrices composed of the image information and the point cloud information are processed by the feature extraction network and the deep fusion network (the deep fusion operations are performed during and after feature extraction, simultaneously with the feature extraction operations) to obtain the feature data matrix. The regional nomination network yields the foreground probabilities and the coordinate set of the nomination frames, and the classification regression operation yields the multi-class probability values and the coordinate set of the detection frames, completing the conversion from nomination frames to detection frames and finally outputting the identified classes. The detection precision of the target detection method of this embodiment is thereby greatly improved.
Taking the vehicle detection of this embodiment as an example, Table 2 compares the performance of the model designed in the present invention with that of the latest existing three-dimensional target detection models, i.e. a conventional three-dimensional detection network versus the optimized three-dimensional detection network (KITTI dataset, 120,000 iterations). Table 2 lists the Top-1 and Top-5 errors for six vehicle categories. Here the Top-1 error is (the number of samples whose correct label differs from the model's best output label) / (the total number of samples), i.e. the probability that the class with the highest output probability is not the true class; the Top-5 error is (the number of samples whose correct label is not among the model's five best output labels) / (the total number of samples), i.e. the probability that, after sorting the output probabilities of all classes, the true class is not among the five classes with the highest probability. The comparison shows that the three-dimensional traffic vehicle detection method proposed by the invention reduces both the Top-1 and the Top-5 error, indicating that the proposed detection method can effectively lower the target detection error rate and improve the accuracy of visual detection.
TABLE 2
[Table 2 is reproduced as images in the original publication; it lists the Top-1 and Top-5 errors of the conventional and the optimized three-dimensional detection networks for six vehicle categories on the KITTI dataset after 120,000 iterations.]
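For reference, the Top-1 and Top-5 errors used in Table 2 can be computed as sketched below (an illustration, not code from the patent):

```python
# Minimal sketch of the Top-1 / Top-5 error metrics: the fraction of samples whose
# true class is not the highest-scoring class (Top-1) or not among the five
# highest-scoring classes (Top-5).
import torch

def top_k_error(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """scores: (num_samples, num_classes); targets: (num_samples,) true class indices."""
    topk = scores.topk(k, dim=1).indices                 # (num_samples, k)
    hit = (topk == targets.unsqueeze(1)).any(dim=1)      # is the true class within the top k?
    return 1.0 - hit.float().mean().item()

# top_k_error(scores, targets, 1) -> Top-1 error; top_k_error(scores, targets, 5) -> Top-5 error
```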
Embodiment two:
referring to fig. 10, the present embodiment provides an automatic driving-oriented multi-source sensor fusion target detection system, which includes an information obtaining module 201, a feature extracting module 202, a feature fusion module 203, a region nomination module 204, and a classification regression module 205.
The information obtaining module 201 is configured to obtain image information to be detected and point cloud information.
The feature extraction module 202 is configured to perform feature extraction on the image information and the point cloud information input feature extraction network, so as to obtain image feature information and point cloud feature information, respectively.
The feature fusion module 203 is configured to input the image feature information and the point cloud feature information into a depth fusion network for feature fusion, so as to obtain fused feature information.
The region nomination module 204 is configured to input the fused feature information into a region nomination network to classify the target and the background.
The classification regression module 205 is used to input the target into the classification regression network for further classification to obtain the class information of the target.
In this embodiment, the image information and the point cloud information are acquired by the information acquisition module 201, and the raw data matrices composed of the image information and the point cloud information are processed by the feature extraction module 202 and the feature fusion module 203 to obtain the feature information data. The region nomination module 204 obtains the foreground probabilities and the coordinate set of the nomination frames from the feature information data, and the classification regression module 205 obtains the multi-class probability values and the coordinate set of the detection frames, completing the conversion from nomination frames to detection frames and finally outputting the identified classes. The detection precision of the target detection system of this embodiment is thereby greatly improved.
Embodiment three:
The present embodiment provides a computer device, which includes a memory for storing a program and a processor for implementing the object detection method according to the first embodiment by executing the program stored in the memory.
Embodiment four:
A computer-readable storage medium, the storage medium comprising a program, the program being executable by a processor to implement the method of object detection as provided in the first embodiment.
Those skilled in the art will appreciate that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by computer programs. When all or part of the functions are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include a read-only memory, a random access memory, a magnetic disk, an optical disk, a hard disk and the like; the functions are realized when the program is executed by a computer. For example, the program may be stored in a memory of the device, and all or part of the functions described above are implemented when the program in the memory is executed by a processor. In addition, the program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and downloaded or copied to a memory of the local device, or used to update the version of the local device's system; all or part of the functions in the above embodiments are likewise implemented when the program in that memory is executed by a processor.
The present invention has been described above with reference to specific examples, which are only intended to aid understanding of the invention and are not intended to limit it. For a person skilled in the art to which the invention pertains, several simple deductions, modifications or substitutions may be made according to the idea of the invention.

Claims (10)

1. A multi-source sensor fusion target detection method for automatic driving, characterized by comprising the following steps:
acquiring image information and point cloud information to be detected;
inputting the image information and the point cloud information into a feature extraction network for feature extraction to respectively obtain image feature information and point cloud feature information;
inputting the image characteristic information and the point cloud characteristic information into a depth fusion network for characteristic fusion to obtain fused characteristic information;
inputting the fused feature information into a regional nomination network to classify the target and the background;
and inputting the target into a classification regression network for further classification to obtain the class information of the target.
2. The target detection method of claim 1, wherein the feature extraction network comprises five convolutional layers, one max-pooling layer disposed between the first convolutional layer and the second convolutional layer, and one average-pooling layer disposed after the fifth convolutional layer.
3. The object detection method of claim 2, wherein inputting the image information and the point cloud information into a feature extraction network for feature extraction to obtain image feature information and point cloud feature information, respectively, comprises:
providing a PLNet network to form a feature extraction network;
inputting the image information and the point cloud information into the feature extraction network to obtain image feature information and point cloud feature information;
the PLNet network comprises not less than 10 residual modules, each residual module comprises 10-16 parallel branches, and each branch comprises a plurality of convolutional layers.
4. The target detection method of claim 3, wherein the PLNet network includes 16 residual modules;
each residual module comprises 13 parallel branches, and each branch comprises three convolutional layers, wherein the length and width of the convolution kernels in the first convolutional layer are both 1 and the number of convolution kernels is set to 9, the length and width of the convolution kernels in the second convolutional layer are both 3 and the number of convolution kernels is set to 9, and the length and width of the convolution kernels in the third convolutional layer are both 1 and the number of convolution kernels is set to 256.
5. The target detection method of claim 1, wherein the inputting the image feature information and the point cloud feature information into a depth fusion network for feature fusion, and obtaining the fused feature information comprises:
and fusing the image feature information and the point cloud feature information at each layer while feature extraction is performed on the image information and the point cloud information, so as to obtain the fused feature information.
6. The target detection method of claim 1, wherein the regional nomination network is a serial regional nomination network comprising a first classification module, a first regressor, a second classification module and a second regressor;
inputting the fused feature information into a regional nomination network to classify targets and backgrounds comprises:
setting the IOU value of the first classification module to any value in [0.45, 0.55], preliminarily screening the scanning frames, preliminarily separating them into target and background, and preliminarily adjusting the scanning frames judged to be targets with the first regressor;
and setting the IOU value of the second classification module to any value in (0.55, 0.65), screening the preliminarily adjusted scanning frames again, outputting the nomination frames judged to be targets, and adjusting the scanning frames judged to be targets again with the second regressor.
7. The target detection method of claim 4, wherein inputting the target into a classification regression network for further classification to obtain the class information of the target comprises:
inputting the re-adjusted nomination frame into a classification regression network to obtain the probability that each target in the nomination frame is in each category, and selecting the category with the maximum probability as the category of the target.
8. An autopilot-oriented multi-source sensor fusion target detection system, comprising:
the information acquisition module is used for acquiring image information to be detected and point cloud information;
the feature extraction module is used for inputting the image information and the point cloud information into a feature extraction network for feature extraction to respectively obtain image feature information and point cloud feature information;
the feature fusion module is used for inputting the image feature information and the point cloud feature information into a depth fusion network for feature fusion to obtain fused feature information;
the regional nomination module is used for inputting the fused characteristic information into a regional nomination network so as to classify the target and the background;
and the classification regression module is used for inputting the target into a classification regression network for further classification so as to obtain the class information of the target.
9. A computer device, comprising:
a memory for storing a program;
a processor for implementing the object detection method of any one of claims 1-5 by executing the program stored by the memory.
10. A computer-readable storage medium characterized by comprising a program executable by a processor to implement the object detection method according to any one of claims 1 to 5.
CN202010250235.1A 2020-04-01 2020-04-01 Multi-source sensor fusion target detection method and system for automatic driving Active CN111461221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010250235.1A CN111461221B (en) 2020-04-01 2020-04-01 Multi-source sensor fusion target detection method and system for automatic driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010250235.1A CN111461221B (en) 2020-04-01 2020-04-01 Multi-source sensor fusion target detection method and system for automatic driving

Publications (2)

Publication Number Publication Date
CN111461221A true CN111461221A (en) 2020-07-28
CN111461221B CN111461221B (en) 2023-05-23

Family

ID=71685761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010250235.1A Active CN111461221B (en) 2020-04-01 2020-04-01 Multi-source sensor fusion target detection method and system for automatic driving

Country Status (1)

Country Link
CN (1) CN111461221B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100067A (en) * 2020-09-10 2020-12-18 北京完美赤金科技有限公司 Test method, system and storage medium based on regression analysis
CN113034446A (en) * 2021-03-08 2021-06-25 国网山东省电力公司平邑县供电公司 Automatic transformer substation equipment defect identification method and system
CN113222968A (en) * 2021-05-28 2021-08-06 上海西井信息科技有限公司 Detection method, system, equipment and storage medium fusing millimeter waves and images
CN113255779A (en) * 2021-05-28 2021-08-13 中国航天科工集团第二研究院 Multi-source perception data fusion identification method and system and computer readable storage medium
CN113344587A (en) * 2021-08-05 2021-09-03 北京轻松筹信息技术有限公司 Data grade determining method and device, electronic equipment and storage medium
CN113705643A (en) * 2021-08-17 2021-11-26 荣耀终端有限公司 Target detection method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348384A (en) * 2019-07-12 2019-10-18 沈阳理工大学 A kind of Small object vehicle attribute recognition methods based on Fusion Features
CN110570457A (en) * 2019-08-07 2019-12-13 中山大学 Three-dimensional object detection and tracking method based on stream data
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN110765894A (en) * 2019-09-30 2020-02-07 杭州飞步科技有限公司 Target detection method, device, equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348384A (en) * 2019-07-12 2019-10-18 沈阳理工大学 A kind of Small object vehicle attribute recognition methods based on Fusion Features
CN110570457A (en) * 2019-08-07 2019-12-13 中山大学 Three-dimensional object detection and tracking method based on stream data
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN110765894A (en) * 2019-09-30 2020-02-07 杭州飞步科技有限公司 Target detection method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU Yuanzhi: "Vehicle target detection method based on fusion of lidar point cloud and image" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100067A (en) * 2020-09-10 2020-12-18 北京完美赤金科技有限公司 Test method, system and storage medium based on regression analysis
CN112100067B (en) * 2020-09-10 2023-08-25 北京完美赤金科技有限公司 Regression analysis-based test method, system and storage medium
CN113034446A (en) * 2021-03-08 2021-06-25 国网山东省电力公司平邑县供电公司 Automatic transformer substation equipment defect identification method and system
CN113222968A (en) * 2021-05-28 2021-08-06 上海西井信息科技有限公司 Detection method, system, equipment and storage medium fusing millimeter waves and images
CN113255779A (en) * 2021-05-28 2021-08-13 中国航天科工集团第二研究院 Multi-source perception data fusion identification method and system and computer readable storage medium
CN113222968B (en) * 2021-05-28 2023-04-18 上海西井信息科技有限公司 Detection method, system, equipment and storage medium fusing millimeter waves and images
CN113255779B (en) * 2021-05-28 2023-08-18 中国航天科工集团第二研究院 Multi-source perception data fusion identification method, system and computer readable storage medium
CN113344587A (en) * 2021-08-05 2021-09-03 北京轻松筹信息技术有限公司 Data grade determining method and device, electronic equipment and storage medium
CN113344587B (en) * 2021-08-05 2022-04-05 北京轻松筹信息技术有限公司 Data grade determining method and device, electronic equipment and storage medium
CN113705643A (en) * 2021-08-17 2021-11-26 荣耀终端有限公司 Target detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN111461221B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111461221A (en) Multi-source sensor fusion target detection method and system for automatic driving
AU2019101133A4 (en) Fast vehicle detection using augmented dataset based on RetinaNet
CN109816024B (en) Real-time vehicle logo detection method based on multi-scale feature fusion and DCNN
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN105160309B (en) Three lanes detection method based on morphological image segmentation and region growing
JP2022515895A (en) Object recognition method and equipment
CN111274976A (en) Lane detection method and system based on multi-level fusion of vision and laser radar
KR101771146B1 (en) Method and apparatus for detecting pedestrian and vehicle based on convolutional neural network using stereo camera
CN110942000A (en) Unmanned vehicle target detection method based on deep learning
CN114565900A (en) Target detection method based on improved YOLOv5 and binocular stereo vision
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN110427797B (en) Three-dimensional vehicle detection method based on geometric condition limitation
KR101908481B1 (en) Device and method for pedestraian detection
CN113034378B (en) Method for distinguishing electric automobile from fuel automobile
CN117058646B (en) Complex road target detection method based on multi-mode fusion aerial view
US20070223785A1 (en) Image processor and method
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
CN110458234B (en) Vehicle searching method with map based on deep learning
CN111862147A (en) Method for tracking multiple vehicles and multiple human targets in video
BOURJA et al. Real time vehicle detection, tracking, and inter-vehicle distance estimation based on stereovision and deep learning using YOLOv3
CN108288041B (en) Preprocessing method for removing false detection of pedestrian target
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN116486287A (en) Target detection method and system based on environment self-adaptive robot vision system
CN113723469A (en) Interpretable hyperspectral image classification method and device based on space-spectrum combined network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant