CN111602138B - Object detection system and method based on artificial neural network


Info

Publication number
CN111602138B
Authority
CN
China
Prior art keywords
attention
feature map
coefficient
target object
predicted
Prior art date
Legal status
Active
Application number
CN201980008366.4A
Other languages
Chinese (zh)
Other versions
CN111602138A
Inventor
蒋卓键
陈晓智
Current Assignee
Shenzhen Zhuoyu Technology Co ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd
Publication of CN111602138A
Application granted
Publication of CN111602138B

Classifications

    • G06V 20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/084: Learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 20/64: Three-dimensional objects


Abstract

An object detection system and method based on an artificial neural network. The method comprises the following steps: acquiring a three-dimensional point cloud and obtaining a first feature map of the three-dimensional point cloud using a backbone neural network (S101); processing the first feature map using an attention branch neural network to obtain a second feature map, where each position of the second feature map carries a predicted attention coefficient for that position, and the second feature map is further used to obtain a loss function of the target object, which is used to update the network coefficients of the attention branch neural network (S102); and obtaining a prediction result, including position information of the target object, from the second feature map (S103).

Description

Object detection system and method based on artificial neural network
Copyright declaration
The disclosure of this patent document contains material that is subject to copyright protection, owned by the copyright owner. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent files or records.
Technical Field
The invention relates to the technical field of three-dimensional object detection and deep learning, and more particularly, to an object detection system and method based on an artificial neural network.
Background
Safety is one of the foremost concerns in automatic driving. On the algorithmic side, accurate perception of the surrounding environment by the unmanned vehicle is the basis for ensuring safety, so the accuracy of the perception algorithm is very important. During unmanned driving, the vehicle needs to detect the three-dimensional objects around it. At present, a lidar is mostly used to detect three-dimensional objects, and when a detected three-dimensional object is partially occluded, conventional detection methods perform poorly because part of the point cloud is missing.
Therefore, how to improve the detection of occluded three-dimensional objects is a problem to be solved.
Disclosure of Invention
The invention provides an object detection system and method based on an artificial neural network, which can improve the prediction of occluded objects compared with the prior art.
In a first aspect, a method of object detection based on an artificial neural network is provided, the method comprising: acquiring a three-dimensional point cloud, and obtaining a first feature map of the three-dimensional point cloud using a backbone neural network; processing the first feature map using an attention branch neural network to obtain a second feature map, where the second feature map is further used to obtain a loss function of the target object, and the loss function is used to update the network coefficients of the attention branch neural network; and obtaining a prediction result from the second feature map, where the prediction result includes position information of the target object.
Optionally, a second feature map with a predicted attention coefficient generated at each position can be obtained from the first feature map, which is determined from the three-dimensional point cloud data of the occluded target object. After the predicted attention coefficients are corrected or updated according to an attention loss function computed from the predicted and true-value attention coefficients, the predicted attention coefficients of the visible part of the occluded target object become higher than those of the occluded part, so the visible part can be exploited to a greater extent during prediction and the position, size and other information of the target object can be predicted more accurately.
It should be appreciated that the method for object detection based on the artificial neural network provided in the embodiments of the present application may be applied to the automatic driving field of unmanned devices such as unmanned aerial vehicles or unmanned vehicles, to predict obstacles (such as other vehicles, pedestrians, etc.) in the surrounding environment of the unmanned movable device, where an obstacle (i.e., a target object) may be a partially occluded object. With the method provided by the embodiments of the present application, an artificial-neural-network-based object detection model that can obtain the position, size and other information of a target object from the visible-part information of the occluded target object can be trained through deep learning. During training, the attention mechanism leads the object detection model to give more weight to the information of the visible part of the target object, i.e., the model becomes more sensitive to the visible-part information, so that in subsequent prediction it can acquire the information of the target object more accurately from the visible part.
With reference to the first aspect, in certain implementations of the first aspect, processing the first feature map using the attention branch neural network and obtaining a second feature map includes: dividing the first feature map into candidate frames; generating, by the attention branch neural network, predicted attention coefficients for the candidate frames of the first feature map, where the value of the predicted attention coefficient of each candidate frame is determined from a sample feature map that matches the first feature map; and dot-multiplying the predicted attention coefficients with the first feature map to obtain the second feature map.
Optionally, before training the detection model, a sample library may first be established, containing sample feature maps of the target object. A sample feature map includes true-value attention coefficients; illustratively, the true-value attention coefficients corresponding to the visible portion of the target object in the sample feature map are higher than those of the occluded portion.
Alternatively, the sample feature map of the target object may share the same per-position point cloud feature information as the first feature map acquired by the detection model, differing only in the attention coefficient generated for each position.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: comparing the predicted attention coefficient of a candidate frame of the second feature map with the attention coefficient of the truth frame in the sample feature map corresponding to that candidate frame; when the confidence between the predicted attention coefficient of the candidate frame and the true-value attention coefficient of the truth frame is higher than a first threshold, determining a result of an attention loss function from the predicted and true-value attention coefficients; and updating the attention branch neural network coefficients according to the result of the attention loss function, so that the confidence of the attention branch neural network coefficients is higher than a second threshold.
Alternatively, the second threshold may be higher than the first threshold. In other words, updating the predicted attention coefficients brings their values closer to the corresponding true-value attention coefficients. The values of the first and second thresholds may be set flexibly, which is not limited in the embodiments of the present application.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: after updating the predicted attention coefficients of the second feature map, applying the base-e exponential (e being the natural constant) to the updated predicted attention coefficients.
It should be understood that after the e-exponentiation is applied to the updated predicted attention coefficients, the coefficients corresponding to the visible portion of the target object are separated more markedly from those corresponding to the occluded portion, which highlights the information of the visible portion.
With reference to the first aspect, in certain implementation manners of the first aspect, the updating the predicted attention coefficient according to the result of the attention loss function includes: updating the predicted attention coefficient by a back propagation algorithm according to the result of the attention loss function.
With reference to the first aspect, in certain implementations of the first aspect, the attention loss function is Σ_{n=0}^{k} [L_a(m_k, t_k)], where k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true-value attention coefficient.
With reference to the first aspect, in certain implementations of the first aspect, acquiring a three-dimensional point cloud and obtaining a first feature map of the three-dimensional point cloud using a backbone neural network includes: acquiring three-dimensional point cloud data of an occluded target object; performing three-dimensional grid division on the three-dimensional point cloud data to obtain a plurality of three-dimensional space voxels; obtaining the point cloud feature of each voxel according to the point cloud density within the voxel; and extracting the point cloud features using the backbone neural network to generate the first feature map.
With reference to the first aspect, in certain implementations of the first aspect, generating, by the attention branch neural network, predicted attention coefficients for the candidate frames of the first feature map includes: the attention branch neural network generates the predicted attention coefficients through one or more of a convolution operation, a fully-connected operation, and variants of the convolution operation.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: performing object detection on the target object through the artificial neural network, and obtaining the three-dimensional positions and confidences of the feature map candidate frames corresponding to the visible portion of the target object; sorting the confidences and selecting the candidate frames whose confidence is higher than a third threshold; and predicting the information of the target object from the candidate frames whose confidence is higher than the third threshold.
It should be understood that, in the prediction process, candidate frames are screened according to confidence, and the prediction result of the target object is determined from the information of the higher-confidence candidate frames.
With reference to the first aspect, in certain implementations of the first aspect, the information of the target object includes a position and/or a size of the target object.
With reference to the first aspect, in certain implementation manners of the first aspect, the method further includes: and displaying a prediction result obtained according to the second characteristic diagram.
It should be understood that when predicting the position or size of the target object, the method provided by the embodiments of the present application obtains a feature map in which the predicted attention coefficients of the visible portion are higher than those of the occluded portion, and obtains the prediction result of the target object from that feature map; the prediction result may be displayed directly on a display.
In a second aspect, a system for object detection based on an artificial neural network is provided, including at least one processor and a lidar. The lidar is configured to obtain a three-dimensional point cloud and to input the three-dimensional point cloud of the target object to the processor. The processor is configured to perform three-dimensional grid division on the three-dimensional point cloud to obtain a plurality of voxels; the processor is further configured to determine the point cloud feature of the position corresponding to each voxel according to the point cloud density within the voxel; the processor is further configured to extract the point cloud features through the backbone network of the object detection model and generate a first feature map of the target object; the processor is further configured to generate predicted attention coefficients in the first feature map through the attention branch neural network of the object detection model; the processor is further configured to calculate, using the loss-function branch neural network, the result of the attention loss function from the true-value attention coefficients in the sample feature map and the predicted attention coefficients; the processor is further configured to update the predicted attention coefficients according to the result of the attention loss function, so that the predicted attention coefficients generated for the feature map portion corresponding to the visible portion of the target object in the second feature map are higher than those of the feature map portion corresponding to the occluded portion; and the processor is further configured to obtain a prediction result from the visible-part information of the target object, where the prediction result includes position information of the target object.
It should be understood that, by determining the first feature map from the three-dimensional point cloud data of the occluded object, a second feature map with a predicted attention coefficient generated at each position can be obtained. After the predicted attention coefficients are corrected or updated according to an attention loss function computed from the predicted and true-value attention coefficients, the predicted attention coefficients of the visible part of the occluded object become higher than those of the occluded part, so the visible part can be exploited to a greater extent during prediction and the position, size and other information of the object can be predicted more accurately.
With reference to the second aspect, in certain implementations of the second aspect, the processor is further configured to divide the first feature map into candidate frames; the processor is further configured to generate, by the attention branch neural network, predicted attention coefficients for the candidate frames of the first feature map; and the processor is further configured to dot-multiply the predicted attention coefficients with the first feature map to obtain the second feature map.
Optionally, before training the detection system, a sample library may first be established, containing sample feature maps of the target object. A sample feature map includes true-value attention coefficients; illustratively, the true-value attention coefficients corresponding to the visible portion of the target object in the sample feature map are higher than those of the occluded portion.
With reference to the second aspect, in certain implementations of the second aspect, the processor is further configured to compare the predicted attention coefficient of a candidate frame of the second feature map with the attention coefficient of the truth frame in the sample feature map corresponding to that candidate frame; the processor is further configured to determine a result of an attention loss function from the predicted and true-value attention coefficients when the confidence between the predicted attention coefficient of the candidate frame and the true-value attention coefficient of the truth frame is higher than a first threshold; and the processor is further configured to update the predicted attention coefficients according to the result of the attention loss function, so that the confidence of the predicted attention coefficients is higher than a second threshold.
With reference to the second aspect, in certain implementations of the second aspect, the processor is further configured, when updating the predicted attention coefficients of the second feature map, to apply the base-e exponential (e being the natural constant) to the updated predicted attention coefficients.
It should be understood that after the e-exponentiation is applied to the updated predicted attention coefficients, the coefficients corresponding to the visible portion of the target object are separated more markedly from those corresponding to the occluded portion, which highlights the information of the visible portion.
With reference to the second aspect, in certain implementations of the second aspect, the processor is further configured to update the predicted attention coefficient by a back propagation algorithm according to a result of the attention loss function.
With reference to the second aspect, in certain implementations of the second aspect, the attention loss function is Σ_{n=0}^{k} [L_a(m_k, t_k)], where k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true-value attention coefficient.
With reference to the second aspect, in certain implementations of the second aspect, the processor is further configured to acquire three-dimensional point cloud data of the occluded target object; the processor is further configured to perform three-dimensional grid division on the three-dimensional point cloud data and obtain a plurality of three-dimensional space voxels; the processor is further configured to obtain the point cloud feature of each voxel according to the point cloud density within the voxel; and the processor is further configured to extract the point cloud features using the backbone neural network and generate the first feature map.
With reference to the second aspect, in certain implementations of the second aspect, the processor is configured to generate, by the attention branch neural network, predicted attention coefficients for the candidate frames of the first feature map, including: the attention branch neural network generates the predicted attention coefficients through one or more of a convolution operation, a fully-connected operation, and variants of the convolution operation.
With reference to the second aspect, in some implementations of the second aspect, the processor is configured to perform object detection on the target object through the artificial neural network and obtain the three-dimensional positions and confidences of the feature map candidate frames corresponding to the visible portion of the target object; the processor is further configured to sort the confidences and select the candidate frames whose confidence is higher than a third threshold; and the processor is further configured to predict the information of the target object from the candidate frames whose confidence is higher than the third threshold.
With reference to the second aspect, in certain implementations of the second aspect, the information of the target object includes a position and/or a size of the target object.
With reference to the second aspect, in certain implementations of the second aspect, the system further includes a display, where the display is configured to display a prediction result obtained according to the second feature map.
Optionally, the system provided by the embodiments of the present application may be applied to a movable device in the unmanned field, such as an unmanned aerial vehicle or an unmanned vehicle. The movable device can acquire the three-dimensional point cloud of an occluded target object through its lidar and predict the position and/or size of the target object from the visible part of the occluded object.
In a third aspect, a system for object detection based on an artificial neural network is provided, the system comprising a processing module and a receiving module, wherein the system is adapted to perform the method as described in any implementation of the first aspect.
In a fourth aspect, there is provided a computer storage medium having stored thereon a computer program which, when executed by a computer, causes the computer to perform the method provided by the first aspect.
In a fifth aspect, there is provided a chip system comprising at least one processor, wherein program instructions, when executed in the at least one processor, cause the method of any of the first aspects to be implemented.
In a sixth aspect, there is provided a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method provided by the first aspect.
The object detection method based on the artificial neural network provided by the embodiments of the present application may be applied to the automatic driving field of unmanned devices such as unmanned aerial vehicles or unmanned vehicles, to predict obstacles (such as other vehicles, pedestrians, etc.) in the surrounding environment of the unmanned movable device, where an obstacle (i.e., a target object) may be a partially occluded object. With the method provided by the embodiments of the present application, an artificial-neural-network-based object detection model that can obtain the position, size and other information of a target object from the visible-part information of the occluded target object can be trained through deep learning. During training, the attention mechanism leads the object detection model to give more weight to the information of the visible part of the target object, i.e., the model becomes more sensitive to the visible-part information, so that in subsequent prediction it can acquire the information of the target object more accurately from the visible part.
Drawings
Fig. 1 shows a schematic diagram of a scenario where an artificial neural network-based object detection method according to an embodiment of the present application is applied.
Fig. 2 shows a schematic flow chart of a method for object detection based on an artificial neural network according to an embodiment of the present application.
Fig. 3 shows a schematic flow chart of a method for object detection based on an artificial neural network according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a system for object detection based on an artificial neural network according to an embodiment of the present application.
Fig. 5 is a schematic diagram of another system for object detection based on an artificial neural network according to an embodiment of the present application.
Detailed Description
In order to facilitate understanding of the technical solutions provided by the embodiments of the present invention, some concepts related to the embodiments of the present invention are first described below.
1. Attention mechanism
In colloquial terms, the attention mechanism focuses attention on the important points while ignoring other, unimportant factors. It is similar to the human visual attention mechanism: when facing an image, human vision quickly scans the global image to obtain the target region that deserves attention, i.e., the focus of attention, then devotes more attention resources to that region to acquire more of the detailed information of interest while suppressing other useless information. How importance is determined may depend on the application scenario. According to the application scenario, attention mechanisms are divided into spatial attention and temporal attention; the former is generally used in image processing and the latter in natural language processing. Embodiments of the present application mainly involve spatial attention.
It should be appreciated that the object detection method provided in the embodiments of the present application is applicable to the autopilot scenario (as shown in Fig. 1). Specifically, during automatic driving, the unmanned vehicle can acquire a three-dimensional point cloud with its lidar to detect the surrounding environment and the three-dimensional objects in it. When a detected three-dimensional object is partially occluded, the missing point cloud causes missed detections, greatly degrading detection performance. The method is therefore designed to obtain a better detection result for occluded objects, so that the position and size of an occluded three-dimensional object can be predicted more accurately.
The method for detecting an object provided in the embodiments of the present application is further described below with reference to the accompanying drawings.
Fig. 2 shows a schematic flow chart of a method for object detection based on an artificial neural network according to an embodiment of the present application. The method comprises the following steps.
S101, acquiring a three-dimensional point cloud, and acquiring a first feature map of the three-dimensional point cloud by using a backbone neural network.
The three-dimensional point cloud is three-dimensional point cloud data of a partially occluded target object.
It should be appreciated that, before the three-dimensional point cloud data is acquired, a neural network detection model for three-dimensional objects is generated through a deep learning algorithm. In the training stage, the detection model may comprise a backbone network and network branches. The backbone network may be used to receive the three-dimensional point cloud data and generate a feature map from it; the network branches may be used to calculate the loss functions of the network, which relate to the confidence, the position, and the attention coefficients respectively. These loss functions guide the updating of network parameters such as the attention coefficients, so that the neural network detection model can predict the position and size of the target object more accurately from its unoccluded part and thus achieve better prediction performance.
In one implementation, before generating the feature map from the point cloud data, the neural network detection model may voxelize the three-dimensional space of the target object and determine the point cloud feature of each voxel based on the point cloud density within that voxel. This process converts the point cloud data of the target object into a form the neural network can accept.
For example, the neural network detection model may first perform three-dimensional grid division on the point cloud of the target object along the x, y and z directions at a certain resolution to obtain a number of three-dimensional space voxels, and then determine the point cloud feature corresponding to each voxel based on the point cloud density within it. For a voxel containing points, the point cloud density P within the voxel is calculated and the point cloud feature of that position is set to P; for a voxel containing no points, the point cloud feature is set to 0.
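As an illustration of this voxelization step, below is a minimal NumPy sketch. The grid ranges, the 0.2 m resolution, and the choice of normalizing per-voxel counts by voxel volume to get the density P are assumptions for the example; the patent does not fix the exact density definition.

```python
import numpy as np

def voxelize_density(points, x_range, y_range, z_range, resolution):
    """Grid-divide a point cloud along x, y, z and use per-voxel point
    density as the feature; voxels with no points keep feature 0."""
    origin = np.array([x_range[0], y_range[0], z_range[0]])
    shape = (
        int((x_range[1] - x_range[0]) / resolution),
        int((y_range[1] - y_range[0]) / resolution),
        int((z_range[1] - z_range[0]) / resolution),
    )
    grid = np.zeros(shape, dtype=np.float32)

    # Map each point to its voxel index and count the points per voxel.
    idx = ((points - origin) / resolution).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < shape), axis=1)
    for i, j, k in idx[in_bounds]:
        grid[i, j, k] += 1.0

    # Normalize counts by voxel volume to obtain a density value P.
    return grid / resolution ** 3

# Example: 10,000 random points in a 40 m x 40 m x 4 m region, 0.2 m voxels.
pts = np.random.rand(10000, 3) * [40.0, 40.0, 4.0]
voxels = voxelize_density(pts, (0.0, 40.0), (0.0, 40.0), (0.0, 4.0), 0.2)
```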
In one implementation, the neural network detection model may extract the point cloud features through a backbone network and generate the first feature map. The backbone network may have any network structure, and the size of the first feature map may be H×W; the embodiments of the present application do not limit the specific values of H and W.
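Since the backbone is left open, the following PyTorch sketch shows one plausible choice, assuming the z slices of the voxel grid are treated as input channels of 2D convolutions that produce an H×W first feature map; the layer widths are illustrative, not taken from the patent.

```python
import torch
from torch import nn

class Backbone(nn.Module):
    """Illustrative backbone: treats the z slices of the voxel grid as
    channels and applies 2D convolutions to yield an H x W feature map."""
    def __init__(self, in_channels, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, voxel_grid):
        # voxel_grid: (batch, nz, H, W); returns (batch, out_channels, H, W).
        return self.net(voxel_grid)
```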
In one implementation, the data input to the neural network detection model is not limited to the point cloud data of the target object, but may be image information of the target object, such as RGB image information of the target object.
S102, processing the first feature map by using the attention branch neural network, and acquiring a second feature map, wherein the second feature map is also used for acquiring a loss function of the target object, and the loss function is used for updating network coefficients of the attention branch neural network.
In one implementation, the predicted attention coefficient may be generated for the candidate box of the first feature map by an attention branching neural network. The positions of the first feature map may refer to candidate frames obtained by dividing the first feature map, and the size of the candidate frames may be flexibly set according to needs, which is not limited in the present application.
In one implementation, the attention branch neural network may generate the corresponding predicted attention coefficients at the various positions of the first feature map in a variety of ways, for example through convolution operations, fully-connected operations, and convolution variants (e.g., SPAS, STAR, SGRS, SPARSE, etc.).
In one implementation, the value of the predicted attention coefficient initially generated on the first feature map may be a preset default value; alternatively, the value of the initially generated predicted attention coefficient is an empirical value.
Optionally, before training the object detection neural network model, a sample library may first be established. The sample library contains sample feature maps of the target object; a sample feature map may be divided into truth frames, whose size may be the same as that of the candidate frames in the second feature map.
Alternatively, true-value attention coefficients may be labeled in advance for each portion of the target object in the sample feature map. For example, for a portion where the target object is occluded, the true-value attention coefficient may be pre-labeled as a negative number or a positive number less than 1; for the portion that is not occluded (i.e., the visible portion), the true-value attention coefficient may be pre-labeled as a positive number greater than 1.
In one implementation, the base-e exponential may be applied to the true-value attention coefficients in the sample feature map, so that the information of the unoccluded portion is more prominent.
It should be appreciated that the object detection model training process provided in the embodiments of the present application aims to make the attention coefficients of the visible portion of the target object higher, or much higher, than those of the occluded portion. A higher attention coefficient means that the corresponding point cloud is a key point cloud for predicting the target object, and that it receives more attention and is used to a greater degree when the position or size of the target object is subsequently predicted.
In one implementation, the second feature map may be obtained from the first feature map together with the generated predicted attention coefficients. For example, the generated predicted attention coefficients may be dot-multiplied (element-wise multiplied) with the first feature map to obtain the second feature map. In other words, the second feature map is obtained after generating the corresponding predicted attention coefficient for each candidate frame based on the first feature map; it may also be understood as an attention feature map.
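A minimal PyTorch sketch of this step follows, assuming the attention branch is realized as a small convolutional head (one of the generation options mentioned above) that predicts one coefficient per feature-map position; the layer sizes are illustrative.

```python
import torch
from torch import nn

class AttentionBranch(nn.Module):
    """Sketch of the attention branch: predicts one attention coefficient
    per position and dot-multiplies it with the first feature map to
    produce the second (attention) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),  # one coefficient per position
        )

    def forward(self, first_feature_map):
        attn = self.head(first_feature_map)            # (B, 1, H, W)
        second_feature_map = first_feature_map * attn  # element-wise (dot) product
        return second_feature_map, attn
```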
It will be appreciated that since the predicted attention coefficients in the candidate frames of the second feature map are initially default or empirical values, there is no guarantee that the predicted attention coefficients of the visible portion of the target object are higher, or much higher, than those of the occluded portion. In this case, the true-value attention coefficients in the sample feature map are needed as a reference for correcting and updating the predicted attention coefficients in the second feature map, so that the confidence between the predicted and true-value attention coefficients reaches the first threshold.
In one implementation, the object detection neural network model may correct and update the predicted attention coefficients through an attention loss function. Specifically, the predicted attention coefficient in a candidate frame of the second feature map is compared with the true-value attention coefficient in the truth frame of the sample feature map corresponding to that candidate frame; when the confidence between the predicted attention coefficient of the candidate frame and the true-value attention coefficient of the truth frame is higher than the first threshold, the result of the attention loss function is determined from the predicted and true-value attention coefficients; after the result of the attention loss function is calculated, the attention branch neural network coefficients are updated according to that result, so that the confidence of the attention branch neural network coefficients is higher than the second threshold. The attention loss function is Σ_{n=0}^{k} [L_a(m_k, t_k)], where k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true-value attention coefficient.
In one implementation, after the results of the attention loss function are calculated, the network coefficients of the attention-branching neural network may be corrected and updated using the back-propagation algorithm.
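The following PyTorch sketch shows one way to realize this loss and the back-propagation update. It assumes torch's built-in smooth L1 loss with sum reduction matches Σ_{n=0}^{k} [L_a(m_k, t_k)] over the coefficients inside a candidate frame; the 1×1-convolution head, tensor shapes, and optimizer are stand-ins for the example.

```python
import torch
import torch.nn.functional as F

def attention_loss(pred_attn, true_attn):
    """Sum of smooth L1 terms over the k feature points in a candidate
    frame: sum over k of L_a(m_k, t_k)."""
    return F.smooth_l1_loss(pred_attn, true_attn, reduction="sum")

# Hypothetical training step: a minimal 1x1-conv attention head stands in
# for the attention branch.
head = torch.nn.Conv2d(64, 1, kernel_size=1)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3)

first_map = torch.randn(1, 64, 32, 32)   # stand-in first feature map
true_attn = torch.rand(1, 1, 32, 32)     # stand-in true-value coefficients
pred_attn = head(first_map)

loss = attention_loss(pred_attn, true_attn)
loss.backward()          # back-propagation corrects the branch coefficients
optimizer.step()
optimizer.zero_grad()
```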
It will be appreciated that by correcting and updating the predicted attention coefficient in the second feature map with the result of the attention loss function, the predicted attention coefficient can be made closer to the value of the true value attention coefficient.
In one implementation, when the predicted attention coefficients of the second feature map are updated, the base-e exponential may be applied to the updated coefficients, so that the attention coefficients of the visible portion differ more markedly from those of the occluded portion and the information of the visible portion is highlighted.
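As a small numerical illustration (the coefficient values are hypothetical), exponentiation pushes coefficients below 1, including negative ones, toward small positive values and amplifies coefficients above 1, widening the gap between occluded and visible positions:

```python
import torch

# Hypothetical updated coefficients: occluded positions first, visible last.
attn = torch.tensor([-0.5, 0.2, 1.5, 2.0])
print(torch.exp(attn))  # tensor([0.6065, 1.2214, 4.4817, 7.3891])
```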
It should be understood that, after correction and updating of the predicted attention coefficient in the second feature map, the attention coefficient corresponding to the visible portion of the target object in the second feature map is higher, so that the object detection neural network model is more sensitive to the information of the visible portion, and the position and the size of the whole target object are predicted by using the information of the visible portion to a greater extent.
It should be further understood that after the predicted attention coefficients of the attention branch neural network have been corrected and updated, the attention branch generates high predicted attention coefficients for the visible portion of the target object and lower ones for the occluded portion in subsequent detection. In other words, after training the attention branch is more sensitive to the information of the visible portion of the target object, so the object detection model focuses more on the visible-portion information during actual prediction, which improves the detection of occluded target objects.
S103, obtaining a prediction result according to the second feature map, wherein the prediction result comprises the position information of the target object.
In one implementation, the neural network object detection model may obtain a prediction result according to the corrected or updated second feature map of the predicted attention coefficient, where the prediction result may include location information of the target object, and may also include information such as a size of the target object.
In one implementation, for a neural network object detection model trained through the above process, in actual prediction the point cloud data or image information of a target object may be input into the detection model; the model can pick out the data or information belonging to the unoccluded part and predict the position, size and other information of the whole occluded target object from that unoccluded-part information.
In one implementation, the three-dimensional positions and confidences corresponding to the candidate frames of the detected object can be obtained through the neural network object detection model. After the candidate frames are sorted by confidence, a certain number of candidate frames can be selected in order of decreasing confidence, where the confidence of each selected frame is higher than a third threshold; the position and size of the target object are then predicted from the selected high-confidence candidate frames. The process of predicting the size or position of the whole object from the point cloud data or image information of key parts of the target object through a neural network deep learning algorithm can follow existing procedures and is not described here.
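Below is a minimal sketch of this screening step, assuming candidate boxes and their confidences are already available as tensors; the threshold value and the optional top-n cap are illustrative parameters.

```python
import torch

def select_boxes(boxes, scores, conf_threshold, top_n=None):
    """Sort candidate frames by confidence and keep those above the
    'third threshold', optionally capped at the top_n best."""
    order = torch.argsort(scores, descending=True)
    boxes, scores = boxes[order], scores[order]
    keep = scores > conf_threshold
    boxes, scores = boxes[keep], scores[keep]
    if top_n is not None:
        boxes, scores = boxes[:top_n], scores[:top_n]
    return boxes, scores

# Example with random stand-in detections (7 boxes, 6 box parameters each).
boxes = torch.randn(7, 6)
scores = torch.rand(7)
kept_boxes, kept_scores = select_boxes(boxes, scores, conf_threshold=0.5, top_n=3)
```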
It should be understood that the artificial-neural-network-based object detection method provided in the embodiments of the present application may be applied, in the automatic driving field, to scenarios where an unmanned vehicle or unmanned aerial vehicle predicts the position, size and other information of obstacles in the surrounding environment. In the conventional obstacle detection process, after a feature map is generated from the obstacle information, it is fed directly into the branch networks that compute the position and confidence loss functions; in that case the occluded and visible parts of the obstacle receive essentially the same attention. However, because the obstacle is partially occluded, part of the information valuable for predicting its position, confidence and the like is lost, which can lead to poor detection. The object detection method based on the artificial neural network provided by the embodiments of the present application adds an attention network branch to the artificial neural network and trains it to accurately identify the visible parts of obstacles; the object detection model then uses the key information of the visible parts to predict the position, size, confidence and other information of the obstacles, so that the unmanned vehicle or aerial vehicle can accurately know the distribution, sizes, etc. of obstacles in the surrounding environment and plan an accurate driving trajectory.
In addition, the object detection method based on the artificial neural network still uses only the lidar for detection, without fusing other sensors, which reduces hardware cost.
Fig. 3 shows a schematic flowchart of an object detection method based on an artificial neural network according to an embodiment of the present application. The process includes the following steps.
S201, inputting a point cloud.
It should be understood that the object detection method provided in the embodiments of the present application may also take an image of the target object as input, such as an RGB image.
S202, three-dimensional gridding.
Three-dimensional gridding refers to performing grid division on the point cloud of the target object, i.e., voxelizing the three-dimensional space. Specifically, the spatial point cloud can be grid-divided at a certain resolution along the three spatial coordinate directions x, y and z to obtain voxels of the three-dimensional space.
In one implementation, the point cloud features are determined from the point cloud density in each voxel. For a voxel containing points, the density of the point cloud within the voxel (denoted P) is calculated and the point cloud feature of that position is set to P; for a voxel containing no points, the point cloud feature is set to 0. This process converts the point cloud data of the target object into a form the neural network can accept.
S203, acquiring a first feature map through a backbone network.
The first feature map is a feature map of a target object point cloud.
In one implementation, the backbone network may have any network structure, and the size of the first feature map may be H×W; the embodiments of the present application do not limit the specific values of H and W.
In one implementation, the data input to the neural network detection model is not limited to the point cloud data of the target object, but may be image information of the target object, such as RGB image information of the target object.
S204, attention coefficient related operation.
Wherein the operation of correlating the attention coefficients in the flow may include generating predicted attention coefficients corresponding to respective positions in the first feature map based on the first feature map. The positions of the first feature map may refer to candidate frames obtained by dividing the first feature map, and the size of the candidate frames may be flexibly set according to needs, which is not limited in the present application.
In one implementation, the attention branch neural network may generate the corresponding predicted attention coefficients at the various positions of the first feature map in a variety of ways, for example through convolution operations, fully-connected operations, and convolution variants (e.g., SPAS, STAR, SGRS, SPARSE, etc.).
S205, an attention coefficient is obtained.
S206, obtaining a second characteristic diagram.
The second feature map is obtained after generating the corresponding predicted attention coefficient for each candidate frame based on the first feature map; it may also be understood as an attention feature map.
It will be appreciated that since the predicted attention coefficients in the candidate frames of the second feature map are initially default or empirical values, there is no guarantee that the predicted attention coefficients of the visible portion of the target object are higher, or much higher, than those of the occluded portion. In this case, the true-value attention coefficients in the sample feature map are needed as a reference for correcting and updating the predicted attention coefficients in the second feature map, so that the confidence between the predicted and true-value attention coefficients reaches the first threshold.
In one implementation, the object detection neural network model may correct and update the predicted attention coefficients through an attention loss function. Specifically, the predicted attention coefficient in a candidate frame of the second feature map is compared with the true-value attention coefficient in the truth frame of the sample feature map corresponding to that candidate frame; when the confidence between the predicted attention coefficient of the candidate frame and the true-value attention coefficient of the truth frame is higher than the first threshold, the result of the attention loss function is determined from the predicted and true-value attention coefficients; after the result of the attention loss function is calculated, the attention branch neural network coefficients are updated according to that result, so that the confidence of the attention branch neural network coefficients is higher than the second threshold. The attention loss function is Σ_{n=0}^{k} [L_a(m_k, t_k)], where k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true-value attention coefficient.
In one implementation, after the results of the attention loss function are calculated, the network coefficients of the attention-branching neural network may be corrected and updated using the back-propagation algorithm.
It will be appreciated that by correcting and updating the predicted attention coefficient in the second feature map with the result of the attention loss function, the predicted attention coefficient can be made closer to the value of the true value attention coefficient.
In one implementation, when the predicted attention coefficients of the second feature map are updated, the base-e exponential may be applied to the updated coefficients, so that the attention coefficients of the visible portion differ more markedly from those of the occluded portion and the information of the visible portion is highlighted.
It should be understood that, after correction and updating of the predicted attention coefficient in the second feature map, the attention coefficient corresponding to the visible portion of the target object in the second feature map is higher, so that the object detection neural network model is more sensitive to the information of the visible portion, and the position and the size of the whole target object are predicted by using the information of the visible portion to a greater extent.
It should be further understood that after the predicted attention coefficients of the attention branch neural network have been corrected and updated, the attention branch generates high predicted attention coefficients for the visible portion of the target object and lower ones for the occluded portion in subsequent detection. In other words, after training the attention branch is more sensitive to the information of the visible portion of the target object, so the object detection model focuses more on the visible-portion information during actual prediction, which improves the detection of occluded target objects.
S207, obtaining a prediction result.
In one implementation, the neural network object detection model may obtain a prediction result according to the corrected or updated second feature map of the predicted attention coefficient, where the prediction result may include location information of the target object, and may also include information such as a size of the target object.
S208, confidence sequencing and threshold screening.
In one implementation, the three-dimensional positions and confidences corresponding to the candidate frames of the detected object can be obtained through the neural network object detection model. After the candidate frames are sorted by confidence, a certain number of candidate frames can be selected in order of decreasing confidence; the position and size of the target object are then predicted from the selected high-confidence candidate frames. The process of predicting the size or position of the whole object from the point cloud data or image information of key parts of the target object through a neural network deep learning algorithm can follow existing procedures and is not described here.
S209, obtaining a final prediction result of the target object.
The object detection method based on the artificial neural network provided by the embodiments of the present application may be applied, in the automatic driving field, to scenarios where an unmanned vehicle or unmanned aerial vehicle predicts the position, size and other information of obstacles in the surrounding environment. In the conventional obstacle detection process, after a feature map is generated from the obstacle information, it is fed directly into the branch networks that compute the position and confidence loss functions, so the occluded and visible parts of the obstacle receive essentially the same attention; because the obstacle is partially occluded, part of the information valuable for predicting its position, confidence and the like is lost, and the detection effect is poor. The method provided by the embodiments of the present application adds an attention network branch to the artificial neural network and trains it to accurately identify the visible part of an obstacle; the object detection model then uses the key information of the visible part to predict the position, size, confidence and other information of the obstacle, so that the unmanned vehicle or aerial vehicle can accurately know the distribution, sizes, etc. of obstacles in the surrounding environment and plan an accurate driving trajectory.
Fig. 4 is a schematic diagram of a system for object detection based on an artificial neural network according to an embodiment of the present application. The system 300 includes at least one lidar 310 and a processor 320. The system 300 may be a distributed perception processing system disposed on an autonomous vehicle, for example, at least one lidar 310 may be disposed on the roof of the vehicle and be a rotating lidar; lidar 310 may also be located elsewhere on the autonomous vehicle or use other forms of lidar. The processor 320 may be a super computing platform provided on the autonomous vehicle, i.e. the processor 320 may comprise one or more processing units in the form of CPU, GPU, FPGA or ASIC or the like for processing the sensor data acquired by the sensors of the autonomous vehicle.
In one implementation, lidar 310 is used to acquire a three-dimensional point cloud.
In one implementation, the processor 320 is configured to acquire a first feature map of the three-dimensional point cloud using a backbone neural network.
In one implementation, the processor 320 is further configured to process the first feature map using the attention branching neural network, and obtain a second feature map, where each position of the second feature map includes a predicted attention coefficient corresponding to the position, and the second feature map is further configured to obtain a loss function of the target object, where the loss function is configured to update the predicted attention coefficient.
In one implementation, the processor 320 is further configured to obtain a prediction result according to the second feature map, where the prediction result includes location information of the target object.
In one implementation, the processor 320 is further configured to divide the candidate box for the first feature map.
In one implementation, the processor 320 is further configured to generate, by the attention branching neural network, a predicted attention coefficient for the candidate box of the first feature map.
In one implementation, the processor 320 is further configured to dot multiply the predicted attention coefficient with the first feature map to obtain the second feature map.
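A rough sketch of this attention branch, assuming a PyTorch-style implementation in which a 1x1 convolution predicts one attention coefficient per feature-map position (the layer shape and the placement of the exponential operation are assumptions, not details given in the present application):

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """Hypothetical attention branch: a 1x1 convolution predicts one
    attention coefficient per position of the first feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, first_feature_map):
        attn = self.conv(first_feature_map)            # predicted attention coefficients
        attn = torch.exp(attn)                         # e-exponent keeps coefficients positive
        second_feature_map = first_feature_map * attn  # dot (elementwise) multiplication
        return second_feature_map, attn
```

For example, AttentionBranch(64) applied to a (1, 64, 200, 200) feature tensor returns a second feature map of the same shape together with a (1, 1, 200, 200) attention map.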
In one implementation, the processor 320 is further configured to compare the predicted attention coefficient of the candidate box of the second feature map with the attention coefficient of the truth box in the sample feature map corresponding to the candidate box.
In one implementation, the processor 320 is further configured to determine a result of the attention loss function according to the predicted attention coefficient and the true attention coefficient when a confidence level of the predicted attention coefficient of the candidate box and the true attention coefficient of the true box is higher than a first threshold.
In one implementation, the processor 320 is further configured to update the predicted attention coefficient according to the result of the attention loss function such that the confidence level of the predicted attention coefficient is higher than the second threshold.
In one implementation, when the processor 320 is configured to update the predicted attention coefficient of the second feature map, the processor performs an exponential operation with base e (the natural constant) on the updated predicted attention coefficient.
In one implementation, the processor 320 is further configured to update the predicted attention coefficient by a back propagation algorithm according to a result of the attention loss function.
In one implementation, the attention loss function is L_att = Σ_k L_a(m_k, t_k), where k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true-value attention coefficient.
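A minimal sketch of this loss and of the back-propagation update mentioned above, assuming PyTorch and mean reduction over the k feature points (the reduction mode is an assumption, since the formula image is not reproduced in the text):

```python
import torch
import torch.nn.functional as F

def attention_loss(m, t):
    """Smooth L1 loss L_a between predicted attention coefficients m_k
    and true-value coefficients t_k of a candidate frame."""
    return F.smooth_l1_loss(m, t, reduction="mean")

# Hypothetical update step: the loss result drives backpropagation,
# which in turn updates the predicted attention coefficients.
m = torch.rand(128, requires_grad=True)  # predicted coefficients m_k
t = torch.rand(128)                      # true-value coefficients t_k
loss = attention_loss(m, t)
loss.backward()                          # gradients flow back to the attention branch
```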
In one implementation, the processor 320 is further configured to acquire three-dimensional point cloud data of the occluded target object.
In one implementation, the processor 320 is further configured to perform three-dimensional grid partitioning on the three-dimensional point cloud data, and obtain a plurality of three-dimensional spatial voxels.
In one implementation, the processor 320 is further configured to obtain a point cloud characteristic of each voxel according to the point cloud density in each voxel.
In one implementation, the processor 320 is further configured to extract point cloud features using a backbone neural network and generate a first feature map.
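The voxelization described in the preceding steps can be sketched as follows; the voxel size, grid shape, and the use of a normalized point count as the per-voxel density feature are all assumptions made for illustration:

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), grid_shape=(400, 400, 20)):
    """points: (N, 3) array of x, y, z coordinates, assumed already shifted
    so the grid origin is at (0, 0, 0). Returns one density feature per voxel."""
    idx = np.floor(points / np.asarray(voxel_size)).astype(int)
    in_range = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[in_range]                       # discard points outside the grid
    density = np.zeros(grid_shape, dtype=np.float32)
    np.add.at(density, tuple(idx.T), 1.0)     # count points per voxel
    return density / max(density.max(), 1.0)  # normalized point-cloud density
```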
In one implementation, the processor 320 is configured to generate, by the attention branching neural network, a predicted attention coefficient for a candidate box of the first feature map, including: the attention branching neural network generates the predicted attention coefficients by one or more of a convolution operation, a full connection (fully-connected operation), and variants of the convolution operation.
In one implementation, the processor 320 is configured to perform object detection on the target object through the artificial neural network, and obtain the three-dimensional position and the confidence of the feature map candidate frame corresponding to the visible portion of the target object.
In one implementation, the processor 320 is further configured to rank the confidence levels, and select candidate boxes with confidence levels higher than a third threshold.
In one implementation, the processor 320 is further configured to predict the information of the target object according to the candidate frame with the confidence level higher than the third threshold.
In one implementation, the information of the target object includes a position and/or a size of the target object.
In one implementation, the system 300 provided in the embodiment of the present application may further include a display, where the display is configured to display the prediction result of the target object predicted according to the second feature map.
It should be appreciated that the artificial-neural-network-based object detection system provided in the embodiments of the present application may be applied to the field of automatic driving of unmanned apparatuses such as unmanned aerial vehicles or unmanned vehicles, for detecting obstacles (such as other vehicles, pedestrians, etc.) in the surrounding environment of the unmanned movable apparatus, where the obstacle (i.e., the target object) may be partially occluded. With the system provided by the embodiments of the present application, an object detection model that can obtain the position, size and other information of a partially occluded target object from the information of its visible part can be trained through deep learning. During training, the attention mechanism causes the object detection model to give more weight to the information of the visible part of the target object, i.e., to be more sensitive to the visible part, so that in the subsequent prediction process the model can acquire the information of the target object more accurately from its visible part.
Fig. 5 shows a schematic diagram of a system for object detection based on an artificial neural network according to an embodiment of the present application. The system 400 includes at least one receiving module 410 and a processing module 420.
In one implementation, the receiving module 410 is configured to obtain a three-dimensional point cloud.
In one implementation, the processing module 420 is configured to acquire a first feature map of the three-dimensional point cloud using a backbone neural network.
In one implementation, the processing module 420 is further configured to process the first feature map using the attention branching neural network, and obtain a second feature map, where each position of the second feature map includes a predicted attention coefficient corresponding to the position, and the second feature map is further configured to obtain a loss function of the target object, where the loss function is used to update the predicted attention coefficient.
In one implementation, the processing module 420 is further configured to obtain a prediction result according to the second feature map, where the prediction result includes location information of the target object.
In one implementation, the processing module 420 is further configured to divide the candidate box for the first feature map.
In one implementation, the processing module 420 is further configured to generate, by the attention branching neural network, a predicted attention coefficient for the candidate box of the first feature map.
In one implementation, the processing module 420 is further configured to perform point multiplication on the predicted attention coefficient and the first feature map to obtain a second feature map.
In one implementation, the processing module 420 is further configured to compare the predicted attention coefficient of the candidate box of the second feature map with the attention coefficient of the truth box in the sample feature map corresponding to the candidate box.
In one implementation, the processing module 420 is further configured to determine a result of the attention loss function according to the predicted attention coefficient and the true attention coefficient when a confidence level of the predicted attention coefficient of the candidate box and the true attention coefficient of the true box is higher than a first threshold.
In one implementation, the processing module 420 is further configured to update the predicted attention coefficient according to a result of the attention loss function such that the confidence level of the predicted attention coefficient is higher than the second threshold.
In one implementation, when the processing module 420 is configured to update the predicted attention coefficient of the second feature map, the processing module performs an exponential operation with base e (the natural constant) on the updated predicted attention coefficient.
In one implementation, the processing module 420 is further configured to update the predicted attention coefficient by a back propagation algorithm according to a result of the attention loss function.
In one implementation, the attention loss function is L_att = Σ_k L_a(m_k, t_k), where k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true-value attention coefficient.
In one implementation, the processing module 420 is further configured to obtain three-dimensional point cloud data of the occluded target object.
In one implementation, the processing module 420 is further configured to perform three-dimensional grid partitioning on the three-dimensional point cloud data, and obtain a plurality of three-dimensional spatial voxels.
In one implementation, the processing module 420 is further configured to obtain a point cloud characteristic of each voxel according to the point cloud density in each voxel.
In one implementation, the processing module 420 is further configured to extract the point cloud feature using the backbone neural network and generate the first feature map.
In one implementation, the processing module 420, configured to generate, by the attention branching neural network, a predicted attention coefficient for a candidate box of the first feature map, includes: the attention branching neural network generates the predicted attention coefficients by one or more of a convolution operation, a full connection (fully-connected operation), and variants of the convolution operation.
In one implementation, the processing module 420 is configured to perform object detection on the target object through the artificial neural network, and obtain the three-dimensional position and the confidence of the feature map candidate frame corresponding to the visible portion of the target object.
In one implementation, the processing module 420 is further configured to rank the confidence degrees, and select a candidate box with a confidence degree higher than the third threshold.
In one implementation, the processing module 420 is further configured to predict the information of the target object according to the candidate frame with the confidence level higher than the third threshold.
In one implementation, the information of the target object includes a position and/or a size of the target object.
In one implementation, the system for object detection model provided in the embodiments of the present application may further include a display, where the display is configured to display the prediction result obtained according to the second feature map.
An embodiment of the invention further provides a chip system comprising at least one processor; when program instructions are executed by the at least one processor, the method provided by the embodiments of the present application is implemented.
The embodiment of the invention also provides a computer storage medium, on which a computer program is stored, which when executed by a computer causes the computer to perform the method of the above-described method embodiment.
The present invention also provides a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiment described above.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (24)

1. A system for object detection based on an artificial neural network, comprising at least one processor and a lidar, wherein,
The laser radar is used for acquiring a three-dimensional point cloud of a target object and inputting the three-dimensional point cloud into the processor;
the processor is used for carrying out three-dimensional grid division on the three-dimensional point cloud to obtain a plurality of voxels; according to the point cloud density in each voxel, determining the point cloud characteristics of the corresponding position of the voxel, extracting the point cloud characteristics through a backbone network of an object detection model, and generating a first characteristic map of the target object; generating a predicted attention coefficient in the first feature map by an attention branch neural network of the object detection model; calculating the result of the attention loss function according to the true attention coefficient and the predicted attention coefficient in the sample feature diagram by using the loss function branch neural network; updating the predicted attention coefficient according to the result of the attention loss function, so that the predicted attention coefficient generated by the feature map part corresponding to the visible part of the target object in the second feature map is higher than the predicted attention coefficient of the feature map part of the blocked part of the target object; and obtaining a prediction result according to the visible part information of the target object, wherein the prediction result comprises the position information of the target object.
2. The system of claim 1, wherein the processor is further configured to partition a candidate box for the first feature map; generating a predicted attention coefficient for a candidate frame of the first feature map through the attention branch neural network; and carrying out dot multiplication on the predicted attention coefficient and the first feature map to obtain the second feature map.
3. The system of claim 2, wherein the processor is further configured to compare the predicted attention coefficient of the candidate box of the second feature map to the attention coefficient of the truth box in the sample feature map to which the candidate box corresponds; determining a result of an attention loss function according to the predicted attention coefficient and the true value attention coefficient when the confidence level of the predicted attention coefficient of the candidate frame and the true value attention coefficient of the true value frame is higher than a first threshold; and updating the predicted attention coefficient according to the result of the attention loss function, so that the confidence coefficient of the predicted attention coefficient and the true value attention coefficient is higher than a second threshold.
4. A system according to any one of claims 1-3, wherein when the processor is configured to update the predicted attention coefficient of the second feature map, the processor is configured to perform an exponential operation with base e (the natural constant) on the updated predicted attention coefficient.
5. The system of claim 3 or 4, wherein the processor is further configured to update the predicted attention coefficient by a back propagation algorithm based on a result of the attention loss function.
6. The system of any one of claims 3-5, wherein the attention loss function is L_att = Σ_k L_a(m_k, t_k), wherein k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true value of the attention coefficient.
7. The system of any of claims 1-6, wherein the processor is further configured to acquire three-dimensional point cloud data of the occluded target object;
the processor is further used for carrying out three-dimensional grid division on the three-dimensional point cloud data and obtaining a plurality of three-dimensional space voxels;
the processor is further configured to obtain a point cloud characteristic of each voxel according to a point cloud density in the voxel;
the processor is further configured to extract the point cloud feature using the backbone neural network, and generate the first feature map.
8. The system of any of claims 2-7, wherein the processor configured to generate, by the attention branching neural network, predicted attention coefficients for candidate boxes of the first feature map comprises:
The attention branching neural network generates the predicted attention coefficients by one or more of a convolution operation, a full join, and a variation of the convolution operation.
9. The system according to any one of claims 1-8, wherein the processor is configured to perform object detection on the target object through the artificial neural network, and obtain a three-dimensional position and a confidence level of a feature map candidate frame corresponding to a visible portion of the target object;
the processor is further configured to sort the confidence degrees, and select candidate frames with confidence degrees higher than a third threshold value;
the processor is further configured to predict information of the target object according to the candidate frame with the confidence level higher than a third threshold.
10. The system of claim 9, wherein the information of the target object includes a position and/or a size of the target object.
11. The system according to any one of claims 1-10, characterized in that the system comprises a display for displaying the prediction result obtained from the second feature map.
12. A method of object detection based on an artificial neural network, the method comprising:
Acquiring a three-dimensional point cloud, and performing three-dimensional grid division on the three-dimensional point cloud to obtain a plurality of voxels; according to the point cloud density in each voxel, determining the point cloud characteristics of the corresponding position of the voxel, extracting the point cloud characteristics by using a main neural network of an object detection model, and acquiring a first characteristic map of the three-dimensional point cloud;
processing the first feature map by using an attention branch neural network of the object detection model, and acquiring a second feature map, wherein each position of the second feature map comprises a predicted attention coefficient corresponding to the position, the second feature map is also used for acquiring a loss function of a target object, and the result of the attention loss function is calculated by using the loss function branch neural network according to a true value attention coefficient and the predicted attention coefficient in a sample feature map; updating the predicted attention coefficient according to the result of the loss function, so that the predicted attention coefficient generated by the feature map part corresponding to the visible part of the target object in the second feature map is higher than the predicted attention coefficient of the feature map part of the blocked part of the target object, and the loss function is used for updating the predicted attention coefficient;
And obtaining a prediction result according to the second feature map, wherein the prediction result comprises the position information of the target object.
13. The method of claim 12, wherein processing the first signature using an attention branching neural network and obtaining a second signature comprises:
dividing a candidate frame for the first feature map;
generating a predicted attention coefficient for a candidate frame of the first feature map through the attention branch neural network;
and carrying out dot multiplication on the predicted attention coefficient and the first feature map to obtain the second feature map.
14. The method of claim 13, wherein the method further comprises:
comparing the predicted attention coefficient of the candidate frame of the second feature map with the attention coefficient of the truth frame in the sample feature map corresponding to the candidate frame;
determining a result of an attention loss function according to the predicted attention coefficient and the true value attention coefficient when the confidence level of the predicted attention coefficient of the candidate frame and the true value attention coefficient of the true value frame is higher than a first threshold;
and updating the predicted attention coefficient according to the result of the attention loss function, so that the confidence coefficient of the predicted attention coefficient and the true value attention coefficient is higher than a second threshold.
15. The method according to any one of claims 12-14, further comprising:
and after updating the predicted attention coefficient of the second feature map, performing an exponential operation with base e (the natural constant) on the updated predicted attention coefficient.
16. The method according to claim 14 or 15, wherein said updating the predicted attention coefficient according to the result of the attention loss function comprises:
updating the predicted attention coefficient by a back propagation algorithm according to the result of the attention loss function.
17. The method according to any one of claims 14-16, wherein the attention loss function is L_att = Σ_k L_a(m_k, t_k), wherein k is the number of feature points in the candidate frame, L_a is the smooth L1 loss function, m_k is the predicted attention coefficient, and t_k is the true value of the attention coefficient.
18. The method of any of claims 12-17, wherein the acquiring a three-dimensional point cloud and acquiring a first feature map of the three-dimensional point cloud using a backbone neural network comprises:
acquiring three-dimensional point cloud data of a shielded target object;
performing three-dimensional grid division on the three-dimensional point cloud data and obtaining a plurality of three-dimensional space voxels;
According to the point cloud density in each voxel, obtaining the point cloud characteristics of the voxels;
and extracting the point cloud features by using the backbone neural network, and generating the first feature map.
19. The method of any of claims 13-18, wherein generating, by the attention branching neural network, predicted attention coefficients for candidate boxes of the first feature map comprises:
the attention branching neural network generates the predicted attention coefficients by one or more of a convolution operation, a full join, and a variation of the convolution operation.
20. The method according to any one of claims 12-19, further comprising:
performing object detection on the target object through the artificial neural network, and obtaining the three-dimensional position and the confidence coefficient of the feature map candidate frame corresponding to the visible part of the target object;
sorting the confidence degrees, and selecting candidate frames with the confidence degrees higher than a third threshold value;
and predicting the information of the target object according to the candidate frames with the confidence coefficient higher than a third threshold value.
21. The method of claim 20, wherein the information of the target object includes a position and/or a size of the target object.
22. The method according to any one of claims 12-21, further comprising:
and displaying a prediction result obtained according to the second feature map.
23. The method according to any one of claims 12-22, wherein the method is adapted for the detection of a target object whose point cloud is partially occluded in an automatic driving scene, comprising:
establishing an object detection model based on an artificial neural network;
inputting the three-dimensional point cloud of the target object into the object detection model, and carrying out three-dimensional grid division on the three-dimensional point cloud through the object detection model to obtain a plurality of voxels;
determining the point cloud characteristics of the voxel positions according to the point cloud density in each voxel;
extracting the point cloud features through a backbone network of the object detection model, and generating a first feature map of the target object;
generating a predicted attention coefficient in the first feature map by an attention branch neural network of the object detection model;
calculating the result of the attention loss function according to the true attention coefficient and the predicted attention coefficient in the sample feature diagram by using the loss function branch neural network;
Updating the network coefficients of the attention branch neural network according to the result of the attention loss function, so that the predicted attention coefficients generated by the attention branch neural network in the feature map part corresponding to the visible part of the target object are higher than the predicted attention coefficients generated by the attention branch neural network in the feature map part of the occluded part of the target object;
and obtaining a prediction result according to the visible part information of the target object, wherein the prediction result comprises the position information of the target object.
24. A computer storage medium having program instructions which, when executed directly or indirectly, cause the method of any one of claims 12 to 23 to be carried out.
CN201980008366.4A 2019-10-30 2019-10-30 Object detection system and method based on artificial neural network Active CN111602138B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/114357 WO2021081808A1 (en) 2019-10-30 2019-10-30 Artificial neural network-based object detection system and method

Publications (2)

Publication Number Publication Date
CN111602138A CN111602138A (en) 2020-08-28
CN111602138B true CN111602138B (en) 2024-04-09

Family

ID=72186761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980008366.4A Active CN111602138B (en) 2019-10-30 2019-10-30 Object detection system and method based on artificial neural network

Country Status (2)

Country Link
CN (1) CN111602138B (en)
WO (1) WO2021081808A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931727A (en) * 2020-09-23 2020-11-13 深圳市商汤科技有限公司 Point cloud data labeling method and device, electronic equipment and storage medium
CN112232746B (en) * 2020-11-03 2023-08-22 金陵科技学院 Cold-chain logistics demand estimation method based on attention weighting
CN113298822B (en) * 2021-05-18 2023-04-18 中国科学院深圳先进技术研究院 Point cloud data selection method and device, equipment and storage medium
CN114005110B (en) * 2021-12-30 2022-05-17 智道网联科技(北京)有限公司 3D detection model training method and device, and 3D detection method and device
CN114119838B (en) * 2022-01-24 2022-07-22 阿里巴巴(中国)有限公司 Voxel model and image generation method, equipment and storage medium
CN114663879B (en) * 2022-02-09 2023-02-21 中国科学院自动化研究所 Target detection method and device, electronic equipment and storage medium
CN114723939B (en) * 2022-04-12 2023-10-31 国网四川省电力公司营销服务中心 Non-maximum suppression method, system, device and medium based on attention mechanism
CN114926819B (en) * 2022-05-31 2024-06-21 海南大学 Unknown abnormal obstacle recognition method and system for complex scene

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597087A (en) * 2018-11-15 2019-04-09 天津大学 A kind of 3D object detection method based on point cloud data
CN109932730A (en) * 2019-02-22 2019-06-25 东华大学 Laser radar object detection method based on multiple dimensioned monopole three dimensional detection network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020522002A (en) * 2017-06-02 2020-07-27 エスゼット ディージェイアイ テクノロジー カンパニー リミテッドSz Dji Technology Co.,Ltd Method and system for recognizing, tracking, and focusing a moving target
US10235566B2 (en) * 2017-07-21 2019-03-19 Skycatch, Inc. Determining stockpile volume based on digital aerial images and three-dimensional representations of a site
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110032949B (en) * 2019-03-22 2021-09-28 北京理工大学 Target detection and positioning method based on lightweight convolutional neural network

Also Published As

Publication number Publication date
WO2021081808A1 (en) 2021-05-06
CN111602138A (en) 2020-08-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240515

Address after: Building 3, Xunmei Science and Technology Plaza, No. 8 Keyuan Road, Science and Technology Park Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518057, 1634

Patentee after: Shenzhen Zhuoyu Technology Co.,Ltd.

Country or region after: China

Address before: 518057 Shenzhen Nanshan High-tech Zone, Shenzhen, Guangdong Province, 6/F, Shenzhen Industry, Education and Research Building, Hong Kong University of Science and Technology, No. 9 Yuexingdao, South District, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: SZ DJI TECHNOLOGY Co.,Ltd.

Country or region before: China