CN115376107A - Method and apparatus for target detection for smart driving - Google Patents


Info

Publication number
CN115376107A
Authority
CN
China
Prior art keywords
data
radar
camera
feature map
training
Prior art date
Legal status
Pending
Application number
CN202211078943.7A
Other languages
Chinese (zh)
Inventor
王磊
陈新元
吴子章
王凡
Current Assignee
Zongmu Technology Shanghai Co Ltd
Original Assignee
Zongmu Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Zongmu Technology Shanghai Co Ltd filed Critical Zongmu Technology Shanghai Co Ltd
Priority to CN202211078943.7A priority Critical patent/CN115376107A/en
Publication of CN115376107A publication Critical patent/CN115376107A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/58: Scenes; scene-specific elements; context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle; recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/44: Extraction of image or video features; local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection


Abstract

The present disclosure provides a method and apparatus for target detection for smart driving. A method for target detection comprises: preprocessing camera data and radar data; extracting features from the preprocessed camera data and radar data to obtain a camera feature map and a radar feature map; and inputting the camera feature map and the radar feature map into a neural network to detect one or more targets.

Description

Method and apparatus for target detection for smart driving
Technical Field
The present disclosure relates generally to intelligent driving, and more particularly to methods and apparatus for 3D target detection based on fusion of camera image data with radar data.
Background
In intelligent driving (e.g., autonomous driving, assisted driving), the accuracy and efficiency of target (e.g., obstacle) detection directly affect the driving performance of the vehicle. The most commonly used target detection approaches at present are camera image detection and radar detection.
Camera images carry rich semantic information, and mature target detection algorithms and models exist for them. However, camera-based detection is not accurate enough for distant targets, is strongly affected by factors such as weather (e.g., rain, snow, fog) and lens stains, and cannot directly predict the speed of a target.
Radar (e.g., millimeter-wave radar, laser radar) transmits a probe signal toward a target, compares the received signal reflected from the target with the transmitted signal, and processes it to obtain information about the target (e.g., distance, bearing, speed). Radar detection has the advantages of low cost, immunity to severe weather, strong long-range detection capability, and direct measurement of target speed and distance. However, millimeter-wave radar data is sparse.
Disclosure of Invention
In view of the above technical problems in the prior art, the present disclosure provides a method for target detection, including:
preprocessing camera data and radar data;
extracting features from the preprocessed camera data and radar data to obtain a camera feature map and a radar feature map; and
inputting the camera feature map and the radar feature map into a neural network to detect one or more targets.
Optionally, the method further comprises training the neural network using training sample data, the training sample data comprising a training camera feature map, a training radar feature map and tag data, the tag data comprising data relating to one or more real targets, the training comprising:
inputting the training camera feature map and the training radar feature map into the neural network to obtain one or more predicted targets;
matching the one or more predicted targets with one or more real targets in the tag data;
determining a loss value using the matched predicted target and real target; and
adjusting a parameter of the neural network based on the loss value.
Optionally, the loss value is a weighted sum of a position loss value and a category loss value, the position loss value is an L1 loss value, and the category loss value is a focal loss value.
Optionally, matching the one or more predicted targets with the one or more real targets comprises: matching the one or more predicted targets with the one or more real targets using the Hungarian algorithm.
Optionally, the camera data comprises multi-channel camera data and the radar data comprises multi-channel radar data, the method further comprising:
stitching the multi-channel camera data into overall camera data, and stitching the multi-channel radar data into overall radar data.
Optionally, preprocessing the camera data comprises: normalizing and padding the images of the camera data; and
wherein preprocessing the radar data comprises: converting the radar data to the same coordinate system as the camera data.
Optionally, extracting features from the preprocessed camera data and radar data comprises:
extracting features from the camera data using a residual network to obtain the camera feature map; and
extracting features from the radar data using a max pooling network to obtain the radar feature map.
Optionally, the neural network is a transformer network.
Another aspect of the present disclosure provides an apparatus for target detection, comprising:
a module for pre-processing camera data and radar data;
means for extracting features from the pre-processed camera data and radar data to obtain a camera feature map and a radar feature map; and
means for inputting the camera feature map and the radar feature map into a neural network to detect one or more targets.
Optionally, the apparatus further comprises means for training the neural network using training sample data, the training sample data comprising a training camera feature map, a training radar feature map and tag data, the tag data comprising data relating to one or more real targets, the means for training configured to:
inputting the training camera feature map and the training radar feature map into the neural network to obtain one or more predicted targets;
matching the one or more predicted targets with one or more real targets in the tag data;
determining a loss value using the matched predicted target and real target; and
adjusting a parameter of the neural network based on the loss value.
Yet another aspect of the disclosure provides an electronic device comprising a processor and a memory, the memory storing program instructions; the processor executes program instructions to implement the method for object detection as described above.
According to the technical solution of the present disclosure, the raw information from each sensor (camera and radar) is combined through pre-fusion (early fusion) to form complete spatial occupancy data, and the complementarity between the data can effectively reduce misjudgments caused by partial recognition. The present disclosure fuses camera data with radar data, expanding the data dimension into a higher-dimensional space that contains richer information. Further, the neural network for target detection is trained using position loss values and category loss values, improving the accuracy of target recognition. Experiments show that the technical solution of the present disclosure can greatly improve target recognition accuracy.
Drawings
Fig. 1 is a diagram of an apparatus for target detection for smart driving according to aspects of the present disclosure.
Fig. 2 is a visualization diagram of a radar image.
Fig. 3 is a diagram of a residual network.
Fig. 4 is a diagram of a multi-scale fusion network.
Fig. 5 illustrates a process of training a neural network in an object detection module, according to aspects of the present disclosure.
Fig. 6 is a flow diagram of a method for target detection for smart driving according to aspects of the present disclosure.
Fig. 7 is a diagram of an electronic device for object detection, in accordance with aspects of the present disclosure.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.
Fig. 1 is a diagram of an apparatus 100 for target detection for smart driving according to aspects of the present disclosure.
The apparatus 100 for target detection for smart driving includes a camera data pre-processing module 102, a radar data pre-processing module 104, a camera feature extraction module 106, a radar feature extraction module 108, a multi-scale fusion module 110, and a target detection module 112.
The input to the camera data pre-processing module 102 is image data obtained by a camera. The camera may be a camera mounted on the vehicle body. The image data may be data of an image frame captured by the camera, e.g., a pixel map of the image.
The camera data pre-processing module 102 may perform normalization and padding operations on the camera data. The camera data may be represented as a pixel map of the image.
The normalization operation may normalize each pixel value in the pixel map of the image. For example, if the pixel values of an image lie in the range [0, 255], they may be normalized to the range [0, 1]. Normalizing the image data helps the target detection model converge more quickly and improves the target detection speed.
In one example, the normalization may employ a maximum-minimum method. In particular, the normalized pixel values may be determined using the following equation:
x' = (x - x_min) / (x_max - x_min)
where x' is the normalized pixel value, x is the original pixel value, x_max is the maximum pixel value in the image, and x_min is the minimum pixel value in the image. According to the above formula, x' lies in the range [0, 1].
While the use of the maximum-minimum method for normalization is illustrated above, those skilled in the art will appreciate that other methods of normalization are also contemplated by the present disclosure.
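As a concrete illustration of the max-min normalization above, the following Python sketch scales an 8-bit camera frame into [0, 1]; the function name and image shape are illustrative and not taken from the patent:

```python
import numpy as np

def min_max_normalize(image: np.ndarray) -> np.ndarray:
    """Scale pixel values into [0, 1] using the max-min method."""
    x_min, x_max = image.min(), image.max()
    if x_max == x_min:          # guard against a constant image
        return np.zeros_like(image, dtype=np.float32)
    return (image.astype(np.float32) - x_min) / (x_max - x_min)

# Example: an 8-bit camera frame in [0, 255] becomes a float map in [0, 1].
frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)
normalized = min_max_normalize(frame)
```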
The padding operation may pad the pixel map so that the image data and the radar data have the same resolution for subsequent processing.
As examples, the padding may fill new pixels with the values of the surrounding pixels, with 0, or with the maximum, minimum, or average pixel value of the image.
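A minimal padding sketch under the assumption that the smaller map is padded on its bottom and right edges until the camera and radar resolutions match; the patent does not specify the exact padding layout, so the helper below is purely illustrative:

```python
import numpy as np

def pad_to_resolution(image: np.ndarray, target_h: int, target_w: int,
                      mode: str = "constant") -> np.ndarray:
    """Pad an H x W (x C) map on the bottom/right so camera and radar maps share a resolution."""
    h, w = image.shape[:2]
    pad_h, pad_w = max(0, target_h - h), max(0, target_w - w)
    pad_width = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (image.ndim - 2)
    # mode="constant" pads with 0; mode="edge" repeats the surrounding pixel values,
    # which corresponds to the "fill with neighboring pixel values" option above.
    return np.pad(image, pad_width, mode=mode)
```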
The radar data pre-processing module 104 may project the radar data into the corresponding camera image coordinate system.
The radar data may also be referred to as point cloud data, and may be data of a signal reflected from a target by a detection signal transmitted by a radar (e.g., millimeter wave radar, laser radar).
Preferably, the radar data of the present disclosure may be millimeter wave radar data, obtained via millimeter wave radar. The millimeter wave radar is low in cost, so that the cost can be reduced while the accuracy of target detection is ensured.
Each point in the point cloud data may include location information (e.g., three-dimensional coordinate information) of the target, reflected light intensity, target direction, speed (e.g., speed of the target relative to the host vehicle), and so forth. The radar data may be represented as a radar image. Fig. 2 is an example of a visualization diagram of a radar image, in which straight lines are plotted for a detected target. The radar data may also include data (position information, reflected light intensity, target direction, speed, etc., as described above) for each point on the plotted straight line.
For example, the radar data pre-processing module 104 may determine a transformation matrix from radar data to camera data based on the internal and external parameters of the radar and of the camera, and then use the transformation matrix to project the radar data into the corresponding camera image coordinate system.
The internal parameters may include focal lengths fx, fy, principal point coordinates x0, y0, coordinate axis tilt parameters, distortion (tangential distortion, radial distortion) parameters, and the like.
The external parameters may include parameters for converting points from a world coordinate system to a device (e.g., camera) coordinate system. For example, the external parameters may include a rotation matrix representing the orientation of the coordinate axes of the world coordinate system relative to the camera coordinate axes and a translation matrix representing the location of the spatial origin under the camera coordinate system.
Projecting the radar data into the corresponding camera image coordinate system allows target detection to be performed in a single coordinate system, which improves detection accuracy.
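The following sketch illustrates one common way to perform such a projection, assuming a 4 x 4 radar-to-camera extrinsic transform and a 3 x 3 intrinsic matrix, and ignoring lens distortion; the matrix and function names are illustrative rather than taken from the patent:

```python
import numpy as np

def project_radar_to_image(points_xyz: np.ndarray,
                           radar_to_cam: np.ndarray,
                           K: np.ndarray) -> np.ndarray:
    """Project N x 3 radar points into pixel coordinates.

    radar_to_cam: 4 x 4 extrinsic transform (rotation + translation) from radar to camera frame.
    K:            3 x 3 camera intrinsic matrix (fx, fy, x0, y0).
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])   # N x 4 homogeneous points
    cam = (radar_to_cam @ homo.T).T[:, :3]            # N x 3 in the camera frame
    cam = cam[cam[:, 2] > 0]                          # keep points in front of the camera
    uv = (K @ cam.T).T                                # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]                     # pixel coordinates (u, v)
```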
The input camera data of the camera data pre-processing module 102 and the input radar data of the radar data pre-processing module 104 may be paired camera data and radar data.
In one example, camera data and radar data obtained by one camera and one radar, respectively, that are installed in close proximity may be input to the camera data pre-processing module 102 and the radar data pre-processing module 104, respectively, to obtain a pair of camera data and radar data. In other words, a pair of camera data and radar data may represent camera images and radar information within the same range.
In another example, multiple cameras and multiple radars may be used to acquire the multiple camera data and the multiple radar data, respectively.
For example, 6 cameras (e.g., installed at the front left, rear left, front right, rear right, and directly at the front and rear of the vehicle) and 5 radars (e.g., installed at the left and right of the vehicle head, at the left and right of the vehicle tail, and directly at the front of the vehicle) may be mounted on the vehicle body, thereby obtaining 6-channel camera data and 5-channel radar data. Experiments show that the 6-channel camera data and 5-channel radar data can cover 360 degrees around the vehicle and achieve a good target detection effect, while processing them does not place an excessive burden on the on-board computing device.
The multi-channel camera data may be stitched into overall camera data (e.g., a camera panorama) and the multi-channel radar data may be stitched into overall radar data (e.g., a radar panorama). By means of camera pre-processing and radar pre-processing, the camera panorama and the radar panorama can be transformed to the same coordinate system and with the same resolution.
Camera feature extraction module 106 may perform feature extraction on the preprocessed camera data to generate a camera feature map, and radar feature extraction module 108 may perform feature extraction on the preprocessed radar data to generate a radar feature map.
The camera feature extraction module 106 may include a Convolutional Neural Network (CNN) model.
In one example, the CNN may be a residual network. Fig. 3 is a diagram of a residual network. The residual network includes an identity mapping and a residual mapping. The identity mapping refers to the "curved line" portion of Fig. 3, and the residual mapping refers to the remaining, non-"curved line" portion. x may represent the input feature map, F(x) is the mapping of the network before the summation node, and H(x) = F(x) + x is the mapping after the summation node. The residual network can pass shallow information to deep network layers, thereby avoiding problems such as vanishing gradients and network degradation and greatly enhancing the feature extraction and mapping capability of the network. Even if the depth of the network is increased, the training error of the network should be no higher than that of the original shallower network.
Because the camera data is dense, using a residual network for feature extraction allows the important features to be extracted while reducing the computational load, without affecting the accuracy of target detection.
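For illustration, a minimal residual block in the style of Fig. 3 might look as follows; this is a PyTorch sketch, and the layer sizes and use of batch normalization are assumptions, since the patent does not specify the exact residual network architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: H(x) = F(x) + x, as described for Fig. 3."""
    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(                 # the residual mapping F(x)
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.residual(x) + x)        # identity mapping adds x back
```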
The radar feature extraction module 108 may include a max pooling network, such as a three-level max pooling network.
Max pooling takes the maximum of local values. It captures local information and better preserves texture features.
The radar data is sparse; the max pooling network can discard useless features and make the useful features denser.
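A minimal sketch of such a pooling stack; the three-level structure follows the example above, but the kernel sizes, strides, channel count, and input resolution are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Three-stage max-pooling stack for the sparse radar map.
radar_pool = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=2),   # keeps the strongest response in each 2x2 patch
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

radar_map = torch.zeros(1, 5, 256, 256)      # batch x channels x H x W, mostly zeros (sparse)
radar_features = radar_pool(radar_map)       # 1 x 5 x 32 x 32, useful responses are denser
```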
The multi-scale fusion module 110 may perform downsampling on the feature maps (the camera feature map and the radar feature map) to different degrees to obtain feature maps with different resolutions, and then perform information fusion on the obtained feature maps with different resolutions, so as to increase the capability of the model for detecting targets with different sizes.
Fig. 4 is a diagram of a multi-scale fusion network, also referred to as a feature pyramid network. The multi-scale fusion network downsamples the original feature map to different degrees to obtain feature maps of different resolutions (different sizes), and fuses (concatenates) the resulting feature maps so as to increase the model's ability to detect targets of different sizes.
As shown in Fig. 4, the lowermost layer on the left is the original feature map; from bottom to top, the feature map is downsampled (for example, a feature map in tensor format is downsampled) so that its resolution is gradually reduced. Each feature map of a different resolution is output to the multi-scale fusion network (shown on the right) for subsequent target detection. The convolutional neural network can make predictions on feature maps of different sizes, thereby providing multi-scale prediction capability.
Applying the multi-scale fusion network helps distinguish targets of different sizes. For example, target detection for smart driving may need to detect targets (e.g., obstacles) of different sizes. Larger objects, such as buildings, can be identified using the lower-resolution feature maps, while smaller objects, such as vehicles and pedestrians, can be identified using the higher-resolution feature maps.
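The sketch below illustrates the idea in simplified form: downsample a feature map into a small pyramid, then upsample every level back to the finest resolution and concatenate them. A full feature pyramid network additionally uses lateral 1x1 convolutions and top-down addition, which are omitted here; all names and sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def multi_scale_pyramid(feature_map: torch.Tensor, levels: int = 3):
    """Downsample a B x C x H x W feature map into a pyramid of resolutions."""
    pyramid = [feature_map]
    for _ in range(levels - 1):
        pyramid.append(F.max_pool2d(pyramid[-1], kernel_size=2, stride=2))
    return pyramid

def fuse_pyramid(pyramid):
    """Upsample every level back to the finest resolution and concatenate (fuse) them."""
    target = tuple(pyramid[0].shape[-2:])
    upsampled = [F.interpolate(p, size=target, mode="nearest") for p in pyramid]
    return torch.cat(upsampled, dim=1)

fused = fuse_pyramid(multi_scale_pyramid(torch.randn(1, 64, 128, 128)))
```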
Note that while fig. 1 illustrates the multi-scale fusion module 110, in some implementations, the multi-scale fusion module 110 may also be omitted.
The target detection module 112 may input the camera feature map and the radar feature map output by the multi-scale fusion module 110 into a trained neural network model to detect one or more targets. The neural network model may be a transformer model.
The transformer model is a neural network model that includes an encoder and a decoder. The encoder may perform feature learning in a self-attentive manner over the global receptive field, e.g., learning the features of the points in a feature map. The decoder learns the features needed for the desired output, e.g., the features of the output detection boxes, through self-attention and cross-attention. The attention mechanism can be considered to occur between the encoder and the decoder, in other words, between the input feature map and the output detection targets. The attention mechanism provides an effective way to model global context information through Q, K, and V. The input is the query Q, and the context is stored in the form of key-value pairs (K, V). The attention mechanism is essentially a mapping function from one query Q onto a series of (key, value) pairs: it assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing. If each element in the sequence is stored in (K, V) form, attention performs addressing by computing the similarity of Q and K. The similarity of Q and K reflects the importance, i.e., the weight, of the corresponding V value, and the final feature value is then obtained by weighted summation. The computation of attention can be divided into three steps: first, compute the similarity of the query to each key to obtain the weights (common similarity functions include dot product, concatenation, perceptron, and the like); second, normalize these weights, typically using a softmax function; and finally, take the weighted sum of the weights and the corresponding values to obtain the final feature value.
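The three steps above correspond to standard scaled dot-product attention, sketched below with the dot product as the similarity function; the tensor shapes are illustrative, since the patent does not specify the transformer dimensions:

```python
import torch
import torch.nn.functional as F

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: similarity of Q and K gives weights over V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # step 1: dot-product similarity
    weights = F.softmax(scores, dim=-1)             # step 2: normalize the weights
    return weights @ V                              # step 3: weighted sum of the values

# e.g. 100 detection queries attending to 600 fused feature tokens of width 256
out = attention(torch.randn(100, 256), torch.randn(600, 256), torch.randn(600, 256))
```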
Fig. 5 illustrates a process of training a neural network in the object detection module 112, according to aspects of the present disclosure.
The training sample data may include a pair of camera feature maps and radar feature maps, as well as label data. The tag data includes data relating to one or more real targets corresponding to the camera profile and the radar profile, e.g., real location data, real category data, etc. for each real target.
In the present disclosure, the position data may be coordinate values of the object, and the category data may represent a category of the object. In one example, the categories of objects may include buildings, vehicles, pedestrians, and the like. In another example, vehicles may be classified into categories: trucks, buses, cars, school buses, etc.
At 502, a pair of camera and radar signatures may be input to a neural network to predict one or more targets (referred to herein as predicted targets), which are referred to herein as a set of predicted targets.
The output of the neural network with respect to each predicted target may include at least predicted location data and predicted category data.
The predicted position data may be coordinate values of the predicted target.
The prediction category data may be a prediction category vector for the predicted target. Each element of the prediction category vector (C_p1, C_p2, ..., C_pq) may correspond to a category (e.g., building, vehicle, pedestrian, etc., as described above), and the i-th element C_pi may express the probability that the predicted target belongs to the i-th category.
The real category vector of each real target has the same dimension as the prediction category vector of the predicted target, wherein the element C_tj of the real category vector (C_t1, C_t2, ..., C_tq) corresponding to the category j of the real target is 1, and all other elements are 0.
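For example, with an illustrative number of categories q (the values below are chosen arbitrarily for illustration), the two vectors and one possible category distance might look like this:

```python
import numpy as np

q = 5                                                     # number of categories (illustrative)
pred_class = np.array([0.10, 0.70, 0.10, 0.05, 0.05])     # prediction category vector (C_p1 ... C_pq)
true_class = np.eye(q)[1]                                 # real category vector: 1 for category j = 1, 0 elsewhere
class_distance = np.linalg.norm(pred_class - true_class)  # one possible category distance
```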
At 504, one or more targets predicted by the neural network may be matched with one or more real targets (referred to herein as a set of real targets) in the tag data.
In an aspect, a Hungarian algorithm can be used to bipartite graph match one or more predicted targets with one or more real targets in tag data.
Note that although the expression "one or more" is used here with respect to both the predicted target and the actual target, the number of predicted targets and actual targets may be different.
Specifically, the set of predicted targets may include M predicted targets (O_p1, O_p2, ..., O_pM), and the set of real targets may include N real targets (O_t1, O_t2, ..., O_tN). In one example, M ≥ N.
The M predicted targets may be matched against the N real targets to determine a predicted target that matches each real target, thereby obtaining N matched predicted targets (referred to herein as a set of matched targets).
For example, matching values may be computed pairwise between the M predicted targets (O_p1, O_p2, ..., O_pM) and the N real targets (O_t1, O_t2, ..., O_tN), and the predicted target with the maximum matching value for each real target is determined as that real target's matching target, thereby obtaining N matching targets (O_m1, O_m2, ..., O_mN), where O_mi is the predicted target matched to the i-th real target.
In one example, the match value may be determined according to the location, category (e.g., building, vehicle, pedestrian, etc.) of the object, and so forth.
In one example, the match value may be a location distance, a category distance, or a combination thereof (e.g., a weighted sum) of the predicted target and the real target.
In an aspect, the location distance of the prediction target may be determined according to the coordinate values of the prediction target and the real target.
In another aspect, the class distance may be determined from a prediction class vector of the predicted target and a real class vector of the real target. For example, a vector distance of the prediction class vector from the true class vector may be calculated as its class distance.
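A sketch of such bipartite matching using SciPy's implementation of the Hungarian algorithm; here the matching value is expressed as a cost (smaller is better) combining the position distance and the category distance, with illustrative weights that the patent does not specify:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_pos, pred_cls, true_pos, true_cls, w_pos=1.0, w_cls=1.0):
    """Match M predicted targets to N real targets with the Hungarian algorithm.

    pred_pos: M x 3 predicted coordinates,  pred_cls: M x q category probability vectors.
    true_pos: N x 3 real coordinates,       true_cls: N x q one-hot category vectors.
    Returns (pred_idx, true_idx) index pairs of the matched targets.
    """
    pos_cost = np.linalg.norm(pred_pos[:, None, :] - true_pos[None, :, :], axis=-1)  # M x N
    cls_cost = np.linalg.norm(pred_cls[:, None, :] - true_cls[None, :, :], axis=-1)  # M x N
    cost = w_pos * pos_cost + w_cls * cls_cost
    pred_idx, true_idx = linear_sum_assignment(cost)   # minimizes the total matching cost
    return pred_idx, true_idx
```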
At 506, a total loss value between the matching target set and the true target set may be determined.
Target loss values between each real target and the corresponding matching target may be calculated, and the calculated target loss values are summed to obtain a total loss value.
In an aspect, the target loss value may be a weighted sum of a plurality of loss values (e.g., a location loss value and a category loss value).
The position loss value may be an L1 loss computed between the position data of the matching target and that of the real target.
The category loss value may be the focal loss between the category of the matching target and that of the real target. The focal loss value may be the category distance as calculated in step 504.
The plurality of loss values may further include an orientation loss value and a speed loss value.
The orientation loss value may be an L1 loss function of the orientation data of the matching target and the real target, and the speed loss value may be an L1 loss function of the speed data of the matching target and the real target.
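A sketch of the combined loss over the matched pairs; the loss weights and the focal-loss parameters (gamma, alpha) are illustrative assumptions, and the orientation and speed terms mentioned above are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def detection_loss(matched_pred_pos, true_pos, matched_pred_cls, true_cls,
                   w_pos=1.0, w_cls=1.0, gamma=2.0, alpha=0.25):
    """Weighted sum of an L1 position loss and a focal category loss over matched pairs."""
    pos_loss = F.l1_loss(matched_pred_pos, true_pos, reduction="sum")

    # Focal loss of the predicted category probabilities against the one-hot real categories.
    p = matched_pred_cls.clamp(1e-6, 1 - 1e-6)
    pt = torch.where(true_cls == 1, p, 1 - p)
    cls_loss = (-alpha * (1 - pt) ** gamma * pt.log()).sum()

    return w_pos * pos_loss + w_cls * cls_loss
```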
At 508, parameters of the neural network model may be adjusted according to the calculated total loss value.
For example, a gradient descent method may be used to adjust parameters of the neural network model (e.g., weights of the neural network nodes).
Steps 502-508 may be iterated until the loss values converge, resulting in a trained neural network model.
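Putting steps 502-508 together, a training iteration might look like the following sketch; it reuses the match_predictions and detection_loss helpers sketched above and assumes the model returns a dict with "pos" and "cls" tensors, which is an assumption about the interface rather than the patent's implementation:

```python
import torch

def train(model, optimizer, dataloader, epochs=10):
    """Iterate steps 502-508 for a fixed number of epochs (convergence check omitted)."""
    for _ in range(epochs):
        for cam_feat, radar_feat, labels in dataloader:
            preds = model(cam_feat, radar_feat)                        # 502: predict targets
            p_idx, t_idx = match_predictions(                          # 504: Hungarian matching
                preds["pos"].detach().cpu().numpy(),
                preds["cls"].detach().cpu().numpy(),
                labels["pos"].numpy(), labels["cls"].numpy())
            p_idx, t_idx = torch.as_tensor(p_idx), torch.as_tensor(t_idx)
            loss = detection_loss(preds["pos"][p_idx], labels["pos"][t_idx],   # 506: loss value
                                  preds["cls"][p_idx], labels["cls"][t_idx])
            optimizer.zero_grad()
            loss.backward()                                            # 508: adjust parameters
            optimizer.step()
```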
Fig. 6 is a flow diagram of a method for target detection for smart driving according to aspects of the present disclosure.
At step 602, the camera data and radar data may be pre-processed.
In an aspect, preprocessing the camera data may include normalizing and padding the images of the camera data, and preprocessing the radar data may include converting the radar data to the same coordinate system as the camera data.
At step 604, features may be extracted from the preprocessed camera data and radar data to obtain a camera feature map and a radar feature map.
In an aspect, extracting features from the preprocessed camera data and radar data may include: extracting features from the camera data using a residual network to obtain the camera feature map; and extracting features from the radar data using a max pooling network to obtain the radar feature map.
At step 606, the camera feature map and the radar feature map can be input into a neural network to detect one or more targets.
In one aspect, the method further comprises training the neural network using training sample data, the training sample data comprising a training camera feature map, a training radar feature map, and tag data, the tag data comprising data relating to one or more real targets, the training comprising: inputting the training camera feature map and the training radar feature map into the neural network to obtain one or more predicted targets; matching the one or more predicted targets with one or more real targets in the tag data; determining a loss value using the matched predicted target and real target; and adjusting a parameter of the neural network according to the loss value.
In an aspect, the loss value is a weighted sum of a location loss value and a category loss value, the location loss value is an L1 loss value, and the category loss value is a focal loss value.
In an aspect, wherein matching the one or more predicted targets with the one or more real targets comprises: matching the one or more predicted targets with the one or more real targets using the Hungarian algorithm.
In one aspect, the camera data includes multiple camera data and the radar data includes multiple radar data, the method further comprising:
and splicing the multi-path camera data into integral camera data, and splicing the multi-path radar data into integral radar data.
In one aspect, the neural network is a transformer network.
Fig. 7 is a diagram of an electronic device for object detection, in accordance with aspects of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a memory 702 and a processor 704. The memory 702 stores program instructions; the processor 704 may be coupled to and in communication with the memory 702 via the bus 706, and the processor 704 may call the program instructions in the memory 702 to perform the following steps: preprocessing camera data and radar data; extracting features from the preprocessed camera data and radar data to obtain a camera feature map and a radar feature map; and inputting the camera feature map and the radar feature map into a neural network to detect one or more targets.
Optionally, processor 704 may also call program instructions in memory 702 to perform the following steps: training the neural network using training sample data, the training sample data comprising a training camera feature map, a training radar feature map and tag data, the tag data comprising data relating to one or more real targets, the training comprising: inputting the training camera feature map and the training radar feature map into the neural network to obtain one or more predicted targets; matching the one or more predicted targets with one or more real targets in the tag data; determining a loss value using the matched predicted target and real target; and adjusting a parameter of the neural network according to the loss value.
Optionally, processor 704 may also call program instructions in memory 702 to perform the following steps: matching the one or more predicted targets with the one or more real targets using the Hungarian algorithm.
Optionally, the loss value is a weighted sum of a position loss value and a category loss value, the position loss value is an L1 loss value, and the category loss value is a focal loss value.
Optionally, the second loss value is determined according to a photometric loss function.
Optionally, the camera data includes multi-channel camera data and the radar data includes multi-channel radar data, and the processor 704 may also call program instructions in the memory 702 to perform the following steps: stitching the multi-channel camera data into overall camera data, and stitching the multi-channel radar data into overall radar data.
Optionally, processor 704 may also call program instructions in memory 702 to perform the following steps: normalizing and padding the images of the camera data; and converting the radar data to the same coordinate system as the camera data.
Optionally, processor 704 may also call program instructions in memory 702 to perform the following steps: extracting features from the camera data using a residual network to obtain the camera feature map; and extracting features from the radar data using a max pooling network to obtain the radar feature map.
Optionally, the neural network is a transformer network.
The illustrations set forth herein in connection with the figures describe example configurations and are not intended to represent all examples that may be implemented or fall within the scope of the claims. The term "exemplary" as used herein means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other examples. The detailed description includes specific details to provide an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described examples.
In the drawings, similar components or features may have the same reference numerals. Further, various components of the same type may be distinguished by following the reference label by a dash and a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
The various illustrative blocks and modules described in connection with the disclosure herein may be implemented or performed with a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Other examples and implementations are within the scope of the disclosure and the following claims. For example, due to the nature of software, the functions described above may be implemented using software executed by a processor, hardware, firmware, hard-wired, or any combination thereof. Features that implement a function may also be physically located at various positions, including being distributed such that portions of the function are implemented at different physical locations. In addition, as used herein, including in the claims, "or" as used in a list of items (e.g., a list of items accompanied by a phrase such as "at least one of" or "one or more of") indicates an inclusive list, such that, for example, a list of at least one of a, B, or C means a or B or C or AB or AC or BC or ABC (i.e., a and B and C). Also, as used herein, the phrase "based on" should not be read as referring to a closed condition set. For example, an exemplary step described as "based on condition a" may be based on both condition a and condition B without departing from the scope of the present disclosure. In other words, the phrase "based on," as used herein, should be interpreted in the same manner as the phrase "based, at least in part, on.
Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Non-transitory storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, non-transitory computer-readable media can comprise RAM, ROM, electrically erasable programmable read-only memory (EEPROM), compact Disk (CD) ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other non-transitory medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes CD, laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
The description herein is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for target detection, comprising:
preprocessing camera data and radar data;
extracting features from the preprocessed camera data and radar data to obtain a camera feature map and a radar feature map; and
inputting the camera feature map and the radar feature map into a neural network to detect one or more targets.
2. The method of claim 1, further comprising training the neural network using training sample data, the training sample data comprising a training camera feature map, a training radar feature map, and label data, the label data comprising data related to one or more real targets, the training comprising:
inputting the training camera feature map and the training radar feature map into the neural network to obtain one or more predicted targets;
matching the one or more predicted targets with one or more real targets in the tag data;
determining a loss value using the matched predicted target and real target; and
adjusting a parameter of the neural network according to the loss value.
3. The method of claim 2, wherein the loss value is a weighted sum of a location loss value and a category loss value, the location loss value is an L1 loss value, and the category loss value is a focal loss value.
4. The method of claim 2, wherein matching the one or more predicted targets to the one or more real targets comprises: matching the one or more predicted targets with the one or more real targets using a Hungarian algorithm.
5. The method of claim 1, wherein the camera data comprises multi-channel camera data and the radar data comprises multi-channel radar data, the method further comprising:
and splicing the multi-path camera data into integral camera data, and splicing the multi-path radar data into integral radar data.
6. The method of claim 1, wherein preprocessing the camera data comprises: normalizing and padding the images of the camera data; and
wherein preprocessing the radar data comprises: converting the radar data to the same coordinate system as the camera data.
7. The method of claim 1, wherein extracting features from the pre-processed camera data and radar data comprises:
extracting features from the camera data using a residual network to obtain the camera feature map; and
extracting features from the radar data using a max-pooling network to obtain the radar feature map.
8. The method of claim 1, wherein the neural network is a transformer network.
9. An apparatus for target detection, comprising:
a module for pre-processing camera data and radar data;
a module for extracting features from the preprocessed camera data and radar data to obtain a camera feature map and a radar feature map; and
means for inputting the camera feature map and the radar feature map into a neural network to detect one or more targets.
10. The apparatus of claim 9, further comprising means for training the neural network using training sample data, the training sample data comprising a training camera feature map, a training radar feature map, and label data, the label data comprising data related to one or more real targets, the means for training configured to:
inputting the training camera feature map and the training radar feature map into the neural network to obtain one or more predicted targets;
matching the one or more predicted targets with one or more real targets in the tag data;
determining a loss value using the matched predicted target and real target; and
adjusting a parameter of the neural network according to the loss value.
11. An electronic device comprising a processor and a memory, the memory storing program instructions; the processor executes program instructions to implement the method for object detection as claimed in any one of claims 1 to 8.
CN202211078943.7A 2022-09-05 2022-09-05 Method and apparatus for target detection for smart driving Pending CN115376107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211078943.7A CN115376107A (en) 2022-09-05 2022-09-05 Method and apparatus for target detection for smart driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211078943.7A CN115376107A (en) 2022-09-05 2022-09-05 Method and apparatus for target detection for smart driving

Publications (1)

Publication Number Publication Date
CN115376107A true CN115376107A (en) 2022-11-22

Family

ID=84068929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211078943.7A Pending CN115376107A (en) 2022-09-05 2022-09-05 Method and apparatus for target detection for smart driving

Country Status (1)

Country Link
CN (1) CN115376107A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116680656A (en) * 2023-07-31 2023-09-01 合肥海普微电子有限公司 Automatic driving movement planning method and system based on generating pre-training converter
CN116680656B (en) * 2023-07-31 2023-11-07 合肥海普微电子有限公司 Automatic driving movement planning method and system based on generating pre-training converter


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination