WO2022012158A1

WO2022012158A1 - Target determination method and target determination device

Info

Publication number: WO2022012158A1
Application number: PCT/CN2021/094781
Authority: WO
Inventors: 原崧育; 杨臻; 张维
Original assignee: 华为技术有限公司
Priority date: 2020-07-17
Filing date: 2021-05-20
Publication date: 2022-01-20
Also published as: CN114022830A

Abstract

A target determination method, relating to the field of perception fusion, and applied in an intelligent vehicle or an intelligent connected vehicle. The method comprises: obtaining an image to be processed and a plurality of millimeter wave detection points, said image and the plurality of millimeter wave detection points being data synchronously obtained for a same detection target, and each millimeter wave detection point comprising depth information; mapping the plurality of millimeter wave detection points to said image; determining, according to first information, a plurality of candidate boxes of the detection target on said image, the first information comprising the depth information and position information of each millimeter wave detection point; and performing non-maximum suppression (NMS) processing on the plurality of candidate boxes according to the depth information to output a target box and a target millimeter wave detection point, the target box being determined according to depth information and position information of the target millimeter wave detection point. The method can improve the accuracy of correlation of detection results of a visual sensor and a millimeter wave radar.

Description

A target determination method and target determination device

This application claims the priority of the Chinese patent application filed on July 17, 2020 with the application number 202010692086.4 and the application title is "A target determination method and target determination device", the entire contents of which are incorporated herein by reference Applying.

technical field

The present application relates to the field of communication technologies, and in particular, to a target determination method and a target determination device.

Background technique

Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.

Object detection and recognition refers to finding objects from a scene (eg, an image), which can include two processes of detection and recognition. The detection specifically refers to judging whether there is a target, and if there is a target, determining the position of the target. Identifying specifically refers to identifying categories of targets. Object detection and recognition have a wide range of applications in many fields of life, such as automatic driving, driving assistance and early warning. In the process of target detection and recognition, multi-sensor fusion is usually required, for example, the data collected by lidar, millimeter-wave radar, vision sensor, infrared sensor, etc. Detection and identification of objects in the environment.

However, to accurately fuse multi-sensor data, one-to-one correlation of targets in different sensors is required, that is, multi-sensor target matching. After completing the multi-sensor target matching, the accurate information of the target can be obtained through fusion. Due to the different characteristics of different sensors, it is difficult to correlate the detection results of heterogeneous sensors. Among them, the correlation between the detection results of vision sensors and millimeter-wave radars is particularly difficult.

SUMMARY OF THE INVENTION

The embodiments of the present application provide a method for determining a target, so that the accuracy of the correlation between the detection results of the vision sensor and the millimeter wave radar can be improved.

To achieve the above purpose, the embodiments of the present application provide the following technical solutions:

A first aspect of the present application provides a target determination method, and the method provided by the present application can be adapted to the field of automatic driving or the field of monitoring. It may include: acquiring a to-be-processed image and multiple millimeter-wave detection points, where the to-be-processed image and the multiple millimeter-wave detection points are data obtained synchronously for the same detection target. The working principle of millimeter-wave radar is to use high-frequency circuits to generate electromagnetic waves with specific modulation frequencies, and to send electromagnetic waves and receive electromagnetic waves from the target through the antenna, and calculate the parameters of the target through the parameters of the transmitted and received electromagnetic waves. Millimeter-wave radar can measure distance, speed and azimuth of multiple targets at the same time. Velocity measurement is based on the Doppler effect, while azimuth measurement (including horizontal and vertical angles) is achieved by means of an array of antennas. It can be understood that a millimeter-wave detection point includes various parameters of the target. Specifically, the millimeter-wave detection point in this solution includes depth information, that is, parameters obtained through ranging. Of course, the millimeter wave detection point may also include other parameters. For example, a millimeter wave detection point may include the depth information of the target, the speed information of the target (parameters obtained through speed measurement), and the azimuth information of the target (parameters obtained through azimuth measurement). The data acquired synchronously can be understood as the millimeter-wave radar and the image sensor simultaneously collect data, or it can be understood as the deviation of the frame rate of the data collected by the millimeter-wave radar and the image sensor within a preset range. For example, for the same detection target, the millimeter-wave radar collects the millimeter-wave detection points according to the first frame rate, and the image sensor collects the image to be processed according to the second frame rate, and the deviation between the first frame rate and the second frame rate is less than the preset threshold, namely It can be considered that the millimeter-wave radar and the image to be processed are data acquired synchronously. The image to be processed can be obtained through the vision sensor, and multiple millimeter-wave detection points can be obtained through the millimeter-wave radar. When the method provided in this application is applied in the scene of automatic driving, the image to be processed may be an image obtained by the vehicle through a visual sensor, and specifically, the image to be processed may be an image captured by the vehicle through a camera installed on the vehicle. When the method provided in this application is applied to a monitoring scene, the image to be processed may be an image acquired by a visual sensor installed on the roadside, and specifically, the image to be processed may be an image captured by a camera installed on the roadside. Each millimeter-wave detection point may include depth information, where the depth information is used to represent the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to acquire multiple millimeter-wave detection points. The detection target can be any target such as vehicles, people, trees, etc. Map multiple mmWave detection points onto the image to be processed. A plurality of candidate frames of the detection target on the to-be-processed image are determined according to the first information. The first information may include depth information and position information of each millimeter-wave detection point. The position information is used to indicate that each millimeter-wave detection point is mapped on the to-be-processed image. Process the position on the image. A set of candidate frames can be determined according to the depth information and position information of each millimeter wave detection point, and the set of candidate frames includes multiple candidate frames. Perform non-maximun suppression (NMS) processing on multiple candidate frames according to the depth information to output the target frame and the target millimeter wave detection point. The target frame is based on the depth information and position information of the target millimeter wave detection point. definite. It can be seen from the first aspect that by mapping the millimeter wave detection points to the image to be processed, multiple candidate frames of the to-be-processed image are determined according to the position information and depth information of the millimeter wave detection points, and the multiple candidate frames are processed according to the depth information. In NMS processing, when the final candidate frame is determined, the millimeter wave detection points associated with the candidate frame can be output to improve the accuracy of target matching.

Optionally, in combination with the above-mentioned first aspect, in a first possible implementation manner, performing non-maximum value suppression NMS processing on multiple candidate frames according to depth information may include: pairing multiple candidate frames according to the first score and the second score. Each candidate frame is subjected to non-maximum value suppression NMS processing, and the first score represents the probability that the detection target in each candidate frame belongs to each of the N categories determined by the classifier, and the N categories are preset Category, N is a positive integer, and the second score indicates the probability that the detection target in each candidate frame belongs to each of the N categories, determined according to the depth information and the first probability distribution between each category. It can be known from the first possible implementation manner of the first aspect that a specific manner of performing NMS processing on multiple candidate frames according to depth information is provided. The first possible implementation method of the first aspect is to perform NMS processing on multiple candidate boxes through the first score and the second score to improve the accuracy of data association, that is, to improve the accuracy of the one-to-one matching of the same target in different sensors. .

Optionally, in combination with the first possible implementation manner of the above-mentioned first aspect, in the second possible implementation manner, the method may further include: performing statistics on the data in the first set, and determining the statistics corresponding to each category The probability distribution of the first size of the target, the first set may include a plurality of statistical targets corresponding to each category, and size information of each statistical target. The first probability distribution is determined according to the probability distribution of the first size and the first relationship, where the first relationship is the relationship between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target. It can be seen from the second possible implementation manner of the first aspect that a specific manner of how to determine the first probability distribution is given, which increases the diversity of the scheme.

Optionally, in combination with the second possible implementation manner of the above-mentioned first aspect, in a third possible implementation manner, the method may further include: performing statistics on the data in the second set, and determining the statistics corresponding to each category The probability distribution of the second size of the objects, the second set may include a plurality of statistical objects corresponding to each category, and size information of each statistical object. The second probability distribution is determined according to the second size distribution and the second relationship, the second probability distribution is used to update the first probability distribution, and the second relationship is the difference between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target relationship between. It can be known from the third possible implementation manner of the first aspect that the data can be updated. For example, in an autonomous driving scenario, the depth information and the probability distribution between each category can be determined through the updated data.

Optionally, in combination with the second possible implementation manner of the first aspect or the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the size information is height information of the statistical target. It can be known from the fourth possible implementation manner of the first aspect that a specific category of size information is given, which increases the diversity of solutions.

Optionally, in combination with the first aspect or the fourth possible implementation manner of the first aspect to the first aspect, in a fifth possible implementation manner, the location information is used in combination with the millimeter wave detection point on the vehicle. The distribution characteristics of , determine the position of the candidate frame in the image to be processed. It can be seen from the fifth possible implementation manner of the first aspect that when determining the position of the candidate frame in the image to be processed, considering the distribution characteristics of the millimeter wave detection points on the vehicle, the position of the millimeter wave detection points can be better determined. Determine the position of the detection target in the image to be processed. For example, if the relationship between the vehicle and the millimeter-wave detection points is determined according to the distribution characteristics of the millimeter-wave detection points on the vehicle, the millimeter-wave detection point is generally located in the lower left corner of the target vehicle, then the millimeter-wave detection point can be located in the lower left corner of the prior frame. The corners determine multiple a priori boxes. If the distribution characteristics of the millimeter-wave detection points are not considered, the position of the a priori frame is arbitrarily determined according to the millimeter-wave detection points. The probability of detecting the target will decrease.

Optionally, in combination with the first aspect or the fifth possible implementation manner of the first aspect to the first aspect, in the sixth possible implementation manner, the depth information is used to determine the size of the candidate frame, and the candidate frame The size of is negatively correlated with depth information.

Optionally, in combination with the first aspect or the sixth possible implementation manner of the first aspect to the first aspect, in the seventh possible implementation manner, the method may further include: using an efficient regional convolutional neural network (faster regions with convolution neural network, Faster-RCNN) process the image to be processed to obtain the first feature map of the image to be processed. The second feature maps corresponding to the plurality of candidate frames are extracted from the first feature map. The second feature map is processed through a regression network and a classifier to obtain a first result, and the first result is used for non-maximum suppression NMS processing.

Optionally, in combination with the first aspect or the seventh possible implementation manner of the first aspect to the first aspect, in the eighth possible implementation manner, the image to be processed is acquired by the visual sensor, and the sampling of the visual sensor is used. The frequency is the first frequency, the sampling frequency of the millimeter wave radar is the second frequency, and the difference between the first frequency and the second frequency is not greater than a preset threshold.

A second aspect of the present application provides a device for determining a target, which may include: an acquisition module configured to acquire an image to be processed and multiple millimeter-wave detection points, where the image to be processed and the multiple millimeter-wave detection points are data obtained synchronously for the same target , each millimeter-wave detection point can include depth information, the depth information is used to indicate the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to obtain multiple millimeter-wave detection points. The mapping module is used to map the multiple millimeter wave detection points acquired by the acquisition module to the to-be-processed image acquired by the acquisition module. The processing module is configured to determine a plurality of candidate frames of the detection target on the image to be processed according to the first information, the first information may include depth information and position information of each millimeter wave detection point, and the position information is used to represent each millimeter wave The location of the detection point mapped on the image to be processed. The processing module is also used to perform non-maximum suppression NMS processing on multiple candidate frames according to the depth information, so as to output the target frame and the target millimeter wave detection point. The target frame is determined according to the depth information and position information of the target millimeter wave detection point. of.

Optionally, in combination with the above second aspect, in a first possible implementation manner, the processing module is specifically configured to: perform non-maximum suppression NMS processing on multiple candidate frames according to the first score and the second score, and the first One score represents the probability that the detection target in each candidate frame belongs to each of the N categories determined by the classifier, where the N categories are preset categories, N is a positive integer, and the second score represents the depth The probability that the detection target in each candidate frame belongs to each of the N categories, determined by the first probability distribution between the information and each category.

Optionally, in combination with the first possible implementation manner of the second aspect, in the second possible implementation manner, the target determination device may further include a statistics module, a statistics module for performing statistics on the data in the first set. , determine the probability distribution of the first size of the statistical objects corresponding to each category, and the first set may include multiple statistical objects corresponding to each category, and size information of each statistical object. The first probability distribution is determined according to the probability distribution of the first size and the first relationship, where the first relationship is the relationship between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target.

Optionally, in combination with the second possible implementation manner of the second aspect, in a third possible implementation manner, the statistics module is further configured to: perform statistics on the data in the second set, and determine the corresponding data of each category. The probability distribution of the second size of the statistical objects, the second set may include a plurality of statistical objects corresponding to each category, and size information of each statistical object. The second probability distribution is determined according to the second size distribution and the second relationship, the second probability distribution is used to update the first probability distribution, and the second relationship is the difference between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target relationship between.

Optionally, in combination with the second possible implementation manner of the second aspect or the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the size information is height information of the statistical target.

Optionally, in combination with the second aspect or the fourth possible implementation manner of the first to the second aspect, in a fifth possible implementation manner, the location information is used in combination with the millimeter wave detection point on the vehicle. The distribution characteristics of , determine the position of the candidate frame in the image to be processed.

Optionally, in combination with the second aspect or the fifth possible implementation manner of the second aspect or the first to the second aspect, in the sixth possible implementation manner, the depth information is used to determine the size of the candidate frame, and the candidate frame The size of is negatively correlated with depth information.

Optionally, in combination with the second aspect or the sixth possible implementation manner of the second aspect or the first to the second aspect, in the seventh possible implementation manner, the processing module is further configured to: The processed image is processed to obtain a first feature map of the to-be-processed image. The second feature maps corresponding to the plurality of candidate frames are extracted from the first feature map. The second feature map is processed through a regression network and a classifier to obtain a first result, and the first result is used for non-maximum suppression NMS processing.

Optionally, in combination with the second aspect or the seventh possible implementation manner of the second aspect or the first to the second aspect, in the eighth possible implementation manner, the image to be processed is acquired by the visual sensor, and the sampling of the visual sensor is used. The frequency is the first frequency, the sampling frequency of the millimeter wave radar is the second frequency, and the difference between the first frequency and the second frequency is not greater than a preset threshold.

A third aspect of the present application provides a smart car. The smart car may include a processor, the processor is coupled with a memory, and the memory stores program instructions. When the program instructions stored in the memory are executed by the processor, the first aspect or any one of the first aspect methods described in possible implementations.

A fourth aspect of the present application provides a monitoring device. The monitoring device has a processor, the processor is coupled to a memory, and the memory stores program instructions. When the program instructions stored in the memory are executed by the processor, the first aspect or any one of the first aspects may be possible. The method described in the embodiment.

A fifth aspect of the present application provides a computer-readable storage medium, which may include a program that, when executed on a computer, causes the computer to execute the method described in the first aspect or any possible implementation manner of the first aspect.

A sixth aspect of the present application provides a target determination system. The target determination system may include an end-side device and a cloud-side device, and the end-side device is used to acquire a to-be-processed image and multiple millimeter wave detection points, the to-be-processed image and multiple millimeter wave detection points. The wave detection point is the data obtained synchronously for the same detection target. Each millimeter wave detection point can include depth information. The depth information is used to indicate the distance between the detection target and the millimeter wave radar. The millimeter wave radar is used to obtain multiple millimeter wave detection points. . The cloud-side device is used to receive the to-be-processed image and multiple millimeter-wave detection points sent by the end-side device. The cloud-side device is also used to map multiple millimeter wave detection points to the image to be processed. The cloud-side device is further configured to determine multiple candidate frames of the detection target on the to-be-processed image according to the first information, where the first information may include depth information and position information of each millimeter wave detection point, and the position information is used to represent each The location of the mmWave detection point mapped on the image to be processed. The cloud-side device is also used to perform non-maximum suppression NMS processing on multiple candidate frames according to the depth information to output the target frame and the target millimeter wave detection point. The target frame is based on the depth information and position information of the target millimeter wave detection point. definite.

Optionally, in combination with the above sixth aspect, in a first possible implementation manner, the cloud-side device is specifically configured to perform non-maximum suppression NMS processing on multiple candidate frames according to the first score and the second score, and the first One score represents the probability that the detection target in each candidate frame belongs to each of the N categories determined by the classifier, where the N categories are preset categories, N is a positive integer, and the second score represents the depth The probability that the detection target in each candidate frame belongs to each of the N categories, determined by the first probability distribution between the information and each category.

Optionally, in combination with the first possible implementation manner of the sixth aspect, in the second possible implementation manner, the cloud-side device is further configured to perform statistics on the data in the first set, and determine the corresponding data of each category. The probability distribution of the first size of the statistical objects, the first set may include a plurality of statistical objects corresponding to each category, and size information of each statistical object. The first probability distribution is determined according to the probability distribution of the first size and the first relationship, where the first relationship is the relationship between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target.

Optionally, in combination with the second possible implementation manner of the sixth aspect, in the third possible implementation manner, the cloud-side device is further configured to perform statistics on the data in the second set, and determine the corresponding data of each category. The probability distribution of the second size of the statistical objects, the second set may include a plurality of statistical objects corresponding to each category, and size information of each statistical object. The second probability distribution is determined according to the second size distribution and the second relationship, the second probability distribution is used to update the first probability distribution, and the second relationship is the difference between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target relationship between.

Optionally, in combination with the second possible implementation manner of the sixth aspect or the third possible implementation manner of the sixth aspect, in a fourth possible implementation manner, the size information is height information of the statistical target.

Optionally, in combination with the sixth aspect or the fourth possible implementation manner of the sixth aspect from the first to the sixth aspect, in the fifth possible implementation manner, the location information is used in combination with the millimeter wave detection point on the vehicle. The distribution characteristics of , determine the position of the candidate frame in the image to be processed.

Optionally, in combination with the sixth aspect or the fifth possible implementation manner of the sixth aspect or the first to sixth aspects, in the sixth possible implementation manner, the depth information is used to determine the size of the candidate frame, and the candidate frame The size of is negatively correlated with depth information.

Optionally, in combination with the sixth aspect or the sixth possible implementation manner of the sixth aspect from the first to the sixth aspect, in the seventh possible implementation manner, the cloud-side device is also used for processing by Faster-RCNN. The processed image is processed to obtain a first feature map of the to-be-processed image. The second feature maps corresponding to the plurality of candidate frames are extracted from the first feature map. The second feature map is processed through a regression network and a classifier to obtain a first result, and the first result is used for non-maximum suppression NMS processing.

Optionally, in combination with the sixth aspect or the seventh possible implementation manner of the sixth aspect or the first to sixth aspects, in the eighth possible implementation manner, the end-side device acquires the image to be processed through a visual sensor, and the visual The sampling frequency of the sensor is the first frequency, the sampling frequency of the millimeter wave radar is the second frequency, and the difference between the first frequency and the second frequency is not greater than a preset threshold.

A seventh aspect of the present application provides a model training method, which may include: acquiring a training image and multiple millimeter-wave detection points, data obtained from the training image and multiple millimeter-wave detection points synchronously for the same detection target, each millimeter-wave detection point Depth information can be included, and the depth information is used to indicate the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to obtain multiple millimeter-wave detection points. Map multiple mmWave probe points onto the training image. A plurality of candidate frames of the detection target on the training image are determined according to the first information. The first information may include depth information and position information of each millimeter wave detection point, and the position information is used to indicate that each millimeter wave detection point is mapped on the training image. on the location. The model is trained according to the feature maps corresponding to multiple candidate boxes.

Optionally, in combination with the above seventh aspect, in a first possible implementation manner, the position information is used to determine the position of the candidate frame in the training image in combination with the distribution characteristics of the millimeter wave detection points on the vehicle.

Optionally, in combination with the seventh aspect or the first possible implementation manner of the seventh aspect, in the second possible implementation manner, the depth information is used to determine the size of the candidate frame, and the size of the candidate frame is negatively correlated with the depth information. .

Optionally, in combination with the seventh aspect or the first possible implementation manner of the seventh aspect or the second possible implementation manner of the seventh aspect, in the third possible implementation manner, convolution processing may also be performed on the training image to obtain: The first feature map of the training image. The second feature maps corresponding to the plurality of candidate frames are extracted from the first feature map, and the model is trained according to the second feature maps.

Optionally, in combination with the seventh aspect or the third possible implementation manner of the seventh aspect or the first to seventh aspects, in a fourth possible implementation manner, the training image is acquired by the visual sensor, and the sampling frequency of the visual sensor is is the first frequency, the sampling frequency of the millimeter wave radar is the second frequency, and the difference between the first frequency and the second frequency is not greater than a preset threshold.

Description of drawings

Figure 1a is a schematic flowchart of the fusion of detection results at the target level;

Figure 1b is a schematic flowchart of feature-level fusion;

Figure 2 is a schematic diagram of the detection performance of heterogeneous sensors in different dimensions;

3 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;

4 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present application;

5 is a schematic diagram of an efficient regional convolutional neural network;

6 is a schematic flowchart of a target determination method provided by an embodiment of the present application;

7a is a schematic diagram of an application scenario of a target determination method provided by the present application;

7b is a schematic diagram of an application scenario of another target determination method provided by the present application;

7c is a schematic diagram of an application scenario of another target determination method provided by the present application;

7d is a schematic diagram of an application scenario of another target determination method provided by the present application;

FIG. 7e is a schematic diagram of an application scenario of another target determination method provided by the present application;

8 is a schematic flowchart of another target determination method provided by an embodiment of the present application;

9a is a schematic diagram of an application scenario of another target determination method provided by the present application;

9b is a schematic diagram of an application scenario of another target determination method provided by the present application;

10 is a schematic diagram of the probability distribution of the first size provided by the application;

11 is a schematic flowchart of another target determination method provided by an embodiment of the present application;

12 is a schematic diagram of an application scenario of a target determination method provided by the present application;

Fig. 13 is the effect comparison diagram of the scheme provided by the embodiment of the present application and other schemes;

14 is a schematic flowchart of a model training method provided by an embodiment of the present application;

15 is a schematic structural diagram of a target determination device provided by the application;

16 is a schematic structural diagram of a model training device provided by the application;

17 is a schematic structural diagram of another target determination device provided by the application;

FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the present application.

detailed description

The embodiments of the present application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Those of ordinary skill in the art know that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

In order to better understand the applicable fields and scenarios of the solution provided by this application, before the specific introduction of the technical solution provided by this application, the related knowledge of multi-sensor information fusion is first introduced.

Multi-sensor information fusion (MSIF) is to use computer technology to automatically analyze and synthesize information or data from multiple sensors or multiple sources under certain criteria to complete the required decision-making and estimation. information processing process. The definition of sensor data fusion can be summarized as synthesizing the local data resources provided by multiple sensors of the same or different types distributed in different locations, and using computer technology to analyze them to eliminate the possible redundancy and redundancy between multi-sensor information. Contradictions, complement each other, reduce their uncertainty, and obtain a consistent interpretation and description of the measured target, thereby improving the rapidity and correctness of system decision-making, planning, and response, and enabling the system to obtain more adequate information. In this application, the same type of sensor is sometimes referred to as a homogeneous sensor, and a different type of sensor is referred to as a heterogeneous sensor. When the difference between the two is not emphasized, the two have the same meaning. In addition, it should be noted that this application sometimes refers to multi-sensor information fusion as multi-sensor data fusion, or multi-sensor fusion, and when their differences are not emphasized, they mean the same thing.

Sensor information fusion can be used for information fusion at different levels, such as target-level detection result fusion (high-level fusion) and feature-level fusion (feature-level fusion). Among them, high-level fusion refers to the fusion of target-level detection results of multiple homogeneous or heterogeneous sensors after obtaining target-level detection results from the data of a single sensor. Feature-level fusion refers to the fusion of the extracted features of multiple homogeneous or heterogeneous sensors after the feature extraction of the measurement data of a single sensor to form the target-level detection result. 1a and FIG. 1b are described below. FIG. 1a is a schematic flowchart of target-level detection result fusion, and FIG. 1b is a schematic flowchart of feature-level fusion. As shown in Figure 1a, it is assumed that there are multiple sensors, namely the first sensor, the second sensor and the third sensor. The data acquired by the first sensor is processed by the first perception algorithm to output the first target-level detection result of the target. The data acquired by the second sensor is processed by the second perception algorithm to output the second target-level detection result of the target. The data acquired by the third sensor is processed by the third perception algorithm to output the third target-level detection result of the target. The first target level detection result, the second target level detection result and the third target level detection result are then fused. For the fusion of object-level detection results, each sensor independently processes the generated object data. Each sensor has its own independent perception. For example, lidar has the perception of lidar, camera has the perception of camera, and millimeter-wave radar will also make its own perception. After all sensors complete the target data generation, the main processor performs data fusion. As shown in Fig. 1b, it is assumed that there are multiple sensors, namely the first sensor, the second sensor and the third sensor. In the feature-level fusion scenario, there is only one perception algorithm that perceives the fused multi-dimensional comprehensive data. Since there is only one perception algorithm, the data acquired by each sensor needs to be synchronized in time and space. Among them, the synchronization of time is to ensure that the data collected by different sensors are synchronized in time, and the synchronization of space is to convert the measurement values of different sensors to the same coordinate system based on their respective coordinate systems, that is, the coordinate system Unite.

Although multi-sensor data fusion has not formed a complete theoretical system and effective fusion algorithm, many mature and effective fusion methods have been proposed in many application fields according to their specific application backgrounds. The common methods of multi-sensor data fusion can be basically summarized into two categories: random and artificial intelligence. The random methods include weighted average method, Kalman filter method, multi-Bayesian estimation method, evidence inference, production rules, etc.; The intelligent category includes fuzzy logic theory, neural network, rough set theory, expert system and so on.

It should be noted that to accurately fuse multi-sensor information, one-to-one matching of targets in different sensors must be performed, that is, multi-sensor target matching. After completing the multi-sensor target matching, the accurate information of the target can be obtained through fusion. This application also refers to the matching of the target as the association of sensor output data, and the two represent the same meaning unless the difference between the two is emphasized. The solution provided in this application focuses on how to ensure the correctness or accuracy of the correlation between the target-level detection results of heterogeneous sensors, so as to obtain better data fusion results, that is, to better ensure the subsequent output of robust detection results . It is generally believed that if the detection performance of a heterogeneous sensor in a certain dimension (or function) is good, the accuracy of the correlation between the two is generally higher, which will be described below with reference to FIG. 2 . As shown in Figure 2, it is a schematic diagram of the detection performance of heterogeneous sensors in different dimensions. Figure 2 shows 3 kinds of sensors, camera, millimeter wave radar, lidar, and the detection performance of these three kinds of sensors in 7 different dimensions, the 7 different dimensions are target detection, target recognition, distance measurement, object Edge detection, lane tracking, inclement weather and dark or heavily exposed functions. As can be seen from Figure 2, both millimeter-wave radar and lidar have good detection performance in target detection. For the function of target detection, the accuracy of data association between millimeter-wave radar and lidar will be higher. For another example, for the function of object edge detection, the accuracy of data association between millimeter-wave radar and lidar will be low. In addition, it can be seen from Figure 2 that for these seven functions, the detection performance of the camera and the millimeter-wave radar cannot be good in a certain dimension, so the accuracy of the data association between the camera and the millimeter-wave radar is usually low. But at the same time, the complementary effect of the measurement characteristics of the camera and the millimeter-wave radar is also very good. Therefore, how to correlate the detection results of cameras and millimeter-wave radars is of great significance.

The solution provided by this application needs to correlate the output data of heterogeneous sensors through a neural network. The following will involve a lot of knowledge related to neural networks. knowledge is introduced. It should be noted that the solution provided in this application does not limit the type of neural network, and any neural network that can be used for target detection can be used in the embodiments of this application.

Since the convolutional neural network (CNN) is a deep neural network with a convolutional structure, it is a deep learning (deep learning) architecture. Learning at multiple levels at the level of abstraction. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron responds to overlapping regions in images fed into it.

As shown in FIG. 3 , a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .

Convolutional layer/pooling layer 120:

Convolutional layer:

As shown in FIG. 3, the convolutional/pooling layer 120 may include layers 121-126 as examples. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a convolutional layer. Layers are pooling layers, 125 are convolutional layers, and 126 are pooling layers; in another implementation, 121 and 122 are convolutional layers, 123 are pooling layers, 124 and 125 are convolutional layers, and 126 are pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.

Taking the convolution layer 121 as an example, the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator can be essentially a weight matrix. This weight matrix is usually pre-defined. In the process of convolving an image, the weight matrix is usually pixel by pixel along the horizontal direction on the input image ( Or two pixels after two pixels...depending on the value of stride), which completes the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix will be extended to Enter the entire depth of the image. Therefore, convolution with a single weight matrix will produce a single depth dimension of the convolutional output, but in most cases a single weight matrix is not used, but multiple weight matrices of the same dimension are applied. The output of each weight matrix is stacked to form the depth dimension of the convolutional image. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image. Perform fuzzification... The dimensions of the multiple weight matrices are the same, and the dimension of the feature maps extracted from the weight matrices with the same dimensions are also the same, and then the multiple extracted feature maps with the same dimensions are combined to form the output of the convolution operation .

The weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.

When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (for example, 121) often extracts more general features, which can also be called low-level features; with the convolutional neural network As the depth of the network 100 deepens, the features extracted by the later convolutional layers (eg 126) become more and more complex, such as features such as high-level semantics. Features with higher semantics are more suitable for the problem to be solved.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolutional layer, that is, each layer 121-126 exemplified by 120 in Figure 3, which can be a convolutional layer followed by a layer The pooling layer can also be a multi-layer convolutional layer followed by one or more pooling layers. During image processing, the only purpose of pooling layers is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator can calculate the average value of the pixel values in the image within a certain range. The max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.

Neural network layer 130:

After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not sufficient to output the required output information. Because as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 100 needs to utilize the neural network layer 130 to generate one or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and the output layer 140, and the parameters contained in the multiple hidden layers may be based on specific task types The relevant training data is pre-trained, for example, the task type can include image recognition, image classification, image super-resolution reconstruction and so on.

After the multi-layer hidden layers in the neural network layer 130, that is, the last layer of the entire convolutional neural network 100 is the output layer 140, the output layer 140 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error, Once the forward propagation of the entire convolutional neural network 100 (as shown in Fig. 3 from 110 to 140 is forward propagation) is completed, the back propagation (as shown in Fig. 3 from 140 to 110 as back propagation) will start to update The weight values and biases of the aforementioned layers are used to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.

It should be noted that the convolutional neural network 100 shown in FIG. 3 is only used as an example of a convolutional neural network. In a specific application, the convolutional neural network can also exist in the form of other network models, for example, such as The multiple convolutional layers/pooling layers shown in FIG. 4 are in parallel, and the extracted features are input to the neural network layer 130 for processing.

In a preferred embodiment, the neural network of the present application may adopt an efficient regional convolutional neural network (faster regions with convolution neural network, Faster-RCNN). The Faster RCNN target detection algorithm is a typical target detection algorithm. In the algorithm, for an input image, a multi-layer convolution layer is used to extract the basic feature map of the image. Based on the basic feature map, the Faster RCNN algorithm is used in the algorithm. The region proposal network (RPN) generates a large number of candidate boxes, and filters and filters a large number of candidate boxes, and only selects a fixed number of candidate boxes and inputs them into the next-level module; Carry out a deeper classification analysis, and finally obtain the final candidate box containing the target. It should be noted that the solution provided in this application does not generate a large number of candidate frames through the RPN network, which will be described later. The Faster RCNN will be introduced below in conjunction with Figure 5, which is a schematic diagram of an efficient regional convolutional neural network.

As shown in Figure 5, Faster RCNN can include four parts, namely convolution layer, RPN network, partition pooling layer (roi pooling), classification layer and regression network. Each of them will be described below. The convolutional layer has been introduced above. It is mainly used to extract the features of the picture. The input is the entire picture, and the output is the extracted features. The extracted features are generally called feature maps. The RPN network is used to recommend candidate regions. The input is a picture, and the output is multiple candidate regions. It should be noted that the solution provided in this application does not output candidate regions through the RPN network, which will be described later. In addition, it should be noted that in this application, a candidate region is sometimes referred to as a candidate frame, and when the difference between the two is not emphasized, the two have the same meaning. The process of roi pooling can be understood as the process of pooling candidate regions. When feature extraction is performed on the original image, the corresponding first feature map will be extracted. Then the corresponding candidate area will be mapped on the first feature map, and this mapping process is part of roi pooling. Generally, the process of max pooling is also carried out, and then a second feature map is obtained, which is sent to continue the calculation later. The second feature map is the feature map corresponding to the candidate region. The classification layer and the regression network further process the second feature map, and output the class to which the candidate region belongs and the position of the candidate region in the image through the classification layer and the regression network.

The solution provided in this application may include two parts, the "inference" process and the "training" process. They are introduced separately below.

First, the reasoning process - goal determination method.

FIG. 6 is a schematic flowchart of a method for determining a target according to an embodiment of the present application.

As shown in FIG. 6 , a target determination method provided by an embodiment of the present application may include the following steps:

601. Acquire an image to be processed.

The solution provided in this application can be applied to various scenarios, specifically, the method shown in FIG. 6 can be applied to scenarios such as the field of automatic driving and the field of monitoring.

When the method shown in FIG. 6 is applied to the scene in the field of automatic driving, the image to be processed in step 601 may be an image obtained by the vehicle through a visual sensor, and specifically, the image to be processed may be captured by the vehicle through a camera installed on the vehicle Image.

When the method shown in FIG. 4 is applied to a monitoring scene, the image to be processed in step 601 may be an image acquired by a visual sensor installed on the roadside, specifically, the image to be processed may be an image captured by a camera installed on the roadside.

The solution provided in this application can obtain images to be processed through a visual sensor. It should be noted that this application also sometimes refers to a visual sensor as a camera. When the difference between the two is not emphasized, the two have the same meaning.

In one possible implementation, the vision sensor may include a lens and an image sensor. The optical image generated by the lens is projected onto the image sensor, and the image sensor converts it into an electrical signal, and then through the analog-to-digital (A/D) conversion and other processing processes, the image to be processed is obtained. The visual sensor can be in any of the following specific forms, for example, a camera, a video camera, a camera, a scanner, or other devices with a camera function (for example, a mobile phone, a tablet computer, etc.).

602. Acquire multiple millimeter wave detection points.

The multiple millimeter-wave detection points and the images to be processed are data obtained synchronously. Each millimeter-wave detection point includes depth information, and the depth information is used to indicate the distance between the detection target and the millimeter-wave radar. The detection target can be any target such as vehicles, people, trees, etc. The data acquired synchronously can be understood as the millimeter-wave radar and the image sensor simultaneously collect data, or it can be understood as the deviation of the frame rate of the data collected by the millimeter-wave radar and the image sensor within a preset range. For example, for the same detection target, the millimeter-wave radar collects the millimeter-wave detection points according to the first frame rate, and the image sensor collects the image to be processed according to the second frame rate, and the deviation between the first frame rate and the second frame rate is less than the preset threshold, namely It can be considered that the millimeter-wave radar and the image to be processed are data acquired synchronously. The millimeter-wave radar emits high-frequency millimeter waves, which are collected by the receiving system after being reflected by the target, and the distance to the target is determined by frequency measurement, thereby forming multiple millimeter-wave detection points.

When the method shown in FIG. 6 is applied to the scene in the field of automatic driving, the millimeter wave detection point in step 602 may be the data obtained by the millimeter wave radar installed on the vehicle.

When the method shown in FIG. 6 is applied in a monitoring scenario, the millimeter-wave detection point in step 602 may be data obtained by a millimeter-wave radar on a monitoring device installed on the road.

In this application, depth information is also sometimes referred to as distance information, and both represent the distance between the target acquired by the millimeter-wave radar and the millimeter-wave radar when the difference between the two is not emphasized.

603. Map the multiple millimeter wave detection points to the image to be processed.

The present application can implement the mapping of multiple millimeter wave detection points to the image to be processed in various ways. By way of example, a method of mapping multiple millimeter wave detection points to the image to be processed is given below. It should be noted that the embodiments of the present application can all be used in the related art in which a plurality of millimeter wave detection points can be mapped onto the image to be processed.

Mapping multiple millimeter-wave detection points onto the image to be processed, that is, spatial fusion of the millimeter-wave radar and visual sensor data. Specifically, the millimeter-wave detection points acquired by the millimeter-wave radar can be mapped through the unification of the coordinate system. to the image to be processed acquired by the vision sensor. The millimeter-wave detection point determined by the millimeter-wave radar and the target determined by the vision sensor must be in the same coordinate system for better correlation and matching.

Assume that the visual sensor coordinate system is (Xc, Yc, Zc), the millimeter-wave radar coordinate system is (Xr, Yr, Zr), and the three-dimensional world coordinate system is (Xw, Yw, Zw).

The coordinate system where the millimeter-wave radar is located can be used as the benchmark to set the coordinate system where the millimeter-wave radar is located to coincide with the world coordinate system, which can be expressed by the following formula:

Map the image data in the vision sensor coordinate system to the world coordinate system to obtain the coordinates of the image data in the vision sensor coordinate system in the world coordinate system, which can be expressed by the following formula:

f represents the focal length of the visual sensor, (u0, v0) represents the principal point of the visual sensor, dx, dy represent the pixel unit size of the visual sensor in the x and y directions, respectively, [-a,-b,0] ^T represents the visual sensor The translation vector between the millimeter-wave radar and the installation position of the millimeter-wave radar, and θ represents the rotation angle between the millimeter-wave radar and the vision sensor. According to the above formula 1-1 and formula 1-2, the coordinates of the millimeter-wave radar can be converted into the coordinates of the vision sensor, and the detection points of the millimeter-wave radar can be mapped to the image to be processed.

604. Determine multiple candidate frames of the image to be processed according to the depth information and position information of each millimeter wave detection point.

The location information is used to indicate where each mmWave detection point is mapped to the image to be processed.

A set of candidate frames can be determined according to the depth information and position information of each millimeter wave detection point, and the set of candidate frames includes multiple candidate frames. The following describes how to determine the candidate frame according to the depth information and the position information, respectively.

In the solution provided in this application, the size of the candidate frame is determined according to the depth information of the millimeter wave detection point. The solution provided in this application uses the principle of pinhole imaging, that is, the closer the object distance is, the larger the image will be, and the farther the object distance will be, the smaller the image will be. According to the principle of pinhole imaging, the larger the depth information of the millimeter wave detection point, the smaller the size of the candidate frame, and the smaller the depth information of the millimeter wave detection point, the larger the size of the candidate frame. In addition, the solution provided in this application can set a priori box. Multiple a priori frames can be set. Specifically, multiple areas with different sizes or aspect ratios can be set as a priori frames. The candidate frame is based on these a priori frames, which reduces the difficulty of training to a certain extent. The size of the prior frame may be determined according to the size of the preset category. For example, the solution provided in this application can identify three categories, namely trucks, cars and buses, and can obtain the average size of trucks, the average size of cars and the average size of buses through a large number of statistical data. Then for each millimeter-wave detection point, at least three sizes of a priori frames can be determined, and then when the size of the candidate frame is determined according to the depth information, each of the three sizes can be determined according to the depth information of the millimeter-wave detection point. The size of the prior box is adjusted.

In the solution provided in this application, the position of the candidate frame is determined according to the position of the millimeter wave detection point mapped to the image to be processed. In other words, the position of the candidate frame is determined according to the position of each millimeter wave detection point on the image to be processed. The solution provided in this application determines the position of the candidate frame according to the distribution characteristics of the millimeter wave detection points and the positions of the millimeter wave detection points on the image to be processed. Among them, the distribution characteristics of millimeter wave detection points may present different distribution characteristics in different scenarios. In a possible implementation manner, for each possible application scenario, the distribution characteristics of the millimeter wave detection points in a certain application scenario can be obtained through a large number of experimental statistics, which are described below with an example. Assuming that the solution provided in this application is applied in the field of automatic driving, it is necessary to obtain the distribution characteristics of millimeter wave detection points on the vehicle. For example, a vehicle can be placed in a clean background environment (a clean background environment can be understood as in addition to the vehicle in the scene, minimizing other objects around), the millimeter-wave radar emits high-frequency millimeter waves, which are reflected by the vehicle and collected by the receiving system , to get one statistic. The high-frequency millimeter waves are transmitted multiple times through the millimeter-wave mine, and multiple data are counted for the vehicle, or different vehicles can be replaced, or different numbers of vehicles can be added, and multiple statistics can be performed to obtain the millimeter-wave detection points on the vehicle. distribution characteristics. For another example, according to the application requirements of different scenarios, the distribution characteristics of millimeter wave detection points on people can be obtained, or the distribution characteristics of millimeter wave detection points on animals can also be obtained, or the distribution characteristics of millimeter wave detection points on goods ( such as the distribution characteristics on the shipping box).

In order to better understand how the solution provided in this application determines the candidate frame according to the depth information and the position information, the following takes an automatic driving scenario as an example to illustrate.

7a to 7c are schematic diagrams of application scenarios of a target determination method provided by the present application. As shown in Figure 7a, it is a schematic diagram of the acquired millimeter wave detection points when the detection target is a vehicle. Each millimeter-wave detection point contains the distance between the target and the millimeter-wave radar. Some of these points are due to noise generated by multipath reflection or ray tracing, but these points also contain distance information. As shown in FIG. 7b, how to determine the candidate frame according to the position information is described by taking a millimeter wave detection point in FIG. 7a as an example. It should be noted that the principle of the process of determining the candidate frame according to the position information of each millimeter wave detection point is the same, and the description thereof will not be repeated one by one. As shown in Figure 7b, it is assumed that the relationship between the vehicle and the millimeter-wave detection points is determined according to the distribution characteristics of the millimeter-wave detection points on the vehicle, and the millimeter-wave detection point is generally located in the lower left corner of the target vehicle. The lower left corner of the inspection frame determines a plurality of a priori frames, wherein the number of the plurality of a priori frames is determined according to the category of the pre-specified target. As shown in Figure 7b, it is assumed that there are three categories in advance, namely the first category, the second category and the third category, and the average size of the first category is obtained through a large number of statistical data as the first size, the second category The average of the sizes is the second size, and the average of the sizes of the third category is the third size. Then, for the millimeter wave detection point shown in Fig. 7b, a total of three prior frames of different sizes can be obtained. Another example is shown in 7c. Assuming that the relationship between the vehicle and the millimeter-wave detection point is determined according to the distribution characteristics of the millimeter-wave detection point on the vehicle, the millimeter-wave detection point is generally located below the target vehicle, and the millimeter-wave detection point can be a priori. The bottom of the box determines a number of prior boxes. In a possible implementation manner, a plurality of a priori frames may be determined by the central position of the millimeter wave detection point at the lower side of the a priori frame. The number and size of the a priori frames can be understood according to the description in FIG. 7a. Repeat the description again. In a possible implementation, it is assumed that the relationship between the vehicle and the millimeter-wave detection points is determined according to the distribution characteristics of the millimeter-wave detection points on the vehicle, and the millimeter-wave detection points are generally located below the target vehicle, then as shown in Figure 7d, it is possible to A plurality of a priori frames are determined at the middle positions of the millimeter wave detection points below the prior frame.

A number of a priori boxes are determined at the lower left corner of the prior box with the millimeter wave detection point. It can be seen from Figures 7c to 7d that the solution provided by the present application can determine multiple prior frames according to the distribution characteristics of the millimeter wave detection points on a certain target, such as the distribution characteristics on a vehicle. It should be noted that, in FIGS. 7a to 7d , multiple a priori frames are determined with the millimeter wave detection point at the lower left of the prior frame, and multiple a priori frames are determined with the millimeter wave detection point at the lower left corner of the prior frame. It is only a preferred solution of the solution provided in this application. In some possible implementations, other methods of determining the a priori frame may also be selected according to the distribution characteristics of the millimeter wave detection points on the target. frame, or a plurality of candidate frames can be determined at any position on the left of the prior frame according to the millimeter wave detection point. The present application determines the position of the candidate frame according to the distribution characteristics of the millimeter wave detection points on the target, so that the target can be better selected, in other words, the millimeter wave detection point can be better associated with the position of the target.

In a possible implementation, when the millimeter-wave radar is installed at the position of the front bumper of an autonomous vehicle, the application obtains through a large number of experiments that the millimeter-wave detection points are mostly distributed on the bottom and sides of the vehicle.

As shown in FIG. 7e , it is a schematic diagram of an application scenario provided by an embodiment of the present application. As shown in Fig. 7e, taking two millimeter-wave detection points as an example, the determination of the candidate frame according to the depth information of the millimeter-wave detection points is described. As shown in Figure 7e, assuming that the depth information of millimeter-wave detection point A is smaller than that of millimeter-wave detection point B, the size of the candidate frame determined according to millimeter-wave detection point A should be smaller than the size of the candidate frame determined according to millimeter-wave detection point B size of. Regarding the negative correlation between the depth information and the size of the candidate frame, it can be understood with reference to the principle of pinhole imaging, which will not be described in this application.

605. Perform non-maximun suppression (NMS) processing on the plurality of candidate frames according to the depth information, so as to output the target frame and the target millimeter wave detection point corresponding to the detection target in the target frame.

In other words, NMS processing is performed on multiple candidate frames according to the depth information to output the target frame and the target millimeter wave detection point, and the target frame is determined according to the depth information and position information of the target millimeter wave detection point. This application takes one target as an example for description, but it should be noted that when there are multiple targets, the solutions provided in this application are also applicable.

Non-maximum suppression is the suppression of elements that are not maximal. The main purpose of this method is to reduce the number of candidate frames. In step 604, a large number of candidate frames can be determined according to the depth information and position information of the millimeter wave detection points. Probability value, each candidate frame will also correspond to the depth value of a millimeter wave detection point. The redundant candidate frame can be removed by the NMS method to determine the final candidate frame. It should be noted that this application sometimes refers to the final candidate frame as the target frame. When the difference between the two is not emphasized, both of them represent the frame output after being processed by the NMS method, and the frame is used to represent the location of the target.

The input of NMS is N candidate boxes that have been sorted according to the score from high to low. When there are multiple targets, multiple candidate boxes will be output. For example, M candidate boxes with the highest score and unsuppressed will be output, among which N is a positive integer greater than M, where the score of the candidate box is determined according to the depth information. For example, suppose the target includes 3 categories, namely the first category, the second category and the third category. It is assumed that the probability distribution between the size of the target corresponding to the first category and the depth information is determined through a large number of statistics or through neural network learning (assumed to be the A probability distribution), and the size of the target corresponding to the second category and the depth information. The probability distribution (assumed to be the B probability distribution), the probability distribution between the size of the target corresponding to the third category and the depth information (assumed to be the C probability distribution), then according to the depth information of the target in the candidate frame and the A probability distribution, B The relationship between the probability distribution and the C probability distribution can determine the probability that the target in the candidate box belongs to a certain category.

In order to better understand the solution provided by this application, an example is given below for performing NMS processing on multiple candidate frames according to depth information. Assume that there are 6 candidate frames. For each category, the probability of each candidate frame belonging to the category is sorted according to the depth information. Suppose, for a certain category, the probability of belonging to the category from small to large is A< B<C<D<E<F. Then start from the maximum probability candidate frame F, and judge whether the intersection over union (IOU) of candidate frames A, B, C, D, E and F is greater than a certain threshold. IOU can be used to represent two The degree of overlap of the candidate boxes. Assuming that the overlapping degree of candidate frames B, D and F exceeds the threshold, then the candidate frames B and D are discarded, and the first candidate frame F is marked, which is retained. From the remaining candidate frames A, C, and E, select the candidate frame E with the highest probability, and then judge the degree of overlap between the candidate frames A, C, and E. If the degree of overlap is greater than a certain threshold, then discard it; and mark the candidate frame E. is the second candidate box we keep. Repeat this process to find all the remaining candidate boxes, which are the final candidate boxes. And output the millimeter wave detection point corresponding to the detection target in the final candidate frame.

It can be seen from the embodiment corresponding to FIG. 6 that multiple candidate frames of the image to be processed are determined by the position information and depth information of the millimeter wave detection points, and NMS processing is performed on the multiple candidate frames according to the depth information. When the final candidate frame is determined. , the millimeter wave detection points associated with the candidate frame can be output to improve the accuracy of target matching.

It can be seen from the embodiment corresponding to FIG. 6 that NMS processing may be performed on multiple candidate frames according to depth information, and in some possible implementations, NMS processing may also be performed on multiple candidate frames according to depth information combined with other information. In addition, there may also be various ways to determine the probability distribution between the depth information and a certain category. Based on the embodiment corresponding to FIG. 6 , the embodiment corresponding to FIG. 6 is further refined or expanded below.

FIG. 8 is a schematic flowchart of another target determination method provided by an embodiment of the present application.

As shown in FIG. 8 , another target determination method provided by this embodiment of the present application may include the following steps:

801. Acquire an image to be processed.

802. Acquire multiple millimeter wave detection points.

803. Map the multiple millimeter wave detection points to the image to be processed.

804. Determine multiple candidate frames of the image to be processed according to the depth information and position information of each millimeter wave detection point.

Steps 801 to 804 can be understood with reference to steps 601 to 604 in the embodiment corresponding to FIG. 6 , and details are not repeated here.

805. Perform NMS processing on the multiple candidate frames according to the first score and the second score.

The first score represents the probability determined by the classifier that the detection target in each candidate frame belongs to each of the N categories, where the N categories are preset categories, and N is a positive integer. The second score represents the probability that the detection target in each candidate frame belongs to each of the N categories, determined according to the depth information and the first probability distribution between each category.

In a possible implementation, the input of the NMS is N candidate boxes that have been sorted from high to low score, and the output M candidate boxes with the highest score and not suppressed, where N is a positive integer greater than M, The score of the candidate frame is determined according to the product of the first score and the second score.

The embodiment of the present application does not limit the type of the classifier. The classifier will score each input candidate box. The higher the score, the higher the probability that there is a target of the corresponding category in the candidate box. Regarding the determination of the score of each candidate frame according to the classifier in the related art, the embodiments of the present application can all be adopted. For the candidate frames processed by the regression network, if NMS processing is performed on multiple candidate frames only according to the first score, the situation shown in Figure 9a may occur, that is, there are a large number of repetitions and interferences in the results. By introducing the depth information of the millimeter wave detection point, the method adds a dimension to determine the millimeter wave detection point corresponding to the candidate area, so as to improve the accuracy of the association and also improve the accuracy of the target detection. As shown in FIG. 9a , assuming that the candidate frames are sorted from high to low according to the first score, there may be 3 millimeter wave detection points associated with the final output candidate frame after processing according to the first NMS. By comparing the depth information of the three millimeter-wave detection points, it is assumed that the depth information of the millimeter-wave detection point A has the highest probability of corresponding to the category corresponding to the candidate frame, then after NMS processing, as shown in Figure 9b, the final output Candidate boxes and B mmWave detection points. The description in this paragraph is for the convenience of understanding that after the introduction of depth information, the accuracy of the association can be improved. In a possible implementation, the score of each candidate frame may be determined according to the following formula, that is, the score of each candidate frame may be determined according to the first score and the second score. The input of NMS is N candidate boxes sorted from high to low according to the scores determined by the first score and the second score, and the M final candidate boxes with the highest scores and no suppression are output. The score of each candidate box can be expressed by the following formula:

score=p(depth)=∑ _classes p(depth, classes)=∑ _classes p(depth|classes)p(class)

p(class)=softmax(calsses)

p(depth|class)～N(mean _{height(class)} ，std _{height(class)} )

Among them, score represents the score determined according to the first score and the second score, depth represents the depth information of each millimeter wave detection point, classes represents the type of the target, and the number of types is preset, which has been explained above, here It will not be repeated. p(A, B) represents the probability of A and B occurring at the same time, that is, the probability distribution between depth information and categories, p(A|B) represents the probability of A occurring under the probability of B occurring, that is, a category corresponds to The probability distribution of the depth information, mean represents the mean value, and std represents the standard deviation. N stands for Gaussian distribution.

It can be seen from the embodiment corresponding to FIG. 8 that the solution provided by this application performs NMS processing on multiple candidate frames through the first score and the second score, so as to improve the accuracy of data association, that is, to improve the one-to-one matching of the same target in different sensors. corresponding accuracy.

The following describes how to determine the probability distribution between the depth information and a certain category on the basis of the embodiments corresponding to FIG. 6 and FIG. 8 .

In a possible implementation, statistics are performed on the data in the first set to determine the probability distribution of the first size of the statistical target corresponding to each category, the first set includes a plurality of statistical targets corresponding to each category, and each Size information of a statistic object. The first probability distribution is determined according to the probability distribution of the first size and the first relationship, where the first relationship is the relationship between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target. For example, it is assumed that the first set includes 3 categories, namely trucks, cars and buses. The samples corresponding to trucks, that is, the statistical targets corresponding to trucks are 1000, the statistical targets corresponding to cars are 1000, and the samples corresponding to buses are The statistical target is 1000. For each statistical object, the statistical object includes size information. For example, assuming a statistical target A among 1000 statistical targets corresponding to a truck, A includes size information, such as A's physical size, or A's length information, or at least one of A's width information or A's height information. According to the category of each statistical target, the size information of each statistical target can obtain the probability distribution of the first size, that is, the probability distribution of the size of the statistical target under each category. As shown in FIG. 10 , it shows a schematic diagram of the probability distribution of the first size when the size information is the height information of the target. In addition, the relationship between the depth information and the size of the target can be determined through the principle of pinhole imaging. In a possible implementation, the distance between the target and the millimeter-wave radar can be adjusted multiple times to obtain the relationship between the depth information and the size of the target. Relationship. When the relationship between the depth information and the size of the target, and the probability distribution of the size of the target under each category are obtained, the probability distribution between the depth information and each category can be determined.

In a possible implementation manner, the first set may be updated, and the probability distribution between the depth information and each probability category may be determined through the updated set. For example, statistics may be performed on the data in the second set to determine each The second size distribution of the statistical objects corresponding to the category, the second size distribution is used to update the first size distribution, and the second set includes a plurality of statistical objects corresponding to each category, and size information of each statistical object. The first probability distribution is determined according to the second size distribution and the second relationship, where the second relationship is the relationship between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target.

It should be noted that, in addition to the steps described in detail above, the solution provided by the present application may further include some other steps, which are not limited in the embodiments of the present application. The following description will be given with reference to a specific embodiment.

FIG. 11 is a schematic flowchart of another target determination method provided by an embodiment of the present application.

As shown in Figure 11, the image is collected by the vision sensor, and the millimeter-wave detection points are acquired by the millimeter-wave radar, and the millimeter-wave detection points and the image frames in the video are time-aligned. In a possible implementation manner, the sampling frequency of the visual sensor is the first frequency, the sampling frequency of the millimeter wave radar is the second frequency, and the difference between the first frequency and the second frequency is not greater than a preset threshold. Map the millimeter wave detection points to the image collected by the vision sensor, and then input the image mapped with the millimeter wave detection points into the convolutional neural network. The first feature map of the image can be obtained through the convolutional neural network. The position of the point on the image and the depth information of the millimeter wave detection point are used to generate multiple candidate frames, and a second feature map corresponding to the multiple candidate frames is extracted from the first feature map. The second feature map is processed by the classification layer and the regression layer, and the second feature map processed by the classification layer and the regression layer is processed by NMS according to the depth information, so as to output the detection result of the image and the detection result associated with the detection result. Millimeter wave detection point.

FIG. 12 is a schematic diagram of an application scenario of a target determination method provided by the present application. As shown in Figure 12, the solution provided by this application can be applied in the field of automatic driving. When the solution provided by this application is applied in the field of automatic driving, objects on the road can be detected and recognized, such as detection and recognition of vehicles on the road. , which can detect the position of the vehicle within the image range obtained by the vision sensor, the type of the vehicle and the distance between the vehicle and the own vehicle. Among them, the self-vehicle refers to the vehicle on which the visual sensor is installed. As shown in Figure 12, for each target, the detection results of the vision sensor and the detection results of the millimeter wave detection points are matched and associated one by one. It should be noted that the solution provided in this application can be applied in any scenario where the target-level detection results of the vision sensor and the millimeter-wave radar need to be correlated. For example, the solution provided by this application can be applied in a monitoring scenario. When the solution provided by this application is applied in a monitoring scenario, objects in the monitoring area can be detected and identified, for example, vehicles or people in the monitoring area can be detected. And identification, for each target, the detection results of the vision sensor and the detection results of the millimeter wave detection points are matched and associated one by one.

Referring to FIG. 13 , it includes a comparison between the association result determined by the target determination method provided by the present application and the result of the first solution, which is a method without performing NMS processing according to depth information. The application can be tested through a self-built database, wherein the self-built database can include a plurality of images that should be provided with millimeter wave detection points, and the images are marked with manual classification. AP ₅₀ means that the ratio of the final candidate frame in the image and the target object to be detected coincides with 50%. It can be understood that, in the method provided by the present application, NMS processing is performed according to the first score determined by the classifier and the second score determined according to the depth information, which improves the accuracy of target detection. Therefore, it can be seen from FIG. 13 that, in the target determination method provided by the present application, the AP ₅₀ output by the convolutional neural network for recognizing pictures is significantly higher than that of the first solution. Therefore, the target determination method provided by the present application can also improve the accuracy of target detection.

Second, the training process - a model training method.

FIG. 14 is a schematic flowchart of a model training method provided by an embodiment of the present application.

As shown in FIG. 14 , a model training method provided by an embodiment of the present application may include the following steps:

1401. Obtain training data.

The training data consists of multiple training images mapped with mmWave detection points. Training images and mmWave detection points are data acquired simultaneously for the same target. In the present application, the target is sometimes referred to as the target object, and the two have the same meaning unless the difference between the two is emphasized.

The training image carries the label information of the target object, and the label information of the target object can be obtained by manual annotation. The training image is also the original image used to train the target detection model, and the label information of the target object can be understood as the ground truth (GT) used to train the target detection model.

How to map the millimeter wave detection points to the training image can be understood by referring to step 1402 in the embodiment corresponding to FIG. 6 to map multiple millimeter wave detection points to the image to be processed, which will not be repeated here.

1402. Determine multiple candidate frames of the image to be processed according to the depth information and position information of each millimeter wave detection point.

A set of candidate frames can be determined according to the depth information and position information of each millimeter wave detection point, and the set of candidate frames includes multiple candidate frames. It can be understood with reference to step 604 in the embodiment corresponding to FIG. 6 , and details are not repeated here.

1403. Train the model according to the features corresponding to the multiple candidate frames, to obtain a trained model.

The training data can be input into the model, for example, the god model can be fast R-CNN, faster R-CNN, mask region-based convolutional neural network (Mask R-CNN) and so on. The first feature map of the training data can be obtained through the model, multiple candidate frames are generated according to the position of the millimeter wave detection point on the image and the depth information of the millimeter wave detection point, and the first feature map corresponding to the multiple candidate frames is extracted from the first feature map. Two feature maps. The model may be trained according to the second feature map, and it may be determined that the model training is completed until the loss function of the model converges.

The flow of the target determination method and the model training method provided by the present application has been introduced in detail. The target determination device and the model training device provided by the present application are described below based on the aforementioned target determination method and model training method. The target determination device uses In executing the steps of the method corresponding to the foregoing FIGS. 6-12 , the model training apparatus is configured to execute the steps of the foregoing method corresponding to FIG. 14 .

Referring to FIG. 15 , a schematic structural diagram of a target determination device provided by the present application. The target determination device includes:

The acquisition module 1501 is configured to acquire an image to be processed and multiple millimeter wave detection points, where the image to be processed and the multiple millimeter wave detection points are data obtained synchronously for the same target. Among them, the image to be processed can be acquired by a vision sensor, and the millimeter wave detection point can be acquired by a millimeter wave radar. Each millimeter-wave detection point may include depth information, where the depth information is used to represent the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to acquire multiple millimeter-wave detection points. The mapping module is configured to map the multiple millimeter wave detection points acquired by the acquisition module 1501 to the image to be processed acquired by the acquisition module 1501 . The processing module 1502 is used to determine a plurality of candidate frames of the detection target on the image to be processed according to the first information. The first information may include depth information and position information of each millimeter wave detection point, and the position information is used to represent each millimeter wave. The location of the wave detection point mapped on the image to be processed. The processing module 1502 is also used to perform non-maximum suppression NMS processing on multiple candidate frames according to the depth information, so as to output the target frame and the target millimeter wave detection point (that is, output the target-level detection of the same detection target and different sensors). result), that is, output the association result. The target frame is determined according to the depth information and position information of the target millimeter wave detection point.

In a possible implementation manner, the processing module 1502 is specifically configured to: perform non-maximum suppression NMS processing on multiple candidate frames according to a first score and a second score, where the first score represents each The probability that the detection target in the candidate box belongs to each of the N categories, N categories are preset categories, N is a positive integer, and the second score represents the first probability between each category according to the depth information The probability that the detection target in each candidate box belongs to each of the N categories, determined by the distribution.

In a possible implementation manner, the target determination device may further include a statistics module 1503, the statistics module 1503 is configured to perform statistics on the data in the first set, and determine the probability distribution of the first size of the statistical target corresponding to each category, The first set may include a plurality of statistical objects corresponding to each category, and size information of each statistical object. The first probability distribution is determined according to the probability distribution of the first size and the first relationship, where the first relationship is the relationship between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target.

In a possible implementation manner, the statistics module 1503 is further configured to: perform statistics on the data in the second set, and determine the probability distribution of the second size of the statistical target corresponding to each category, and the second set may include each category Corresponding multiple statistical targets, and size information of each statistical target. The second probability distribution is determined according to the second size distribution and the second relationship, the second probability distribution is used to update the first probability distribution, and the second relationship is the difference between the size of the statistical target and the depth information of the millimeter wave detection point corresponding to the statistical target relationship between.

In a possible implementation, the size information is height information of the statistical target.

In a possible implementation, the position information is used to determine the position of the candidate frame in the image to be processed in combination with the distribution characteristics of the millimeter wave detection points on the vehicle.

In a possible implementation, the depth information is used to determine the size of the candidate frame, and the size of the candidate frame is negatively correlated with the depth information.

In a possible implementation manner, the processing module 1502 is further configured to: perform convolution processing on the image to be processed to obtain a first feature map of the image to be processed. The second feature maps corresponding to the plurality of candidate frames are extracted from the first feature map. The second feature map is processed through a regression network and a classifier to obtain a first result, and the first result is used for non-maximum suppression NMS processing.

Referring to FIG. 16 , a schematic structural diagram of a model training apparatus provided by the present application. The model training device includes:

The acquiring module 1601 is configured to perform step 1401 in the embodiment corresponding to FIG. 14 .

The training module 1602 is configured to perform steps 1402 and 1403 in the embodiment corresponding to FIG. 14 .

Please refer to FIG. 17 , which is a schematic structural diagram of another target determination apparatus provided by the present application, as described below.

The target determination apparatus may include a processor 1701 and a memory 1702 . The processor 1701 and the memory 1702 are interconnected by wires. Among them, the memory 1702 stores program instructions and data.

The program instructions and data corresponding to the steps in FIG. 6 or FIG. 8 are stored in the memory 1702 .

The processor 1701 is configured to perform the method steps performed by the target determination apparatus shown in any of the foregoing embodiments in FIG. 6 or FIG. 8 .

Embodiments of the present application also provide a computer-readable storage medium, where a program for generating a vehicle's running speed is stored in the computer-readable storage medium, and when the computer is running on a computer, the computer is made to execute the program shown in FIG. 6 or FIG. 8 above. The illustrated embodiment describes the steps in the method.

Optionally, the aforementioned target determination device shown in FIG. 17 is a chip.

The embodiments of the present application also provide a digital processing chip. The digital processing chip integrates circuits and one or more interfaces for implementing the above-mentioned processor 1701 or the functions of the processor 1701 . When a memory is integrated in the digital processing chip, the digital processing chip can perform the method steps of any one or more of the foregoing embodiments. When the digital processing chip does not integrate the memory, it can be connected with the external memory through the communication interface. The digital processing chip implements the actions performed by the target determination device in the above embodiment according to the program codes stored in the external memory.

An embodiment of the present application also provides a computer program product, which, when driving on a computer, causes the computer to execute the steps performed by the target determination device in the method described in the embodiment shown in FIG. 6 or FIG. 8 .

The target determination apparatus provided in this embodiment of the present application may be a chip, and the chip includes: a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the server executes the target determination method described in the embodiment shown in FIG. 6 or FIG. 8 . Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as only Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (RAM), etc.

Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), a digital signal processing digital signal processor (DSP), application specific integrated circuit (ASIC) or field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or it may be any conventional processor or the like.

Specifically, please refer to FIG. 18. FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the application. The chip may be represented as a neural network processor NPU 180, and the NPU 180 is mounted as a co-processor to the main CPU (Host CPU), tasks are allocated by the Host CPU. The core part of the NPU is the arithmetic circuit 1803, which is controlled by the controller 1804 to extract the matrix data in the memory and perform multiplication operations.

In some implementations, the arithmetic circuit 1803 includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit 1803 is a two-dimensional systolic array. The arithmetic circuit 1803 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1803 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to the matrix B from the weight memory 1802 and buffers it on each PE in the operation circuit. The arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1801 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1808 .

Unified memory 1806 is used to store input data and output data. The weight data is directly passed through the storage unit access controller (direct memory access controller, DMAC) 1805, and the DMAC is transferred to the weight memory 1802. Input data is also moved to unified memory 1806 via the DMAC.

A bus interface unit (BIU) 1810 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1809.

The bus interface unit 1810 (bus interface unit, BIU) is used for the instruction fetch memory 1809 to obtain instructions from the external memory, and also for the storage unit access controller 1805 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1806 , the weight data to the weight memory 1802 , or the input data to the input memory 1801 .

The vector calculation unit 1807 includes a plurality of operation processing units, and if necessary, further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. It is mainly used for non-convolutional/fully connected layer network computations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.

In some implementations, the vector computation unit 1807 can store the processed output vectors to the unified memory 1806 . For example, the vector calculation unit 1807 may apply a linear function and/or a non-linear function to the output of the operation circuit 1803, such as linear interpolation of the feature plane extracted by the convolution layer, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 1807 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as activation input to the arithmetic circuit 1803, such as for use in subsequent layers in a neural network.

The instruction fetch buffer (instruction fetch buffer) 1809 connected to the controller 1804 is used to store the instructions used by the controller 1804;

The unified memory 1806, the input memory 1801, the weight memory 1802 and the instruction fetch memory 1809 are all On-Chip memories. External memory is private to the NPU hardware architecture.

The operation of each layer in the RNN can be performed by the operation circuit 1803 or the vector calculation unit 1807 .

Wherein, the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the program of the method in FIG. 6 or FIG. 8 .

In addition, it should be noted that the device embodiments described above are only schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be A physical unit, which can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. In addition, in the drawings of the device embodiments provided in the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary general-purpose hardware. Special components, etc. to achieve. Under normal circumstances, all functions completed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also be various, such as analog circuits, digital circuits or special circuit, etc. However, a software program implementation is a better implementation in many cases for this application. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that make contributions to the prior art. The computer software products are stored in a readable storage medium, such as a floppy disk of a computer. , U disk, mobile hard disk, read only memory (ROM), random access memory (RAM), disk or CD, etc., including several instructions to make a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present application.

In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be downloaded from a website site, computer, server, or data center Transmission to another website site, computer, server, or data center is by wire (eg, coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a server, data center, etc., which includes one or more available media integrated. The usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.

The terms "first", "second" and the like in the description and claims of the present application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that data so used may be interchanged under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. The term "and/or" in this application is only an association relationship to describe associated objects, which means that there can be three kinds of relationships, for example, A and/or B, which can mean that A exists alone, A and B exist at the same time, independently There are three cases of B. In addition, the character "/" in this article generally indicates that the related objects before and after are an "or" relationship. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion, eg, a process, method, system, product or device comprising a series of steps or modules not necessarily limited to those expressly listed Rather, those steps or modules may include other steps or modules not expressly listed or inherent to these processes, methods, products or apparatus. The naming or numbering of the steps in this application does not mean that the steps in the method flow must be executed in the time/logical sequence indicated by the naming or numbering, and the named or numbered process steps can be implemented according to the The technical purpose is to change the execution order, as long as the same or similar technical effects can be achieved. The division of modules in this application is a logical division. In practical applications, there may be other divisions. For example, multiple modules may be combined or integrated into another system, or some features may be ignored. , or not implemented, in addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some ports, and the indirect coupling or communication connection between modules may be electrical or other similar forms. There are no restrictions in the application. In addition, the modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed into multiple circuit modules, and some or all of them may be selected according to actual needs. module to achieve the purpose of the solution of this application.

Claims

A method for determining a target, comprising:

Obtain an image to be processed and multiple millimeter wave detection points, the image to be processed and the multiple millimeter wave detection points are data obtained synchronously for the same detection target, each of the millimeter wave detection points includes depth information, and the The depth information is used to indicate the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to obtain the plurality of millimeter-wave detection points;

mapping the plurality of millimeter wave detection points to the to-be-processed image;

A plurality of candidate frames of the detection target on the to-be-processed image are determined according to first information, where the first information includes the depth information and position information of each of the millimeter wave detection points, and the position information uses to represent the position of each millimeter wave detection point mapped on the image to be processed;

Perform non-maximum suppression NMS processing on the plurality of candidate frames according to the depth information to output a target frame and a target millimeter wave detection point, where the target frame is based on the depth information of the target millimeter wave detection point and the location information is determined.
The method according to claim 1, wherein the performing non-maximum value suppression NMS processing on the plurality of candidate frames according to the depth information comprises:

Perform non-maximum suppression NMS processing on the plurality of candidate frames according to a first score and a second score, where the first score indicates that the detection target in each candidate frame determined by the classifier belongs to each of the N categories The probability of the number of categories, the N categories are preset categories, the N is a positive integer, the second score represents the probability distribution between the depth information and the first probability distribution between the categories The probability that the detection target in each candidate box belongs to each of the N classes.
The method according to claim 2, wherein the method further comprises:

Statistics are performed on the data in the first set to determine the probability distribution of the first size of the statistical target corresponding to each category, and the first set includes a plurality of statistical targets corresponding to each category, and each Describe the size information of the statistical target;

The first probability distribution is determined according to the probability distribution of the first size and a first relationship, where the first relationship is a relationship between the size of the statistical target and the depth information of the millimeter-wave detection point corresponding to the statistical target relation.
The method according to claim 3, wherein the method further comprises:

Statistics are performed on the data in the second set to determine the probability distribution of the second size of the statistical target corresponding to each category, and the second set includes a plurality of statistical targets corresponding to each category, and each Describe the size information of the statistical target;

A second probability distribution is determined according to the second size distribution and a second relationship, the second probability distribution is used to update the first probability distribution, and the second relationship is the size of the statistical target and the statistical target The relationship between the depth information of the corresponding mmWave detection points.
The method according to claim 3 or 4, wherein the size information is height information of the statistical target.
The method according to any one of claims 1 to 5, wherein the position information is used to determine the position of the candidate frame in the image to be processed in combination with the distribution characteristics of the millimeter wave detection points on the vehicle Location.
The method according to any one of claims 1 to 6, wherein the depth information is used to determine the size of the candidate frame, and the size of the candidate frame is negatively correlated with the depth information.
The method according to any one of claims 1 to 7, wherein the method further comprises:

Process the to-be-processed image through an efficient regional convolutional neural network Faster-RCNN to obtain a first feature map of the to-be-processed image;

extracting second feature maps corresponding to the plurality of candidate frames from the first feature map;

The second feature map is processed through a regression network and a classifier to obtain a first result, and the first result is used for non-maximum suppression NMS processing.
The method according to any one of claims 1 to 8, wherein the image to be processed is acquired by a vision sensor, the sampling frequency of the vision sensor is the first frequency, and the sampling frequency of the millimeter wave radar is the first frequency Two frequencies, the difference between the first frequency and the second frequency is not greater than a preset threshold.
A device for determining a target, comprising:

The acquisition module is configured to acquire the image to be processed and multiple millimeter wave detection points, the image to be processed and the multiple millimeter wave detection points are data acquired synchronously, and each of the millimeter wave detection points includes depth information, and the The depth information is used to indicate the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to obtain the plurality of millimeter-wave detection points;

a mapping module, configured to map the plurality of millimeter wave detection points acquired by the acquisition module to the to-be-processed image acquired by the acquisition module;

a processing module, configured to determine a plurality of candidate frames of the detection target on the to-be-processed image according to first information, where the first information includes the depth information and position information of each of the millimeter wave detection points, The position information is used to indicate the position of each millimeter wave detection point mapped on the image to be processed;

The processing module is further configured to perform non-maximum suppression NMS processing on the plurality of candidate frames according to the depth information, so as to output a target frame and a target millimeter wave detection point, and the target frame is based on the target millimeter wave detection point. The depth information and the position information of the wave detection point are determined.
The target determination device according to claim 10, wherein the processing module is specifically configured to:

Perform non-maximum suppression NMS processing on the plurality of candidate frames according to a first score and a second score, where the first score indicates that the detection target in each candidate frame belongs to N categories determined according to the classifier. The probability of each category, the N categories are preset categories, the N is a positive integer, and the second score indicates that it is determined according to the first probability distribution between the depth information and each category , the probability that the detection target in each candidate box belongs to each of the N categories.
The target determination device according to claim 11, wherein the target determination device further comprises a statistics module,

The statistics module is configured to perform statistics on the data in the first set, and determine the probability distribution of the first size of the statistical target corresponding to each category, and the first set includes a plurality of statistical objects, and size information for each of said statistical objects;

The first probability distribution is determined according to the probability distribution of the first size and a first relationship, where the first relationship is a relationship between the size of the statistical target and the depth information of the millimeter-wave detection point corresponding to the statistical target relation.
The target determination device according to claim 12, wherein the statistics module is further configured to:

Statistics are performed on the data in the second set to determine the probability distribution of the second size of the statistical target corresponding to each category, and the second set includes a plurality of statistical targets corresponding to each category, and each Describe the size information of the statistical target;

A second probability distribution is determined according to the second size distribution and a second relationship, the second probability distribution is used to update the first probability distribution, and the second relationship is the size of the statistical target and the statistical target The relationship between the depth information of the corresponding mmWave detection points.
The target determination device according to claim 12 or 13, wherein the size information is height information of the statistical target.
The target determination device according to any one of claims 10 to 14, wherein the position information is used to determine the candidate frame in the image to be processed in combination with the distribution characteristics of the millimeter wave detection points on the vehicle in the location.
The target determination apparatus according to any one of claims 10 to 15, wherein the depth information is used to determine the size of the candidate frame, and the size of the candidate frame is negatively correlated with the depth information.
The target determination device according to any one of claims 10 to 16, wherein the processing module is further configured to:

Perform convolution processing on the to-be-processed image to obtain a first feature map of the to-be-processed image;

extracting second feature maps corresponding to the plurality of candidate frames from the first feature map;

The second feature map is processed through a regression network and a classifier to obtain a first result, and the first result is used for non-maximum suppression NMS processing.
The target determination device according to any one of claims 10 to 17, wherein the image to be processed is acquired by a vision sensor, the sampling frequency of the vision sensor is the first frequency, and the sampling frequency of the millimeter-wave radar is the first frequency. is the second frequency, and the difference between the first frequency and the second frequency is not greater than a preset threshold.
A smart car, characterized in that the smart car includes a processor, the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, a right is realized The method of any one of claims 1 to 9.
A monitoring device, characterized in that the monitoring device has a processor, the processor is coupled to a memory, the memory stores program instructions, and the claims are realized when the program instructions stored in the memory are executed by the processor The method of any one of 1 to 9.
A computer-readable storage medium comprising a program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 9.
A target determination system, characterized in that the target determination system includes a terminal-side device and a cloud-side device,

The end-side device is used to acquire an image to be processed and multiple millimeter-wave detection points, the image to be processed and the multiple millimeter-wave detection points are data acquired synchronously, and each of the millimeter-wave detection points includes a depth information, the depth information is used to indicate the distance between the detection target and the millimeter-wave radar, and the millimeter-wave radar is used to obtain the multiple millimeter-wave detection points;

the cloud-side device, configured to receive the to-be-processed image and multiple millimeter wave detection points sent by the terminal-side device;

The cloud-side device is further configured to map the plurality of millimeter wave detection points to the to-be-processed image;

The cloud-side device is further configured to determine a plurality of candidate frames of the detection target on the to-be-processed image according to first information, where the first information includes the depth information of each of the millimeter wave detection points and position information, the position information is used to represent the position of each millimeter wave detection point mapped on the image to be processed;

The cloud-side device is further configured to perform non-maximum suppression NMS processing on the plurality of candidate frames according to the depth information, so as to output a target frame and a target millimeter wave detection point, and the target frame is based on the target frame. The depth information and the position information of the millimeter wave detection point are determined.