CN115240168A - Perception result obtaining method and device, computer equipment and storage medium - Google Patents

Perception result obtaining method and device, computer equipment and storage medium

Info

Publication number
CN115240168A
CN115240168A (application number CN202210898481.7A)
Authority
CN
China
Prior art keywords
view
features
image
perception
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210898481.7A
Other languages
Chinese (zh)
Inventor
唐圣钦
何明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepRoute AI Ltd
Original Assignee
DeepRoute AI Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepRoute AI Ltd filed Critical DeepRoute AI Ltd
Priority to CN202210898481.7A priority Critical patent/CN115240168A/en
Publication of CN115240168A publication Critical patent/CN115240168A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects

Abstract

The application relates to a perception result obtaining method, a perception result obtaining device, a computer device, a storage medium and a computer program product. The method comprises the following steps: acquiring multi-view image data and point cloud data of the same scene; extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features; mapping the multi-view image features into image aerial view features according to the first perception result; determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data; fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic; and determining a target perception result according to the target aerial view characteristics. The method can obtain accurate sensing output results.

Description

Perception result obtaining method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of automatic driving technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for obtaining a sensing result.
Background
With the development of artificial intelligence, autonomous driving technology is also developing rapidly. However, achieving fully autonomous driving remains difficult because of the complex and dynamic driving environment. To understand the driving environment around the vehicle, an autonomous vehicle needs to be equipped with a set of sensors that perform robust and accurate environmental perception; by processing and analyzing the sensor data, information about the environment, surrounding objects (e.g., vehicles and pedestrians), and the autonomous vehicle itself is obtained.
Sensors on autonomous vehicles typically include camera sensors, lidar sensors, solid-state radar sensors, and the like. The perception model needs to perform multiple important tasks simultaneously, such as target detection, tracking, positioning, and mapping. In the conventional technology, a single model outputs the perception result of a single task, so a large number of models are required and the output perception results are inaccurate.
Disclosure of Invention
In view of the above, it is necessary to provide a sensing result obtaining method, an apparatus, a computer device, a computer readable storage medium, and a computer program product capable of outputting an accurate sensing result.
In a first aspect, the present application provides a method for obtaining a sensing result. The method comprises the following steps:
acquiring multi-view image data and point cloud data of the same scene;
extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
mapping the multi-view image features into image aerial view features according to the first perception result;
determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data;
fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and determining a target perception result according to the target aerial view characteristics.
In one embodiment, the determining the multi-view image feature according to the multi-scale image feature comprises:
extracting the same characteristics corresponding to different visual angles in the multi-scale image characteristics;
and performing weighted fusion on the same characteristics corresponding to the different visual angles to obtain the multi-visual-angle image characteristics.
In one embodiment, the determining, according to the multi-scale image feature, a first perception result corresponding to a perception task includes:
performing corresponding feature extraction on the multi-scale image features according to the type of the perception task to obtain perception features corresponding to the perception task;
and predicting the perception characteristics to obtain a first perception result corresponding to the perception task.
In one embodiment, the first perception result includes depth information; the mapping the multi-view image features into image aerial view features according to the first perception result comprises:
grouping the multi-view image features according to the depth information to obtain a multi-view image feature group; the different multi-view image feature groups correspond to different depth information intervals;
and determining the aerial view characteristics of the image according to the characteristics of the multi-view image characteristic set and the depth information interval corresponding to the multi-view image characteristic set.
In one embodiment, the determining a target perception result according to the target bird's eye view feature includes:
carrying out three-dimensional target detection on the target aerial view characteristics to obtain a second sensing result;
and determining the target perception result according to the first perception result and the second perception result.
In one embodiment, the fusing the image aerial view feature and the point cloud aerial view feature to obtain a target aerial view feature includes:
and carrying out self-adaptive weighting on the image aerial view characteristics and the point cloud aerial view characteristics to obtain the target aerial view characteristics.
In a second aspect, the present application further provides a device for obtaining sensing results. The device comprises:
the data acquisition module is used for acquiring multi-view image data and point cloud data of the same scene;
the first perception module is used for extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
the first extraction module is used for mapping the multi-view image features into image aerial view features according to the first perception result;
the second extraction module is used for determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data;
the characteristic fusion module is used for fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and the second perception module is used for determining a target perception result according to the target aerial view characteristic.
In a third aspect, the application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring multi-view image data and point cloud data of the same scene;
extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
mapping the multi-view image features into image aerial view features according to the first perception result;
determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data;
fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and determining a target perception result according to the target aerial view characteristics.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring multi-view image data and point cloud data of the same scene;
extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
mapping the multi-view image features into image aerial view features according to the first perception result;
determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data;
fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and determining a target perception result according to the target aerial view characteristic.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring multi-view image data and point cloud data of the same scene;
extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
mapping the multi-view image features into image aerial view features according to the first perception result;
determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data;
fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and determining a target perception result according to the target aerial view characteristics.
According to the perception result obtaining method, the perception result obtaining device, the computer equipment, the storage medium and the computer program product, multi-view image data and point cloud data of the same scene are obtained; extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are the features after the multi-scale image features are fused; mapping the multi-view image features into image aerial view features according to the first perception result; determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data; fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic; and determining a target perception result according to the characteristics of the target aerial view. According to the method and the device, the characteristics of the multi-view image data and the point cloud data are extracted respectively, corresponding image aerial view characteristics and point cloud aerial view characteristics are obtained, the image aerial view characteristics and the point cloud aerial view characteristics are fused to obtain target aerial view characteristics, and an accurate target perception result can be determined according to the target aerial view characteristics.
Drawings
FIG. 1 is a diagram of an application environment of a method for obtaining perceptual results in an embodiment;
FIG. 2 is a schematic flow chart of a sensing result obtaining method according to an embodiment;
FIG. 3 is a flow chart illustrating step 204 in one embodiment;
FIG. 4 is a schematic flow chart of step 204 in another embodiment;
FIG. 5 is a flow chart illustrating step 206 in one embodiment;
FIG. 6 is a flow chart illustrating step 212 in one embodiment;
FIG. 7 is a schematic flowchart of a sensing result obtaining method according to another embodiment;
FIG. 8 is a block diagram showing the structure of a sensing result obtaining apparatus according to an embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for obtaining the sensing result provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be placed on the cloud or other network server. The server 104 receives multi-view image data and point cloud data of the same scene sent by the terminal 102, extracts image features in the multi-view image data to obtain multi-scale image features, and determines the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are the features after the multi-scale image features are fused; mapping the multi-view image features into image aerial view features according to the first perception result; determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data; fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic; and determining a target perception result according to the characteristics of the target aerial view.
The terminal 102 may be an autonomous driving terminal; in one embodiment, the terminal 102 may be an unmanned vehicle or a manned vehicle with autonomous driving capability. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
It should be noted that, the embodiments of the present application may also be applied to a terminal alone, or may also be applied to a server alone, and are not limited herein.
In one embodiment, as shown in fig. 2, a method for obtaining sensing results is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps 202 to 212.
Step 202, multi-view image data and point cloud data of the same scene are acquired.
The server can acquire multi-view image data and point cloud data of the same scene sent by the terminal, and also can directly acquire the multi-view image data and the point cloud data of the same scene from the corresponding data input interface. In order to understand the driving environment around the vehicle, the automatic driving vehicle is provided with multiple sensors to collect the surrounding environment data, wherein the multiple sensors comprise a camera sensor, a laser radar sensor or a solid-state radar sensor and the like. The same scene can be understood as sensor data acquired by the sensors under the same timestamp. The multi-view image data may be image data captured from a plurality of views by a camera sensor, for example, a plurality of camera sensors are arranged around the vehicle, and the plurality of camera sensors may capture image data corresponding to a 360-degree scene around the vehicle. The point cloud data can be acquired by radar sensors such as laser radar sensors, solid-state radar sensors, ultrasonic radar sensors or millimeter wave radar sensors.
It is understood that the multi-view image data photographed by the camera sensor is two-dimensional data, and the point cloud data collected by the radar sensor is three-dimensional data.
Step 204, extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are the features after the multi-scale image features are fused.
The server can extract the image features in the multi-view image data based on a backbone network in a Transformer network architecture to obtain the multi-scale image features. The backbone network generally refers to a network used for extracting features, here image features, and may be, for example, a convolutional neural network. The backbone network may be one of CSPDarkNet (Cross Stage Partial DarkNet), ResNet (Residual Neural Network), VGGNet (Visual Geometry Group Network), AlexNet, GoogLeNet, MobileNet, and the like.
The multi-scale image features refer to image features at multiple scales, which can be understood as image features of different sizes obtained by applying different downsampling factors. Illustratively, the original image data is downsampled by factors of 4, 8, 16, and 32 to obtain the corresponding multi-scale image features.
Optionally, in the multi-scale image features, the image features of each scale include dimensions such as the number of samples, the number of channels, and the height and width of the feature image.
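For illustration only (not part of the original disclosure), the following sketch shows one way such multi-scale image features could be produced: a toy convolutional backbone emits feature maps at strides 4, 8, 16, and 32 for a batch of surround-view images. The network structure, channel counts, and image resolution are assumptions standing in for a real backbone such as ResNet or CSPDarkNet.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy stand-in for the backbone: each stage halves the resolution,
    yielding feature maps at strides 4, 8, 16 and 32."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        def stage(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.stem = stage(in_ch, base, 4)          # stride 4
        self.s8   = stage(base, base * 2, 2)       # stride 8
        self.s16  = stage(base * 2, base * 4, 2)   # stride 16
        self.s32  = stage(base * 4, base * 8, 2)   # stride 32

    def forward(self, x):
        c4 = self.stem(x)
        c8 = self.s8(c4)
        c16 = self.s16(c8)
        c32 = self.s32(c16)
        return {"stride4": c4, "stride8": c8, "stride16": c16, "stride32": c32}

# Six surround-view cameras, flattened into the batch dimension.
imgs = torch.randn(6, 3, 256, 512)
feats = TinyBackbone()(imgs)
for name, f in feats.items():
    # Each feature has the dimensions mentioned above: sample count, channel count, height, width.
    print(name, tuple(f.shape))
```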
The server determines the multi-view image features according to the multi-scale image features, where the multi-view image features are the fused features of the multi-scale image features. Optionally, the multi-scale image features are fused directly to obtain the multi-view image features, or the same features corresponding to different views in the multi-scale image features are fused to obtain the multi-view image features. That is, a multi-view image feature may characterize the different appearances of the same feature under multiple views. Optionally, the same feature across different viewing angles in the multi-scale image features may be extracted through a multi-scale deformable Transformer and weighted-fused to obtain the multi-view image features.
And the server determines a first perception result corresponding to the perception task according to the multi-scale image characteristics. In this embodiment, the sensing tasks include a plurality of sensing tasks, and different sensing tasks extract multi-scale image sub-features matched with the sensing tasks from the multi-scale image features and predict the multi-scale image sub-features, so as to determine a first sensing result corresponding to the sensing tasks. In general, different sensing tasks correspond to different first sensing results, and the first sensing results may include two-dimensional target detection results, two-dimensional semantic segmentation results, depth estimation results, and the like. Optionally, a neck network in the sensing task may be used to extract a multi-scale image sub-feature matched with the sensing task from the multi-scale image features, and then a head network is used to perform target prediction on the multi-scale image sub-feature, so as to obtain a first sensing result corresponding to the sensing task.
And step 206, mapping the multi-view image features into image aerial view features according to the first perception result.
In this embodiment, the server may map the multi-view image features into image bird's-eye-view features according to the first perception result. Here, the Bird's Eye View (BEV) can be understood as the projection of a point cloud onto a plane perpendicular to the height direction. In the field of autonomous driving, it is preferable to express the perception result in ground-plane space so that it can be used by downstream modules without dimension conversion.
Optionally, the multi-view image feature may be mapped to the image bird's eye view feature according to the depth information in the first perception result. And mapping the multi-view image features into corresponding image aerial view features according to different depth information intervals.
And 208, determining the aerial view characteristics of the point cloud corresponding to the point cloud data according to the point cloud data.
In this embodiment, the corresponding point cloud features can be obtained from the point cloud data, and the point cloud bird's-eye-view features corresponding to the point cloud data can be determined from the point cloud features. For example, a point cloud network (e.g., PointNet) may be used to obtain the point cloud features. The corresponding point cloud bird's-eye-view features may also be obtained from the point cloud data by other methods, which are not limited herein. In one embodiment, the point cloud data may be collected by a laser radar, an ultrasonic radar, or a millimeter wave radar; further, the point cloud data may be provided by any radar sensor device that meets the accuracy requirements of autonomous driving perception, and the specific sensor category is not limited herein.
Optionally, a point cloud coordinate system corresponding to the point cloud data and a bird's-eye-view coordinate system corresponding to the bird's-eye view are created; the coordinates of each point are mapped into the bird's-eye-view coordinate system to obtain a corresponding pixel position in the bird's-eye view, the height value of the point is filled into the pixel at that position as its pixel value, and the pixel values are rescaled to the range 0 to 255 to obtain the mapped bird's-eye view, from which the point cloud bird's-eye-view features are then extracted.
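A minimal sketch of this optional rasterization is given below; the forward range of 0-80 meters, lateral range of plus or minus 40 meters, and 0.25-meter grid resolution are illustrative assumptions, not values taken from the disclosure.

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 80.0), y_range=(-40.0, 40.0),
                       z_range=(-3.0, 3.0), resolution=0.25):
    """Rasterise an (N, 3) point cloud into a height-encoded bird's-eye view:
    each point's (x, y) selects a BEV pixel, its height z fills that pixel,
    and the heights are rescaled to 0-255 as described above."""
    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    bev = np.full((h, w), z_range[0], dtype=np.float32)   # empty cells = lowest height

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[keep], y[keep], z[keep]

    rows = ((x - x_range[0]) / resolution).astype(np.int64)
    cols = ((y - y_range[0]) / resolution).astype(np.int64)
    np.maximum.at(bev, (rows, cols), z)        # keep the highest point per pixel

    bev = (bev - z_range[0]) / (z_range[1] - z_range[0]) * 255.0
    return np.clip(bev, 0, 255).astype(np.uint8)

points = np.random.uniform(low=[0, -40, -3], high=[80, 40, 3], size=(10000, 3))
print(point_cloud_to_bev(points).shape)        # (320, 320) height-encoded BEV image
```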
And step 210, fusing the image aerial view characteristics and the point cloud aerial view characteristics to obtain target aerial view characteristics.
In this embodiment, the image aerial view feature and the point cloud aerial view feature are fused to obtain a target aerial view feature. Optionally, after the image aerial view feature and the point cloud aerial view feature are aligned, the image aerial view feature and the point cloud aerial view feature are subjected to weighted fusion to obtain the target aerial view feature.
And step 212, determining a target perception result according to the target aerial view characteristics.
In this embodiment, the target perception result may be determined according to the target bird's-eye view image feature. Optionally, the target bird's-eye view feature can be predicted to obtain a target perception result. Or, the three-dimensional target detection may be performed on the target bird's-eye view feature to obtain a second sensing result, and the target sensing result may be determined according to the first sensing result and the second sensing result, for example, the first sensing result and the second sensing result may be directly used as the target sensing result.
The perception result acquisition method comprises the steps of acquiring multi-view image data and point cloud data of the same scene; extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are the features after the multi-scale image features are fused; mapping the multi-view image features into image aerial view features according to the first perception result; determining the aerial view characteristics of the point cloud corresponding to the point cloud data according to the point cloud data; fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic; and determining a target perception result according to the target aerial view characteristics. In the embodiment, the multi-view image data and the point cloud data are respectively subjected to feature extraction, corresponding image aerial view features and point cloud aerial view features are obtained, the image aerial view features and the point cloud aerial view features are fused to obtain target aerial view features, an accurate target perception result can be determined according to the target aerial view features, the target perception result can simultaneously comprise two-dimensional and three-dimensional perception results, and various types of target perception results can be simultaneously output.
In one embodiment, as shown in fig. 3, determining the multi-view image feature from the multi-scale image feature in step 204 includes the following steps 302 to 304.
Step 302, extracting the same feature corresponding to different view angles in the multi-scale image feature.
The multi-scale image features correspond to different viewing angles, and the same feature is extracted from the multi-scale image features of the different viewing angles. Camera sensors at different viewing angles are camera sensors at different positions; that is, camera sensors at different positions correspond to image features of different viewing angles, and the image features of one viewing angle include multi-scale image features. The same feature can therefore be extracted from the multi-scale image features of different viewing angles.
And step 304, performing weighted fusion on the same features corresponding to different visual angles to obtain multi-visual-angle image features.
After the same features corresponding to different viewing angles are obtained, the corresponding weights can be determined according to the proportion of the same features corresponding to different viewing angles or the size of the features, and the same features are subjected to weighted fusion according to the corresponding weights to obtain the multi-view image features.
In this embodiment, the same features corresponding to different viewing angles in the multi-scale image features are extracted and weighted-fused to obtain the multi-view image features, so that each shared feature in the multi-view image features fuses the observations from multiple viewing angles and the multi-view image features carry information from different viewing angles.
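The following sketch illustrates the idea of weighted cross-view fusion with a simple learned per-view weight; it is only a stand-in for the multi-scale deformable-Transformer fusion mentioned in the description, and the channel and resolution numbers are assumed.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Minimal sketch of step 304: fuse the 'same feature' observed from
    several camera views with learned weights."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # per-view weight logit

    def forward(self, per_view_feats):
        # per_view_feats: (num_views, C, H, W) features of the same scene content
        logits = self.score(per_view_feats)            # (num_views, 1, H, W)
        weights = torch.softmax(logits, dim=0)         # normalise across views
        fused = (weights * per_view_feats).sum(dim=0)  # (C, H, W) multi-view feature
        return fused

fusion = CrossViewFusion(channels=64)
views = torch.randn(6, 64, 32, 64)    # six views of one shared multi-scale feature
print(fusion(views).shape)            # torch.Size([64, 32, 64])
```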
In one embodiment, as shown in fig. 4, the determining, in step 204, a first perception result corresponding to the perception task according to the multi-scale image feature includes the following steps 402 to 404.
And 402, performing corresponding feature extraction on the multi-scale image features according to the type of the perception task to obtain perception features corresponding to the perception task.
In the embodiment, the method comprises a plurality of perception tasks, and corresponding feature extraction can be performed on the multi-scale image features according to the type of the perception tasks to obtain perception features corresponding to the perception tasks. For one perception task, extracting corresponding multi-scale image sub-features from the multi-scale image features according to the type of the perception task, and taking the multi-scale image sub-features as perception features corresponding to the perception task. For example, when the perception task is a traffic light detection task, color features at the position of a traffic light in the multi-scale image features are extracted as perception features corresponding to the traffic light detection task. For another example, when the perception task is a depth estimation task, the distance between the front vehicle and the rear vehicle in the multi-scale image features is extracted, and the distance between the front vehicle and the rear vehicle is used as the perception feature corresponding to the depth estimation task.
And step 404, predicting the perception characteristics to obtain a first perception result corresponding to the perception task.
The server predicts the perception features to obtain the first perception result corresponding to the perception task, where each type of perception task corresponds to one type of first perception result. For example, when the perception task is a traffic light detection task, the color features at the positions of the traffic lights are extracted, the color features at the positions of the red, yellow, and green lights are predicted, and a color prediction result of the traffic lights is obtained; this color prediction result is the first perception result corresponding to the traffic light detection task.
In this embodiment, corresponding feature extraction is performed on the multi-scale image features according to the type of the sensing task to obtain the sensing features corresponding to the sensing task, and the same multi-scale image features can be used to implement processing of multiple sensing tasks, so that the corresponding first sensing result is obtained, the computing power can be saved, and the utilization rate of computing resources can be improved.
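As a hedged illustration of how several perception tasks could share the same multi-scale image features, the sketch below attaches a small neck-plus-head module per task; the task names, channel sizes, and output dimensions are hypothetical and not taken from the patent.

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Hypothetical neck + head for one perception task, run on the shared
    multi-scale image features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.neck = nn.Sequential(nn.Conv2d(in_ch, 128, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.head = nn.Conv2d(128, out_ch, 1)

    def forward(self, feat):
        return self.head(self.neck(feat))

shared_feat = torch.randn(6, 256, 32, 64)           # a stride-16 feature of six views
heads = nn.ModuleDict({
    "traffic_light": TaskHead(256, 4),   # e.g. red / yellow / green / off scores
    "depth":         TaskHead(256, 1),   # per-pixel depth estimate
    "segmentation":  TaskHead(256, 10),  # semantic classes
})
# One shared feature, several first perception results: one per task.
first_results = {task: head(shared_feat) for task, head in heads.items()}
print({k: tuple(v.shape) for k, v in first_results.items()})
```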
In one embodiment, as shown in fig. 5, the first perception result includes depth information; and step 206, according to the first perception result, mapping the multi-view image feature into an image aerial view feature, wherein the step includes steps 502 to 504.
Step 502, grouping the multi-view image features according to the depth information to obtain a multi-view image feature group; and different multi-view image feature groups correspond to different depth information intervals.
The server groups the multi-view image features according to the depth information to obtain a plurality of multi-view image feature groups, where different multi-view image feature groups correspond to different depth information intervals. In this embodiment, the number of groups may be determined according to the depth information, the depth information may be divided into a corresponding number of depth information intervals according to the number of groups, and the multi-view image features may be divided into a corresponding number of multi-view image feature groups, where one multi-view image feature group corresponds to one depth information interval.
Illustratively, the maximum depth value in the first perception result is 10 meters, the preset grouping number is 5 groups, and then the depth values are correspondingly divided into 5 depth intervals, i.e. 0-2 meters, 2-4 meters, 4-6 meters, 6-8 meters and 8-10 meters, the multi-view image features are correspondingly divided into 5 multi-view image feature groups, the first multi-view image feature group corresponds to 0-2 meters, the second multi-view image feature group corresponds to 2-4 meters, and so on.
And step 504, determining the aerial view characteristics of the image according to the characteristics of the multi-view image characteristic set and the depth information interval corresponding to the multi-view image characteristic set.
In this embodiment, the bird's-eye view image feature of the image may be determined according to the feature of the multi-view image feature group and the depth information interval corresponding to the multi-view image feature group. Alternatively, the features of the multi-view image feature group and the depth information interval corresponding to the multi-view image feature group may be used as the corresponding image bird's-eye view features, and all the multi-view image feature groups and the corresponding depth information intervals may be used to obtain all the image bird's-eye view features. Illustratively, the description is given for one multi-view image feature set, and the features of the multi-view image feature set and the depth information in the depth information interval corresponding to the multi-view image feature set are taken as the corresponding image bird's eye view features.
In this embodiment, the multi-view image features are grouped according to the depth information to obtain the multi-view image feature groups, and the image bird's-eye-view features are determined according to the features of each group and the depth information in its corresponding interval; the image bird's-eye-view features thus contain multi-view image feature groups at different depths and can be differentiated in the depth dimension.
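The sketch below illustrates the grouping idea in code: pixels are assigned to depth-interval groups (here 5 groups covering 0-10 meters, matching the example above) and their features are accumulated into the corresponding bird's-eye-view rows. The flat column-to-BEV mapping is a deliberate simplification for illustration only; an actual implementation would use the camera geometry to place each feature.

```python
import torch

def image_feats_to_bev(feats, depth, num_groups=5, max_depth=10.0, bev_cols=64):
    """Toy illustration of steps 502-504.
    feats: (C, H, W) image features; depth: (H, W) predicted depth per pixel."""
    C, H, W = feats.shape
    bev = torch.zeros(C, num_groups, bev_cols)
    count = torch.zeros(1, num_groups, bev_cols)

    # Depth interval index per pixel: 0 -> 0-2 m, 1 -> 2-4 m, ..., 4 -> 8-10 m.
    group = torch.clamp((depth / max_depth * num_groups).long(), 0, num_groups - 1)
    # Simplified lateral placement: image column mapped linearly to a BEV column.
    col = torch.linspace(0, bev_cols - 1, W).long().expand(H, W)

    idx = (group * bev_cols + col).reshape(-1)            # flattened BEV cell index
    flat = feats.reshape(C, -1)                           # (C, H*W)
    bev = bev.reshape(C, -1).index_add(1, idx, flat)      # accumulate features per cell
    count = count.reshape(1, -1).index_add(1, idx, torch.ones(1, H * W))
    bev = (bev / count.clamp(min=1)).reshape(C, num_groups, bev_cols)
    return bev   # rows 0..4 correspond to the 0-2 m, 2-4 m, ..., 8-10 m intervals

feats = torch.randn(64, 32, 64)
depth = torch.rand(32, 64) * 10.0
print(image_feats_to_bev(feats, depth).shape)   # torch.Size([64, 5, 64])
```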
In one embodiment, as shown in fig. 6, the step 212 of determining the target perception result according to the target bird's eye view feature includes steps 602 to 604.
And step 602, performing three-dimensional target detection on the target aerial view characteristics to obtain a second perception result.
The purpose of three-dimensional object detection is to identify objects of interest and determine the location and class of the objects. For example, three-dimensional target detection of the target bird's eye view feature can be realized through an anchor-free algorithm, and a second perception result is obtained. Correspondingly, the second perception result comprises information such as the position, distance, depth and angle of the object around the automatic driving vehicle.
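A possible shape of such an anchor-free detection head is sketched below (a CenterPoint-style center heatmap plus box regression); the layer sizes and the seven-value box encoding are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class BEVDetectionHead(nn.Module):
    """Sketch of an anchor-free 3D detection head applied to the target
    bird's-eye-view feature."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        def branch(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_ch, 1))
        self.heatmap = branch(num_classes)   # object-centre likelihood per class
        self.box = branch(7)                 # x, y, z, w, l, h, yaw regression

    def forward(self, bev_feat):
        return {"heatmap": self.heatmap(bev_feat).sigmoid(),
                "boxes": self.box(bev_feat)}

head = BEVDetectionHead(in_ch=128, num_classes=3)
out = head(torch.randn(1, 128, 200, 200))
print(out["heatmap"].shape, out["boxes"].shape)   # second perception result tensors
```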
And step 604, determining a target sensing result according to the first sensing result and the second sensing result.
In this embodiment, the first sensing result and the second sensing result may be fused to obtain the target sensing result, or the first sensing result and the second sensing result may be directly used as the target sensing result, so that the obtained target sensing result includes two-dimensional and three-dimensional sensing results at the same time.
In some embodiments, fusing the image aerial view feature and the point cloud aerial view feature to obtain a target aerial view feature comprises: and carrying out self-adaptive weighting on the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic.
In this embodiment, adaptive weighting may be performed on the image bird's-eye-view features and the point cloud bird's-eye-view features according to different application scenarios to obtain the target bird's-eye-view features. Optionally, a fusion model of the image bird's-eye-view features and the point cloud bird's-eye-view features can be trained in advance, and the model then weights the two kinds of features by assigning different weights to different recognized scenes. Alternatively, the image bird's-eye-view features and the point cloud bird's-eye-view features may be adaptively weighted according to the requirements on the target bird's-eye-view features. For example, if semantic information such as color needs to be emphasized in the target bird's-eye-view features, the image bird's-eye-view features are given more weight; conversely, if a richer spatial representation is needed, i.e., the perception of three-dimensional space must be very accurate, the point cloud bird's-eye-view features are given more weight.
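A minimal sketch of such adaptive weighting, assuming a small gating network that predicts a per-location blend weight from the two bird's-eye-view features (the gating design and channel sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AdaptiveBEVFusion(nn.Module):
    """Sketch of adaptive weighting: a gate in (0, 1) blends the image BEV
    feature (semantics) with the point cloud BEV feature (geometry)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, img_bev, pc_bev):
        w = self.gate(torch.cat([img_bev, pc_bev], dim=1))
        # Larger w favours the image branch; smaller w favours the point cloud branch.
        return w * img_bev + (1.0 - w) * pc_bev

fusion = AdaptiveBEVFusion(channels=128)
target_bev = fusion(torch.randn(1, 128, 200, 200), torch.randn(1, 128, 200, 200))
print(target_bev.shape)   # fused target bird's-eye-view feature
```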
In some embodiments, the sensing task includes at least one of a traffic light detection task, a vehicle head and vehicle tail detection task, a pedestrian detection task, a depth detection task, a lane line detection task, a sign board detection task, a ground sign detection task, and a telegraph pole detection task, that is, one or more sensing tasks may be processed simultaneously, and output corresponding to one or more sensing tasks, so that output of the sensing tasks of the automatically driven vehicle can be realized according to actual needs.
In one embodiment, as shown in fig. 7, the sensing result obtaining method includes the following steps (1) to (7).
(1) Multi-view image data and point cloud data of the same scene are acquired.
(2) Extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are the features after the multi-scale image features are fused.
(3) And mapping the multi-view image features into image aerial view features according to the first perception result.
(4) And extracting point cloud characteristics corresponding to the point cloud data according to the point cloud data, and determining corresponding point cloud aerial view characteristics according to the point cloud characteristics.
(5) And fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic.
(6) And carrying out three-dimensional target detection on the target aerial view characteristics to obtain a second perception result.
(7) And taking the first perception result and the second perception result as target perception results.
In the perception result obtaining method in this embodiment, the multi-view image data and the point cloud data are subjected to feature extraction respectively, corresponding image aerial view features and point cloud aerial view features are obtained, the image aerial view features and the point cloud aerial view features are fused to obtain target aerial view features, a second perception result is obtained according to the target aerial view features, a two-dimensional first perception result and a three-dimensional second perception result can be obtained simultaneously, and a comprehensive and accurate perception output result can be obtained.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a part of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turns or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a sensing result obtaining apparatus for implementing the above-mentioned sensing result obtaining method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so that the specific limitations in one or more embodiments of the apparatus for obtaining a sensing result provided below may refer to the limitations on the method for obtaining a sensing result in the above description, and are not described herein again.
In one embodiment, as shown in fig. 8, there is provided a sensing result obtaining apparatus including: a data acquisition module 802, a first perception module 804, a first extraction module 806, a second extraction module 808, a feature fusion module 810, and a second perception module 812, wherein:
a data acquiring module 802, configured to acquire multi-view image data and point cloud data of the same scene;
the first perception module 804 is configured to extract image features in the multi-view image data to obtain multi-scale image features, and determine the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
a first extraction module 806, configured to map the multi-view image feature into an image bird's-eye view feature according to the first sensing result;
a second extraction module 808, configured to determine, according to the point cloud data, a point cloud aerial view feature corresponding to the point cloud data;
the feature fusion module 810 is configured to fuse the image aerial view feature and the point cloud aerial view feature to obtain a target aerial view feature;
and a second perception module 812, configured to determine a target perception result according to the target bird's-eye view feature.
In one embodiment, the first sensing module 804 is further configured to:
extracting the same characteristics corresponding to different visual angles in the multi-scale image characteristics;
and performing weighted fusion on the same characteristics corresponding to the different visual angles to obtain the multi-visual-angle image characteristics.
In one embodiment, the first sensing module 804 is further configured to:
performing corresponding feature extraction on the multi-scale image features according to the type of the perception task to obtain perception features corresponding to the perception task;
and predicting the perception characteristics to obtain a first perception result corresponding to the perception task.
In one embodiment, the first perception result comprises depth information; the first extraction module 806 is further configured to:
grouping the multi-view image features according to the depth information to obtain a multi-view image feature group; the different multi-view image feature groups correspond to different depth information intervals;
and determining the aerial view characteristics of the image according to the characteristics of the multi-view image characteristic set and the depth information interval corresponding to the multi-view image characteristic set.
In one embodiment, the second sensing module 812 is further configured to:
carrying out three-dimensional target detection on the target aerial view characteristics to obtain a second sensing result;
and determining the target perception result according to the first perception result and the second perception result.
In one embodiment, the feature fusion module 810 is further configured to:
and carrying out self-adaptive weighting on the image aerial view characteristics and the point cloud aerial view characteristics to obtain the target aerial view characteristics.
All or part of the modules in the above perception result obtaining apparatus can be implemented by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, or can be stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement the perception result obtaining method.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the sensing result obtaining method in the foregoing embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the sensing result obtaining method in the above embodiments.
In one embodiment, a computer program product is provided, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the perception result obtaining method in the foregoing embodiments.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A sensing result obtaining method is characterized by comprising the following steps:
acquiring multi-view image data and point cloud data of the same scene;
extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
mapping the multi-view image features into image aerial view features according to the first perception result;
determining a point cloud aerial view characteristic corresponding to the point cloud data according to the point cloud data;
fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and determining a target perception result according to the target aerial view characteristics.
2. The method of claim 1, wherein determining the multi-view image feature from the multi-scale image feature comprises:
extracting the same characteristics corresponding to different visual angles in the multi-scale image characteristics;
and performing weighted fusion on the same characteristics corresponding to the different visual angles to obtain the multi-visual-angle image characteristics.
3. The method according to claim 1, wherein the determining a first perception result corresponding to a perception task according to the multi-scale image features comprises:
performing corresponding feature extraction on the multi-scale image features according to the type of the perception task to obtain perception features corresponding to the perception task;
and predicting the perception characteristics to obtain a first perception result corresponding to the perception task.
4. The method of claim 1, wherein the first perceptual result comprises depth information; the mapping the multi-view image features into image aerial view features according to the first perception result comprises:
grouping the multi-view image features according to the depth information to obtain a multi-view image feature group; the different multi-view image feature groups correspond to different depth information intervals;
and determining the aerial view characteristics of the image according to the characteristics of the multi-view image characteristic set and the depth information interval corresponding to the multi-view image characteristic set.
5. The method of claim 1, wherein determining a target perception result from the target aerial view feature comprises:
carrying out three-dimensional target detection on the target aerial view characteristics to obtain a second sensing result;
and determining the target perception result according to the first perception result and the second perception result.
6. The method of claim 1, wherein said fusing the image bird's eye view feature and the point cloud bird's eye view feature to obtain a target bird's eye view feature comprises:
and carrying out self-adaptive weighting on the image aerial view characteristics and the point cloud aerial view characteristics to obtain the target aerial view characteristics.
7. A sensing result obtaining apparatus, comprising:
the data acquisition module is used for acquiring multi-view image data and point cloud data of the same scene;
the first perception module is used for extracting image features in the multi-view image data to obtain multi-scale image features, and determining the multi-view image features and a first perception result corresponding to a perception task according to the multi-scale image features; the multi-view image features are fused features of the multi-scale image features;
the first extraction module is used for mapping the multi-view image features into image aerial view features according to the first perception result;
the second extraction module is used for determining the point cloud aerial view characteristics corresponding to the point cloud data according to the point cloud data;
the characteristic fusion module is used for fusing the image aerial view characteristic and the point cloud aerial view characteristic to obtain a target aerial view characteristic;
and the second perception module is used for determining a target perception result according to the target aerial view characteristic.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202210898481.7A 2022-07-28 2022-07-28 Perception result obtaining method and device, computer equipment and storage medium Pending CN115240168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210898481.7A CN115240168A (en) 2022-07-28 2022-07-28 Perception result obtaining method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210898481.7A CN115240168A (en) 2022-07-28 2022-07-28 Perception result obtaining method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115240168A (en) 2022-10-25

Family

ID=83677016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210898481.7A Pending CN115240168A (en) 2022-07-28 2022-07-28 Perception result obtaining method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115240168A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116363615A (en) * 2023-03-27 2023-06-30 小米汽车科技有限公司 Data fusion method, device, vehicle and storage medium
CN116363615B (en) * 2023-03-27 2024-02-23 小米汽车科技有限公司 Data fusion method, device, vehicle and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination