WO2024045942A1 - Environmental information sensing method, device, system, computer equipment and storage medium - Google Patents

Environmental information sensing method, device, system, computer equipment and storage medium

Info

Publication number
WO2024045942A1
WO2024045942A1 (Application PCT/CN2023/108450)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
radar
visual
image
features
Prior art date
Application number
PCT/CN2023/108450
Other languages
English (en)
French (fr)
Inventor
李若瑶
黄河
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2024045942A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4038: Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular to an environmental information sensing method, device, system, computer equipment and storage medium.
  • Autonomous driving technology aims to enable vehicles to have the ability to make autonomous judgments, autonomous controls, and autonomous driving, and has broad development prospects.
  • Generally, an automatic driving system consists of three subsystems: environmental perception, planning and decision-making, and control. Among them, the perception system understands the surrounding environment of the vehicle and identifies and locates targets and obstacles, making it an important part of automatic driving.
  • Perception of the surrounding environment is usually achieved through different sensors.
  • These sensors include optical cameras, lidar, four-dimensional (4D) millimeter-wave radar, and so on.
  • The image acquired by an optical camera is purely visual data and lacks depth information about target objects in the environment.
  • Moreover, the imaging quality of an optical camera is easily affected by the external environment and is therefore unstable. Radar can better obtain the three-dimensional spatial information of the environment, but it cannot capture the texture and color of targets in the environment, and its detection data is affected by the detection distance. Therefore, environmental perception based on purely visual data (referred to as visual images in this application) alone, or on radar data alone, is unstable, resulting in poor environmental information perception and high safety risks for autonomous driving.
  • Embodiments of the present application provide an environmental information sensing method, device, system, computer equipment and storage medium, which can improve the stability and accuracy of environmental sensing and improve the effect of sensing environmental information.
  • the technical solution is as follows:
  • An environmental information sensing method includes: obtaining radar data and visual images in the driving environment; determining a depth map of the radar data; performing a first weighted fusion process on the depth map and the visual image to obtain a first visual feature, where the first visual feature is used to characterize the depth information and visual information corresponding to the visual image; performing a second weighted fusion process on a first radar feature of the radar data and the first visual feature to obtain a fusion feature, where the first radar feature is used to characterize the spatial context information of the radar data in three-dimensional space; and performing environmental perception based on the fusion feature to obtain environmental information in the driving environment.
  • An environmental information sensing device includes: a data acquisition module for acquiring radar data and visual images in the driving environment; a depth map determination module for determining a depth map of the radar data; a first weighted fusion module for performing a first weighted fusion process on the depth map and the visual image to obtain a first visual feature, where the first visual feature is used to characterize the depth information and visual information corresponding to the visual image;
  • a second weighted fusion module for performing a second weighted fusion process on a first radar feature of the radar data and the first visual feature to obtain a fusion feature, where the first radar feature is used to characterize the spatial context information of the radar data in three-dimensional space; and an environment perception module for performing environment perception based on the fusion feature to obtain environmental information in the driving environment.
  • An environmental information sensing system includes an image acquisition device, a radar device, and an environment sensing device. The image acquisition device is used to collect visual images in the driving environment and send the visual images to the environment sensing device; the radar device is used to collect radar data in the driving environment and send the radar data to the environment sensing device; the environment sensing device is used to receive the visual images sent by the image acquisition device and the radar data sent by the radar device, determine a depth map of the radar data, perform a first weighted fusion process on the depth map and the visual image to obtain a first visual feature (used to characterize the depth information and visual information corresponding to the visual image), perform a second weighted fusion process on a first radar feature of the radar data and the first visual feature to obtain a fusion feature (where the first radar feature is used to characterize the spatial context information of the radar data in three-dimensional space), and perform environmental perception based on the fusion feature to obtain environmental information in the driving environment.
  • A computer device includes a processor and a memory.
  • The memory stores at least one computer program.
  • The at least one computer program is loaded and executed by the processor to implement the above environmental information sensing method.
  • A computer-readable storage medium stores at least one computer program, and the computer program is loaded and executed by a processor to implement the above environmental information sensing method.
  • A computer program product includes at least one computer program.
  • The computer program is loaded and executed by a processor to implement the environmental information sensing method provided in the above optional implementations.
  • Figure 1 shows a schematic diagram of an environmental information sensing system corresponding to an environmental information sensing method provided by an exemplary embodiment of the present application
  • Figure 2 shows a flow chart of an environmental information sensing method according to an exemplary embodiment of the present application
  • Figure 3 shows a flow chart of another environmental information sensing method according to an exemplary embodiment of the present application
  • Figure 4 shows a schematic diagram of the acquisition process of fusion features provided by an exemplary embodiment of the present application
  • Figure 5 shows a schematic diagram of an environment perception model implemented in an exemplary embodiment of the present application
  • Figure 6 shows a schematic diagram of an environmental information sensing module provided by an exemplary embodiment of the present application
  • Figure 7 shows a schematic diagram of the implementation process of the environmental information sensing method in a closed single scenario provided by an exemplary embodiment of the present application
  • Figure 8 shows a schematic diagram of a fusion feature acquisition process provided by an exemplary embodiment of the present application.
  • Figure 9 shows a schematic diagram of the implementation process of the environmental information sensing method in an open complex scenario provided by an exemplary embodiment of the present application.
  • Figure 10 shows a schematic diagram of another fusion feature acquisition process provided by an exemplary embodiment of the present application.
  • Figure 11 shows a block diagram of an environmental information sensing device provided by an exemplary embodiment of the present application.
  • Figure 12 is a structural block diagram of a computer device according to an exemplary embodiment.
  • Figure 1 shows a schematic diagram of an environmental information sensing system corresponding to the environmental information sensing method provided by an exemplary embodiment of the present application.
  • the system may include an image acquisition device 110, a radar device 120 and an environmental information sensing device 130.
  • the image acquisition device 110 and the radar device 120 may be sensors installed on a target device; illustratively, the target device may be a vehicle, ship, or other means of transportation in a driving scenario.
  • The image acquisition device 110 is a device with an image acquisition function, such as an optical camera, a video camera, or a terminal with an image acquisition function.
  • the image acquisition device 110 is used to collect visual images in the driving environment and send the visual images.
  • the radar device 120 may be a device that emits beams to the surrounding environment and receives reflected echoes to form radar data.
  • The radar device 120 is used to collect radar data in the driving environment and send the radar data to the environmental information sensing device 130.
  • Illustratively, the radar device 120 may be a lidar, a 4D millimeter-wave radar, or the like, and the radar data may be point cloud data.
  • The environmental information sensing device 130 can communicate with the image acquisition device 110 and the radar device 120 through a wired or wireless network. It is used to receive the visual images sent by the image acquisition device 110 and the radar data sent by the radar device 120, perform feature extraction and multi-level weighted fusion on the received visual images and radar data to obtain fusion features, and then realize environmental information perception of the driving environment based on the fusion features.
  • Specifically, the process includes: receiving the visual image sent by the image acquisition device and the radar data sent by the radar device; determining the depth map of the radar data; performing a first weighted fusion process on the depth map and the visual image to obtain the first visual feature, where the first visual feature is used to characterize the depth information and visual information corresponding to the visual image; performing a second weighted fusion process on the first radar feature of the radar data and the first visual feature to obtain the fusion feature, where the first radar feature is used to characterize the spatial context information of the radar data in three-dimensional space; and performing environmental perception based on the fusion feature to obtain environmental information in the driving environment.
  • the environment information sensing device 130 can be implemented as a terminal or a server.
  • When implemented as a server, the environmental information sensing device 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
  • When the environmental information sensing device 130 is a terminal, it can be implemented as a smartphone, tablet computer, laptop computer, desktop computer, vehicle-mounted terminal, or the like.
  • This application does not limit the implementation form of the environmental information sensing device 130.
  • FIG 2 shows a flow chart of an environmental information sensing method according to an exemplary embodiment of the present application.
  • The environmental information sensing method can be executed by a computer device.
  • The computer device can be implemented as the environmental information sensing device 130 shown in Figure 1. As shown in Figure 2, the environmental information sensing method may include the following steps:
  • Step 210 Obtain radar data and visual images in the driving environment.
  • the driving environment may be a driving environment in an autonomous driving scenario or a driving environment in an assisted driving scenario, which is not limited by this application.
  • In the embodiment of the present application, radar data is collected through a radar device and visual images are collected through an image acquisition device; in the driving scene, the computer device can receive the visual images and radar data uploaded in real time by the image acquisition device and radar device that have established a communication relationship with it, thereby obtaining the visual images and the radar data.
  • the radar device and the visual image acquisition device can be installed on the target device; schematically, the target device can be a vehicle equipped with an optical camera and a radar. This application does not limit the implementation form of the target device.
  • Step 220 Determine the depth map of the radar data.
  • In the embodiment of the present application, the process of determining the depth map of the radar data can be implemented as: mapping the radar data to the image plane to obtain the pixel position of each radar data point; determining the pixel value at each pixel position based on the horizontal distance between the spatial position of the radar data point and the radar device; and generating the depth map based on the pixel positions and the pixel values at those positions.
  • In the embodiment of the present application, before determining the depth map of the radar data, the method also includes: performing parameter calibration and time synchronization between the radar device and the image acquisition device to obtain a projection transformation matrix. The projection transformation matrix is used to perform the planar mapping of the radar data, so that the depth map obtained by the plane mapping is aligned with the coordinates of the visual image collected by the image acquisition device.
  • Illustratively, given the position vectors of the three-dimensional (3D) point cloud data (i.e., the radar data), the computer device can project the point cloud data onto the image plane according to the known projection transformation matrix to obtain the corresponding position of each point in the image plane.
  • It then calculates the horizontal distance between the spatial position of each point and the radar device, and uses this distance value to obtain the pixel value of the pixel at the corresponding position in the image plane.
  • The distance values can be linearly scaled so that the pixel value corresponding to each point lies in the range 0 to 255; based on the corresponding positions of the point cloud data in the image plane and the pixel values at those positions, a depth map in the form of a grayscale image is generated.
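  • As an illustration of the projection described above, the following is a minimal sketch assuming NumPy, a 3 × 4 projection transformation matrix P obtained from calibration, and a point cloud of shape (N, 3); the function and variable names are illustrative only and are not taken from the original text.
```python
import numpy as np

def point_cloud_to_depth_map(points, P, height, width):
    """Project 3D radar points onto the image plane and render a grayscale depth map."""
    # Homogeneous coordinates (N, 4) projected by the 3x4 matrix P -> (N, 3)
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    proj = pts_h @ P.T
    u = proj[:, 0] / proj[:, 2]
    v = proj[:, 1] / proj[:, 2]

    # Horizontal distance between each point and the radar device
    dist = np.linalg.norm(points[:, :2], axis=1)

    # Keep only points that project inside the image and in front of the camera
    mask = (u >= 0) & (u < width) & (v >= 0) & (v < height) & (proj[:, 2] > 0)
    u, v, dist = u[mask].astype(int), v[mask].astype(int), dist[mask]

    # Linearly scale the distances into the 0-255 grayscale range
    scaled = 255.0 * (dist - dist.min()) / max(dist.max() - dist.min(), 1e-6)

    depth_map = np.zeros((height, width), dtype=np.uint8)
    depth_map[v, u] = scaled.astype(np.uint8)   # sparse grayscale depth map
    return depth_map
```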
  • Step 230 Perform a first weighted fusion process on the depth map and the visual image to obtain a first visual feature, where the first visual feature is used to represent the depth information and visual information corresponding to the visual image.
  • In the embodiment of the present application, the computer device can perform a first weighted fusion process on the depth map of the radar data and the visual image, enhancing the visual features with the depth information, thereby obtaining a first visual feature that reflects both depth information and visual information.
  • Step 240 Perform a second weighted fusion process on the first radar feature and the first visual feature of the radar data to obtain a fusion feature, where the first radar feature is used to represent the spatial context information of the radar data in the three-dimensional space.
  • the spatial context information may include spatial information and distribution information.
  • In the embodiment of the present application, after obtaining the first radar feature and the first visual feature, in order to further realize information complementation between the multi-modal features and improve the accuracy of the perception task, the computer device performs a further weighted fusion on the first radar feature and the first visual feature, that is, a second weighted fusion, to obtain the fusion feature.
  • The fusion feature is used to represent environmental information in the driving environment. It should be noted that the environmental information indicated by the fusion feature is determined by the environmental perception task to be performed in the driving scene. Schematically, when the perception task is target detection, the fusion feature contains the information required to determine the location and category of the target object; when the perception task is lane line recognition, the fusion feature is used to represent the lane line position information in the driving environment.
  • Step 250 Perform environment perception based on the fusion features to obtain environmental information in the driving environment.
  • In the embodiment of the present application, environment perception can include, but is not limited to, perception tasks such as target detection and semantic segmentation.
  • The computer device can perform different environment perception operations on the fusion feature for different perception tasks, so as to complete the corresponding perception task and obtain the environmental information of the driving environment.
  • To sum up, in the environmental information sensing method provided by the embodiment of the present application, after acquiring the radar data and visual images in the driving environment, the depth map of the radar data is determined; a first weighted fusion is performed on the depth map and the visual image to obtain the first visual feature; a second weighted fusion is then performed on the first radar feature and the first visual feature to obtain the fusion feature; and environmental perception is performed based on the fusion feature to obtain the environmental information in the driving environment. In the above method, through multi-level weighted fusion of visual images and radar data, the computer device achieves full utilization of, and information complementation between, the data collected by different sensors, thereby improving the accuracy and robustness of the perception task and, in turn, driving safety.
  • FIG 3 shows a flow chart of another environmental information sensing method according to an exemplary embodiment of the present application.
  • The environmental information sensing method can be executed by a computer device.
  • The computer device can be implemented as the environmental information sensing device 130 shown in Figure 1. As shown in Figure 3, the environmental information sensing method may include the following steps:
  • Step 310 Obtain radar data and visual images in the driving environment.
  • In the embodiment of the present application, the computer device collects radar data through the radar device and collects visual images through the image acquisition device.
  • The radar data acquired by the radar device is point cloud data distributed in a three-dimensional space defined by length, width, and height.
  • The visual image acquired by the image acquisition device can be an RGB image or a grayscale image, which is not limited in this application.
  • Step 320 Perform feature encoding on the radar data to obtain the first radar feature of the radar data.
  • the first radar feature is used to characterize the spatial context information of the radar data in the three-dimensional space.
  • In the embodiment of the present application, the computer device can directly perform feature encoding on the acquired radar data point by point to obtain the first radar feature. Alternatively, to improve encoding efficiency and reduce computation, the computer device can spatially divide the three-dimensional space in which the radar data is located and perform feature encoding on the groups of radar data in the voxels obtained by the spatial division, so as to obtain enhanced radar features, that is, the first radar feature.
  • Schematically, the radar data is distributed in a space with length, width, and height of L × W × H, and the computer device can evenly divide this space into several cuboids at a granularity of V_L × V_W × V_H.
  • Each cuboid is called a voxel, and each voxel may contain several radar data points or none at all. In the embodiment of the present application, voxels containing radar data are called non-empty voxels. Feature encoding is performed on the radar data in each non-empty voxel, in units of voxels, to obtain the encoded feature of each non-empty voxel, so that the feature set composed of the features of multiple non-empty voxels is determined as the first radar feature.
  • In the embodiment of the present application, the process of feature encoding the radar data can be implemented as follows: divide the three-dimensional space of the point cloud data to obtain multiple voxels, where the multiple voxels contain at least one non-empty voxel and each non-empty voxel contains at least one point cloud data point; encode at least one group of sub-radar data in units of voxels to obtain the sub-radar feature corresponding to each group of sub-radar data, where the sub-radar data refers to the point cloud data in a non-empty voxel, the sub-radar feature contains the spatial context information of a local three-dimensional space, and the local three-dimensional space is the three-dimensional space occupied by the non-empty voxel corresponding to the sub-radar feature; and determine the feature set composed of at least one sub-radar feature as the first radar feature.
  • In the embodiment of the present application, when the computer device encodes at least one group of sub-radar data in units of voxels, it can input the sub-radar data into a feature encoding network to obtain the encoded feature (i.e., the sub-radar feature) corresponding to each non-empty voxel output by the feature encoding network, and determine the feature set composed of the sub-radar features as the first radar feature.
  • The feature encoding network may include multiple stacked fully connected layers that gradually expand the feature dimensions of the point cloud data during processing; the parameters and number of the fully connected layers can be trained and changed according to the data and tasks being processed, which is not limited in this application.
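  • The voxel-wise encoding described above can be sketched as follows, assuming PyTorch; the layer widths, the padding-based grouping, and the max-pooling over points are illustrative choices (in the spirit of VoxelNet-style encoders) rather than the exact network of the embodiments.
```python
import torch
import torch.nn as nn

class VoxelFeatureEncoder(nn.Module):
    """Encode the points of each non-empty voxel into one sub-radar feature."""
    def __init__(self, in_dim=3, hidden=64, out_dim=128):
        super().__init__()
        # Stacked fully connected layers that gradually expand the feature dimension
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.ReLU(),
        )

    def forward(self, voxel_points, num_points):
        # voxel_points: (V, T, 3) points grouped per non-empty voxel, zero-padded to T
        # num_points:   (V,) number of valid points in each voxel
        feats = self.mlp(voxel_points)                                  # (V, T, out_dim)
        valid = (torch.arange(feats.shape[1], device=feats.device)[None, :]
                 < num_points[:, None]).unsqueeze(-1)
        feats = feats * valid                                           # mask out padding
        voxel_feats, _ = feats.max(dim=1)                               # (V, out_dim)
        return voxel_feats   # the feature set forming the first radar feature
```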
  • Step 330 Determine the depth map of the radar data.
  • In the embodiment of the present application, the computer device can also perform depth completion on the depth map, where depth completion refers to interpolating the sparse depth map to reduce missing depth values. This yields richer depth information, which helps enrich the information contained in the first visual feature and thereby improves the accuracy of the overall environment perception task.
  • Step 340 Splice the single-channel image information of the depth map and at least one channel of image information of the visual image to obtain a spliced image.
  • the number of channels of the spliced image is the sum of the number of channels of the depth map and the number of channels of the visual image.
  • In the embodiment of the present application, in the process of splicing the depth map and the visual image, the computer device can first convert the depth map and the visual image into a depth data matrix and a visual data matrix, respectively. The data dimension of the depth data matrix is [height, width, 1]; when the visual image is an RGB image, the data dimension of the visual data matrix is [height, width, 3], where height × width is the number of pixels in the image, and 1 and 3 are the respective numbers of channels. After splicing the single-channel image information of the depth map with the multi-channel image information of the visual image, a spliced image with a data dimension of [height, width, 4] is obtained.
  • This application does not limit the splicing order of the single-channel image information of the depth map and the multi-channel image information of the visual image.
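  • A minimal sketch of this channel-wise splicing, assuming NumPy arrays in [height, width, channels] layout; the image sizes are illustrative.
```python
import numpy as np

rgb = np.zeros((480, 640, 3), dtype=np.uint8)     # visual image, 3 channels
depth = np.zeros((480, 640, 1), dtype=np.uint8)   # depth map, 1 channel

spliced = np.concatenate([rgb, depth], axis=-1)   # spliced image, shape (480, 640, 4)
assert spliced.shape[-1] == rgb.shape[-1] + depth.shape[-1]
```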
  • Step 350 Perform weighted fusion processing on the image information in the spliced image to obtain the first visual feature.
  • the image information in the spliced image includes single-channel image information of the depth map and at least one channel image information of the visual image.
  • the computer device can perform weighted fusion processing on the image information in the spliced image through a weighted fusion network to obtain the first visual feature.
  • the weighted fusion network can be implemented as stacked convolutional layers, and the weighted fusion network can adaptively perform weighted fusion of image information in the spliced image to obtain the first visual feature.
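  • The weighted fusion network can be sketched as the following stack of convolution layers, assuming PyTorch and NCHW tensors; the number of layers and channel widths are illustrative and would be chosen according to the data and task.
```python
import torch.nn as nn

class WeightedFusionNet(nn.Module):
    """Adaptively fuse the depth and visual channels of the spliced image."""
    def __init__(self, in_ch=4, out_ch=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, spliced_image):
        # spliced_image: (B, 4, H, W) = 3 visual channels + 1 depth channel
        return self.layers(spliced_image)   # the first visual feature
```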
  • Step 360 Perform a second weighted fusion process on the first radar feature and the first visual feature of the radar data to obtain a fusion feature.
  • In the embodiment of the present application, weighted fusion of radar features and visual features can only be achieved when their spatial dimensions are the same. Therefore, since the fusion feature is obtained by weighted fusion of the first radar feature and the first visual feature, before performing the second weighted fusion process the method also includes: adjusting the features so that the first radar feature and the first visual feature have the same spatial dimensions, and then performing the second weighted fusion on the spatially aligned first radar feature and first visual feature to obtain the fusion feature.
  • In the embodiment of the present application, the processing of the spliced image by the weighted fusion network may involve downsampling, so the spatial dimensions of the obtained first visual feature may differ from those of the visual image. Consequently, spatially aligned first radar and first visual features can be obtained in different ways in different situations; the process can be implemented as follows:
  • If the spatial dimensions of the first visual feature are the same as those of the visual image, mapping the first radar feature to the image plane according to the projection transformation matrix directly yields a first radar feature with the same spatial dimensions as the first visual feature. If the spatial dimensions of the first visual feature differ from those of the visual image, the first radar feature needs to be further adjusted so that its spatial dimensions match those of the first visual feature; that is, after mapping the first radar feature to the image plane according to the projection transformation matrix, the spatial dimensions of the mapped first radar feature are adjusted to be the same as those of the first visual feature, thereby obtaining radar features and visual features that correspond in spatial location.
  • In the embodiment of the present application, when the spatial dimensions of the first visual feature differ from those of the visual image, the first visual feature can instead be adjusted so that its spatial dimensions match those of the first radar feature, thereby obtaining radar features and visual features that correspond in spatial location.
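  • One possible way to align the spatial dimensions, assuming PyTorch; bilinear interpolation is an illustrative choice of resizing operation, not mandated by the text.
```python
import torch.nn.functional as F

def align_spatial(radar_feat, visual_feat):
    # radar_feat, visual_feat: (B, C, H, W) tensors whose H and W may differ
    if radar_feat.shape[-2:] != visual_feat.shape[-2:]:
        radar_feat = F.interpolate(radar_feat, size=visual_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
    return radar_feat, visual_feat
```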
  • In the embodiment of the present application, the computer device can sequentially perform weighting on the first radar feature, weighting on the first visual feature, and weighted fusion of the multi-modal features. That is, the process of obtaining the fusion feature can be implemented as: weighting the first radar feature based on the detection distance indicated by the depth map to obtain the second radar feature; weighting the first visual feature based on the second radar feature to obtain the second visual feature; and performing weighted fusion on the second radar feature and the second visual feature to obtain the fusion feature.
  • In the embodiment of the present application, the spatial dimensions of the radar and visual features need to be unified before weighting; that is, the spatial dimensions of the depth map and the first radar feature must be consistent, and the spatial dimensions of the second radar feature and the first visual feature must be consistent. Therefore, the process can be implemented as: mapping the first radar feature to the image plane according to the projection transformation matrix to obtain the mapped first radar feature, which has the same spatial dimensions as the depth map and can therefore be weighted based on the detection distance indicated by the depth map to obtain the second radar feature. If the spatial dimensions of the first visual feature are the same as those of the visual image, the spatial dimensions of the second radar feature match those of the first visual feature, so the first visual feature can be weighted directly based on the second radar feature to obtain the second visual feature.
  • If the spatial dimensions of the first visual feature differ from those of the visual image, the spatial dimensions of the second radar feature will differ from those of the first visual feature. In this case, the spatial dimensions of the first visual feature can be adjusted to match those of the second radar feature, and the adjusted first visual feature is then weighted based on the second radar feature to obtain the second visual feature.
  • For brevity, the first radar feature referred to in the following embodiments defaults to the mapped first radar feature, and the second radar feature defaults to the adjusted second radar feature; that is, the first radar feature has the same spatial dimensions as the depth map, the second radar feature has the same spatial dimensions as the first visual feature, and no distinction is made between before mapping, after mapping, and after adjustment.
  • In the embodiment of the present application, the weight used when weighting the first radar feature is determined based on the detection distance indicated by the depth map, where the detection distance indicates how far away a target object is in three-dimensional space.
  • The depth map of the radar data directly reflects the distance of target objects in the environment from the observation point (the radar device) and also contains object contour information in the region of interest. Therefore, the computer device can use the depth map of the radar data as prior information for the first radar feature and weight the first radar feature with this prior information to obtain the second radar feature.
  • In the embodiment of the present application, the detection distance is inversely correlated with the weight applied to the first radar feature; that is, the weight assigned to feature data in the first radar feature is inversely related to the detection distance corresponding to that feature data. Schematically, the computer device can assign a smaller weight to feature data corresponding to regions with a larger detection distance, and a larger weight to feature data corresponding to regions with a smaller detection distance.
  • In the embodiment of the present application, the computer device can input the depth map into a weight assignment network to obtain the weights for the first radar feature output by the weight assignment network; the weight assignment network can make full use of the distance information and object contour information in the depth map, thereby improving the accuracy of the assigned weights.
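  • A sketch of such a weight assignment network, assuming PyTorch; the architecture below (two convolution layers followed by a Sigmoid) is an illustrative stand-in, since the exact structure of the network is not specified here.
```python
import torch
import torch.nn as nn

class WeightAssignNet(nn.Module):
    """Turn the depth map into per-location weights for the first radar feature."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, depth_map):
        # depth_map: (B, 1, H, W); Sigmoid keeps the weights in (0, 1), and training
        # can realize the inverse relation between detection distance and weight
        return torch.sigmoid(self.conv(depth_map))

# Illustrative usage: second_radar_feat = WeightAssignNet()(depth_map) * first_radar_feat
```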
  • The embodiment of the present application provides a way of weighting the first visual feature in which the radar features serve as an aid and are combined with the characteristics of the first visual feature itself to jointly control the weighting.
  • In the embodiment of the present application, the computer device can add the second radar feature and the first visual feature, process the added output (the added feature) with a weight acquisition network to obtain a weight for each area of the visual image, and multiply these weights with the corresponding positions of the first visual feature to weight the first visual feature and obtain the second visual feature.
  • In the embodiment of the present application, weighting the first visual feature can be implemented as follows: add the second radar feature and the first visual feature to obtain the added feature, where the number of channels of the second radar feature is the same as that of the first visual feature and their spatial dimensions are also the same; input the added feature into the weight acquisition network to obtain the weight of each area in the visual image; and weight the first visual feature of each area according to the weight of that area to obtain the second visual feature.
  • In the embodiment of the present application, the visual image may contain a target area, which is an area of the visual image whose image quality is lower than an image quality threshold; the feature value of the target area in the second visual feature is smaller than its feature value in the first visual feature.
  • In other words, target areas with lower image quality in the visual image are given lower weights, so as to reduce the influence of visual features from areas with poorer image quality.
  • In the embodiment of the present application, the added feature may be obtained by adding the second radar feature and the first visual feature channel by channel; therefore, it is necessary to ensure that the numbers of channels and the spatial dimensions of the second radar feature and the first visual feature are consistent. When their spatial dimensions are consistent, if their numbers of channels are also consistent, the addition can be performed directly to obtain the added feature; if their numbers of channels are inconsistent, a convolutional neural network can be used to process the second radar feature and the first visual feature separately before the addition, so that they have the same number of channels.
  • The added feature obtained by adding the second radar feature and the first visual feature then has the same number of channels as the second radar feature and the first visual feature.
  • each region in the visual feature can correspond to the region where the voxels of the radar data are mapped onto the image plane.
  • the weight acquisition network is used to extract the weight of the additive features and obtain the weighted weights of each area in the visual image.
  • Radar data better preserves the three-dimensional geometric information of targets in the environment, while visual images better preserve their visual information; different perception tasks also place different emphasis on the features from different sensors. Therefore, when performing multi-modal feature fusion, the multi-modal features can be weighted and fused according to the requirements of the perception task to obtain fusion features adapted to those requirements.
  • In the embodiment of the present application, the process of obtaining the fusion feature can be implemented as follows: splice the second radar feature and the second visual feature to obtain the spliced feature, which has at least two channels; perform global average pooling on the spliced feature to obtain the pooled feature; input the pooled feature into fully connected layers for processing to obtain the processed feature; perform a nonlinear transformation on the processed feature to obtain the weight of each channel of the spliced feature; and weight the spliced feature according to the weight of each of its channels to obtain the fusion feature.
  • Splicing the second radar feature and the second visual feature to obtain the spliced feature may mean concatenating them along the channel dimension, so that the number of channels of the spliced feature is the sum of the numbers of channels of the second radar feature and the second visual feature.
  • Illustratively, if the second radar feature and the second visual feature each have 64 channels, the resulting spliced feature has 128 channels.
  • In the embodiment of the present application, the above process of obtaining the channel weights of the spliced feature can be implemented as follows: input the spliced feature into a weight extraction network to obtain the weight of each channel of the spliced feature output by the weight extraction network. The weight extraction network may include a pooling layer, fully connected layers, and a nonlinear transformation layer. On this basis, the processing of the spliced feature by the weight extraction network can be implemented as: perform global average pooling on the spliced feature through the pooling layer to obtain the pooled feature; process the pooled feature through the fully connected layers; and input the processed feature obtained after the multiple fully connected layers into the nonlinear transformation layer, which applies a nonlinear transformation to obtain the weight of each channel of the spliced feature.
  • the weight extraction network is trained based on perceptual task requirements. This weight extraction network can adjust the weights of features from different sensors differently based on different sensing tasks to improve the completion of the corresponding sensing tasks.
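  • The pooling / fully connected / nonlinear transformation pipeline described above can be sketched as follows, assuming PyTorch; the reduction ratio and layer sizes are illustrative.
```python
import torch
import torch.nn as nn

class WeightExtractionNet(nn.Module):
    """Produce per-channel weights for the spliced feature and apply them."""
    def __init__(self, channels=128, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, spliced_feat):
        # spliced_feat: (B, C, H, W), C = radar channels + visual channels
        b, c, _, _ = spliced_feat.shape
        pooled = self.pool(spliced_feat).view(b, c)            # (B, C)
        weights = torch.sigmoid(self.fc(pooled))               # nonlinear transformation
        return spliced_feat * weights.view(b, c, 1, 1)         # weighted fusion feature

# Illustrative usage:
# fusion = WeightExtractionNet(128)(torch.cat([second_radar, second_visual], dim=1))
```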
  • Figure 4 shows a schematic diagram of the acquisition process of fusion features provided by an exemplary embodiment of the present application.
  • As shown in Figure 4, the computer device acquires the first visual feature 410, the first radar feature 420, and the depth map 430 of the radar data.
  • The depth map 430 can be used as prior information for the first radar feature 420, which is weighted to obtain the second radar feature 440. The second radar feature 440 is then used as an auxiliary feature to weight the first visual feature 410, obtaining the second visual feature 450. Finally, the second radar feature 440 and the second visual feature 450 are weighted and fused, that is, multi-modal weighted feature fusion is performed, to obtain the fusion feature 460.
  • Step 370 Perform environment perception based on the fusion features to obtain environmental information in the driving environment.
  • the environmental information may include at least one of position information and quantity information of the target object in the driving environment.
  • the computer device can input the fused features into the perception network for processing to realize environmental information perception.
  • the perception network can be a network structure designed based on different perception tasks.
  • In the embodiment of the present application, the perception network is used to process the fusion feature and output detection results to complete the corresponding perception task, where the perception tasks include, but are not limited to, target detection and semantic segmentation.
  • the perception task can be implemented as the perception of lane line information, the perception of signal light information, the perception of the number and location of pedestrians on the road, etc. This application does not limit this.
  • the computer device can realize the perception of the driving environment through an environment perception model.
  • In the embodiment of the present application, the environment perception model can include the network structures involved in the above embodiments, such as the feature encoding network, the weighted fusion network, the weight assignment network, the weight extraction network, the weight acquisition network, and the perception network. After receiving the radar data and visual images of the driving environment, this environment perception model processes them to complete the corresponding perception task.
  • Figure 5 shows a schematic diagram of an environment perception model implemented in an exemplary embodiment of the present application.
  • As shown in Figure 5, the environment perception model 500 includes: a feature encoding network 510, a weighted fusion network 520, a weight assignment network 530, a weight extraction network 540, a weight acquisition network 550, and a perception network 560.
  • Schematically, after the environment perception model 500 acquires the radar data and visual images of the driving environment, the radar data is spatially divided and the voxel-level radar data (i.e., the sub-radar data) is input into the feature encoding network 510 to obtain the sub-radar feature of each voxel output by the feature encoding network 510; the set of sub-radar features is taken as the first radar feature. At the same time, the radar data is plane-mapped to obtain the depth map, the depth map and the visual image are spliced along the channel dimension to obtain the spliced image, and the spliced image is input into the weighted fusion network 520 to obtain the first visual feature output by the weighted fusion network 520; this first visual feature represents both depth information and visual information.
  • To weight the first radar feature, the depth map is input into the weight assignment network 530 to obtain the weights for the first radar feature output by the weight assignment network 530, and the first radar feature is weighted by these weights to obtain the second radar feature. To weight the first visual feature, the second radar feature and the first visual feature are first added channel by channel to obtain the added feature, which is then input into the weight acquisition network 550 to obtain the weight of each area of the visual image output by the weight acquisition network 550; the first visual feature is weighted by these weights to obtain the second visual feature.
  • After the second radar feature and the second visual feature are obtained, they are spliced along the channel dimension to obtain the spliced feature, which is input into the weight extraction network 540 to obtain the weight of each channel output by the weight extraction network 540, so as to realize feature fusion between the modalities and obtain the fusion feature, which the perception network 560 then processes to complete the perception task.
  • In the embodiment of the present application, the environment perception model can be a machine learning model trained end to end; that is, the environment perception model is trained on sample visual images, sample radar data, and perception result labels.
  • The perception result labels can be set according to different perception tasks; schematically, when the perception task is target detection, the labels can be the location labels and category labels of target objects in the environment.
  • The trained environment perception model is then used to detect the locations and categories of target objects in the driving environment; in other words, the environmental information of the driving environment consists of the location information and category information of those target objects.
  • During training, the network parameters of each network included in the environment perception model are adjusted, thereby adjusting the weights applied to the multi-modal features, so that the trained environment perception model can better complete the corresponding perception task.
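  • End-to-end training on (sample visual image, sample radar data, perception label) triples can be sketched as follows, assuming PyTorch; the model, loss, and data loader are placeholders, not the actual networks or labels of the embodiments.
```python
import torch

def train(model, data_loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()   # e.g. category labels for target detection
    for _ in range(epochs):
        for visual_img, radar_data, label in data_loader:
            pred = model(visual_img, radar_data)   # perception output from fused features
            loss = criterion(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()   # jointly adjusts the parameters of all sub-networks
```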
  • To sum up, in the environmental information sensing method provided by the embodiment of the present application, after acquiring the radar data and visual images in the driving environment, the depth map of the radar data is determined; a first weighted fusion is performed on the depth map and the visual image to obtain the first visual feature; a second weighted fusion is then performed on the first radar feature and the first visual feature to obtain the fusion feature; and environmental perception is performed based on the fusion feature to obtain the environmental information in the driving environment.
  • In the above method, through multi-level weighted fusion of visual images and radar data, the computer device achieves full utilization of, and information complementation between, the data collected by different sensors, thereby improving the accuracy and robustness of the perception task and, in turn, driving safety.
  • FIG. 6 shows a schematic diagram of an environmental information sensing module provided by an exemplary embodiment of the present application.
  • the environmental information sensing module can be applied to a computer device (such as the environmental information sensing device 130 in the environmental information sensing system).
  • As shown in Figure 6, the environmental information sensing module may include a projection transformation component 610, a first weighted fusion component 620, a radar data encoding component 630, a second weighted fusion component 640, and a perceptual network component 650.
  • The projection transformation component 610 is used to project the radar data onto the image plane according to the projection transformation matrix and to linearly scale the horizontal distance between each point and the radar device (mounted on, for example, a vehicle) to obtain the pixel value of each point of the point cloud data, thereby generating a depth map in the form of a grayscale image based on the corresponding positions and pixel values of the point cloud data in the image plane.
  • The first weighted fusion component 620 is used to perform weighted fusion processing on the depth map of the radar data and the visual image to obtain visual features enhanced by the depth map, that is, the first visual feature.
  • The radar data encoding component 630 is used to divide the radar data into a number of voxels according to its spatial distribution and to feature-encode the radar data in each voxel through a neural network structure, obtaining an enhanced radar feature with spatial context information, that is, the first radar feature.
  • The second weighted fusion component 640 is used to weight the first radar feature and the first visual feature according to their respective characteristics, and then to perform multi-modal weighted fusion on the weighted radar feature (the second radar feature) and the weighted visual feature (the second visual feature), outputting the final fusion feature.
  • The perceptual network component 650 is used to take the final fusion feature as input, process it with a neural network structure, and perform perception tasks including, but not limited to, target detection and semantic segmentation.
  • Taking environmental perception of an autonomous driving environment in the field of autonomous driving as an example, the environmental information sensing method provided by the embodiments shown in Figure 2 and Figure 3 can be applied in the following scenarios. It should be noted that the scenario applications provided in this application are only illustrative, and this application does not limit the application scenarios of the environmental information sensing method.
  • The following embodiments illustrate how environmental information sensing can be realized based on the environmental information sensing module shown in Figure 6:
  • In a closed single scenario, the targets and obstacles that autonomous driving equipment needs to identify are relatively limited in variety, and they are usually large in size. Based on this characteristic, the accuracy requirements for the sensor equipment can be appropriately relaxed to save equipment deployment costs.
  • In addition, high-precision lidar is relatively expensive, so lidar with a lower line count or 4D millimeter-wave radar can be used instead of high-precision lidar for radar data collection, combined with the visual images collected by an optical camera for environmental perception.
  • Figure 7 shows a schematic diagram of the implementation process of the environmental information sensing method in a closed single scenario provided by an exemplary embodiment of the present application.
  • the implementation process of the environmental information sensing method can be implemented as:
  • S701 Obtain the radar data, visual image and projection transformation matrix that have been parameter calibrated and time synchronized.
  • the radar data and visual images are obtained by the autonomous driving equipment in the process of autonomous driving through its own radar equipment and image acquisition equipment.
  • The point cloud data (i.e., the radar data) and the pixels in the visual image can be spatially mapped through the projection transformation matrix P.
  • The projection transformation component can project the radar data onto the image plane according to the projection transformation matrix P to obtain the corresponding position of each point of the point cloud data in the image plane, calculate the horizontal distance between the collection position of each point and the radar device, and use this horizontal distance as the pixel value of the depth map.
  • The range of the pixel values is linearly scaled so that it falls within 0 to 255, and the scaled values are used as the pixel values of the depth map; the depth map is then generated from the corresponding positions and pixel values of the point cloud data in the image plane.
  • In the embodiment of the present application, the projection transformation component can use traditional image processing methods, including image dilation, morphological closing operations, median filtering, and Gaussian filtering, to complete the sparse depth map and form a dense depth map.
  • The parameters of these traditional image processing methods, such as the size and shape of the kernels used, can be adjusted according to actual conditions and are not limited in this application.
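  • The classical depth-completion step can be sketched as follows, assuming OpenCV; the kernel size and shape are illustrative and would be tuned in practice.
```python
import cv2

def complete_depth(sparse_depth):
    """Densify a sparse grayscale depth map with classical image processing."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    dense = cv2.dilate(sparse_depth, kernel)                    # image dilation
    dense = cv2.morphologyEx(dense, cv2.MORPH_CLOSE, kernel)    # morphological closing
    dense = cv2.medianBlur(dense, 5)                            # median filtering
    dense = cv2.GaussianBlur(dense, (5, 5), 0)                  # Gaussian filtering
    return dense
```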
  • The first weighted fusion component first splices the visual image and the depth map along the channel dimension to obtain the spliced image, and then uses stacked convolution layers (i.e., the weighted fusion network) to process the spliced image, adaptively weighting and fusing the depth information and image information to obtain enhanced visual features (i.e., the first visual feature).
  • S704 Input the radar data into the radar data encoding component for feature encoding.
  • In the embodiment of the present application, the radar data is distributed in a space with length, width, and height of L × W × H.
  • The space is evenly divided into several cuboids at a granularity of V_L × V_W × V_H. Each cuboid is called a voxel, and each voxel contains several point cloud data points or none at all.
  • In the embodiment of the present application, the radar data encoding component can be built based on the VoxelNet (voxel network) structure.
  • The radar data encoding component can use a network of multiple stacked fully connected layers to encode the non-empty voxels, obtaining the sub-radar features, which contain the spatial context information of the non-empty voxels; the set of sub-radar features constitutes the first radar feature.
  • S705 Input the depth map, the first visual feature and the first radar feature into the second weighted fusion component.
  • In the embodiment of the present application, the first radar feature is first mapped to the image plane according to the projection transformation matrix.
  • This mapping can be performed in units of voxels, with each voxel corresponding to an area of the image plane; in this embodiment, the spatial dimensions of the mapped first radar feature are the same as those of the first visual feature.
  • The first visual feature and the first radar feature are then each weighted according to their own characteristics. Schematically, the depth map is used as prior information to weight the first radar feature and obtain the second radar feature; with the second radar feature as an aid, the first visual feature is weighted based on its own characteristics to obtain the second visual feature. Finally, weighted fusion of the multi-modal data is performed, that is, the second visual feature and the second radar feature are weighted and fused to obtain the fusion feature.
  • Figure 8 shows a schematic diagram of a fusion feature acquisition process provided by an exemplary embodiment of the present application. Part A in Figure 8 shows an implementation of radar feature weighting provided by an embodiment of the present application. As shown in part A of Figure 8, when weighting the first radar feature, the depth map is used as prior information and the pixel values of the depth map are normalized using the following formula (2):
    d_norm = (d - d_min) / (d_max - d_min)        (2)
  • where d is a pixel value of the depth map, d_min is the minimum value among the depth map pixel values, and d_max is the maximum value among the depth map pixel values. The weighting weight is then taken as d_w = 1 - d_norm, which gives the weighting weight of each region of the first radar feature.
  • Further, to avoid over-emphasizing nearby point cloud data once the depth map is introduced, a re-weighting function (Re-weighted Function), formula (3), is used to process the weighting weight d_w and obtain the adjusted weighting weight d_t.
  • The above re-weighting function can reduce the differences between the weighting weights of the point cloud data while appropriately retaining the point cloud information at medium and long distances. The weighting weight obtained by the re-weighting function is multiplied, pixel by pixel in the spatial dimensions, with the first radar feature (here the first radar feature may refer to the mapped first radar feature obtained by projecting it onto the image plane according to the projection transformation matrix), so that the more distant and noisier points in the radar data are suppressed, yielding the second radar feature.
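A minimal sketch of the radar feature weighting of part A in Figure 8. The min-max normalisation follows formula (2) with d_w = 1 - d_norm; since the exact form of the re-weighting function in formula (3) is not reproduced here, a square root is used purely as a placeholder assumption with a comparable flattening effect.

    import torch

    def weight_radar_feature(first_radar_feature, depth_map, eps=1e-6):
        # first_radar_feature: (B, C, H, W), already mapped to the image plane
        # depth_map: (B, 1, H, W)
        d_min = depth_map.amin(dim=(2, 3), keepdim=True)
        d_max = depth_map.amax(dim=(2, 3), keepdim=True)
        d_norm = (depth_map - d_min) / (d_max - d_min + eps)    # formula (2)
        d_w = 1.0 - d_norm                                      # nearer points get larger weights
        d_t = torch.sqrt(d_w)                                   # placeholder for re-weighting function (3)
        return first_radar_feature * d_t                        # suppress far / noisy points -> second radar feature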
  • Part B in Figure 8 shows an implementation of visual feature weighting provided by the embodiment of the present application.
  • As shown in part B of Figure 8, the process of visual feature weighting includes: 1) using a convolutional layer with a 1 × 1 kernel to process the second radar feature and the first visual feature respectively so that they have the same number of channels; 2) adding the first visual feature and the second radar feature channel by channel to obtain the added feature, processing the added feature with a nonlinear activation function (schematically, a ReLU function), processing the activated output with a 3 × 3 convolutional layer, and activating the output of that convolutional layer with a further nonlinear activation function (schematically, a Sigmoid function) to obtain the weighting weight of each region in the visual image; 3) multiplying the weighting weight of each region with the corresponding position of the first visual feature to complete the weighting of the first visual feature and obtain the second visual feature, in which the second visual feature value of the target area is smaller than the first visual feature value. The target area is the area of the visual image whose image quality is lower than the image quality threshold, so the influence on the perception task of areas of the visual image affected by strong light and shadow is suppressed.
  • The image quality can be a score of the image within the target area, and the image quality threshold can be set adaptively by the network based on the actual situation, which is not limited in this application.
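A hedged sketch of the visual feature weighting of part B in Figure 8: 1 × 1 convolutions align the channel counts, the features are added, passed through ReLU, a 3 × 3 convolution and a Sigmoid to obtain per-region weights that multiply the first visual feature. The channel widths and the single-channel weight map are assumptions for illustration.

    import torch
    import torch.nn as nn

    class VisualFeatureWeighting(nn.Module):
        def __init__(self, radar_channels=64, visual_channels=64, mid_channels=64):
            super().__init__()
            self.align_radar = nn.Conv2d(radar_channels, mid_channels, kernel_size=1)
            self.align_visual = nn.Conv2d(visual_channels, mid_channels, kernel_size=1)
            self.relu = nn.ReLU(inplace=True)
            self.conv3x3 = nn.Conv2d(mid_channels, 1, kernel_size=3, padding=1)
            self.sigmoid = nn.Sigmoid()

        def forward(self, second_radar_feature, first_visual_feature):
            added = self.align_radar(second_radar_feature) + self.align_visual(first_visual_feature)
            weights = self.sigmoid(self.conv3x3(self.relu(added)))  # per-region weighting weights
            return first_visual_feature * weights                   # second visual feature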
  • Part C in Figure 8 shows an implementation of the weighted fusion of visual radar features provided by the embodiment of the present application.
  • As shown in part C of Figure 8, the second radar feature and the second visual feature are first concatenated along the channel dimension to obtain the concatenated feature; a weight extraction network can then be used to learn the weights of the concatenated feature. Schematically, the weight extraction network can be an SENet (Squeeze-and-Excitation Network). In this case, the weight extraction network first performs global average pooling on the concatenated feature and then processes the pooled feature with multiple fully connected layers and nonlinear activation functions to obtain a weighting weight for each channel of the concatenated feature; the weighting weights are multiplied channel by channel with the concatenated feature to obtain the fusion feature.
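A sketch of the multi-modal weighted fusion of part C in Figure 8 using an SENet-style squeeze-and-excitation block over the concatenated feature; the reduction ratio and channel counts are assumptions rather than values given in the application.

    import torch
    import torch.nn as nn

    class ChannelWeightedFusion(nn.Module):
        def __init__(self, radar_channels=64, visual_channels=64, reduction=4):
            super().__init__()
            channels = radar_channels + visual_channels
            self.excite = nn.Sequential(                 # weight extraction network
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, second_radar_feature, second_visual_feature):
            concatenated = torch.cat([second_radar_feature, second_visual_feature], dim=1)
            pooled = concatenated.mean(dim=(2, 3))                    # global average pooling
            channel_weights = self.excite(pooled)                     # one weight per channel
            return concatenated * channel_weights[:, :, None, None]   # fusion feature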
  • S706 Input the fused features into the perception task network to complete the corresponding perception task.
  • the parameters of each network involved in the above steps can be obtained through a supervised end-to-end training process based on the perception task.
  • Schematically, when the perception task is a target detection task, the training set for the model composed of the above networks can include sample radar data, sample visual images and target object labels; when the trained model is applied to the above scenario, the computer device can determine the target objects in the autonomous driving environment based on the radar data and visual images acquired from that environment.
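The supervised end-to-end training mentioned above could look like the following sketch, in which a model composed of the components of S702-S706 is trained on sample radar data, sample visual images and target object labels; the model and loss interfaces shown here are assumptions for illustration only.

    import torch

    def train_one_epoch(model, loader, loss_fn, optimizer, device="cuda"):
        model.train()
        for radar_points, visual_image, depth_map, labels in loader:
            visual_image, depth_map = visual_image.to(device), depth_map.to(device)
            predictions = model(radar_points, visual_image, depth_map)  # fused perception output
            loss = loss_fn(predictions, labels.to(device))              # perception-task loss
            optimizer.zero_grad()
            loss.backward()                                             # gradients flow end-to-end
            optimizer.step()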
  • Figure 9 shows a schematic diagram of the implementation process of the environmental information sensing method in an open complex scenario provided by an exemplary embodiment of the present application.
  • the implementation process of the environmental information sensing method can be implemented as:
  • S901 Obtain the radar data, visual image and projection transformation matrix that have been parameter calibrated and time synchronized.
  • S902 Input the radar data into the projection transformation component to obtain a depth map of the radar data.
  • The projection transformation component can project the radar data onto the image plane according to the projection transformation matrix P to obtain the corresponding position of the point cloud data in the image plane, and calculate the horizontal distance between the collection position of the point cloud data and the radar device. This horizontal distance is used as the pixel value of the depth map and is linearly scaled into the range 0-255; the scaled values are taken as the pixel values of the depth map, and the depth map is then generated from the corresponding positions and pixel values of the point cloud data in the image plane. In this embodiment, because a high-precision lidar is used, depth completion does not need to be performed, which saves data processing cost.
  • S903 Input the depth map and the visual image into the first weighted fusion component to obtain the first visual feature.
  • S904 Input the radar data into the radar data encoding component for feature encoding to obtain the first radar feature.
  • S905 Input the depth map, the first visual feature and the first radar feature into the second weighted fusion component to obtain the fusion feature.
  • Figure 10 shows a schematic diagram of another fusion feature acquisition process provided by an exemplary embodiment of the present application; Part A in Figure 10 shows an implementation of radar feature weighting provided by an embodiment of the present application.
  • As shown in part A of Figure 10, in the embodiment of the present application, the process of weighting the first radar feature includes: 1) inputting the depth map into an Encoder-Decoder (encoder-decoder) network structure and obtaining the weighting weight output by the Encoder-Decoder network; this structure consists of multiple convolutional layers, nonlinear activation functions and up-/down-sampling layers. The Encoder structure is used to encode and abstract the image information in the depth map, which in this embodiment means down-sampling in the spatial dimensions and expanding the channel dimension; the Decoder structure is used to decode and restore the encoded information, which in this embodiment means restoring the abstracted features to their original dimensions. When the depth map is input into the Encoder-Decoder network, the corresponding weighting weights are obtained; this structure can make full use of the distance information and the contour information contained in the depth map, and the parameters of the Encoder-Decoder network can be obtained through supervised training. 2) The weighting weight is multiplied with the corresponding position of the first radar feature to complete the weighting and obtain the second radar feature.
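A hedged sketch of the Encoder-Decoder weighting network of part A in Figure 10: the encoder down-samples the depth map spatially while expanding the channels, the decoder restores the original dimensions, and the output is used as a per-pixel weighting weight. The network depth, channel widths and the Sigmoid output activation are assumptions.

    import torch
    import torch.nn as nn

    class DepthWeightEncoderDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(                           # spatial down-sampling, channel expansion
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            self.decoder = nn.Sequential(                           # restore the original dimensions
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, depth_map):
            return self.decoder(self.encoder(depth_map))            # per-pixel weighting weights

    # second_radar_feature = first_radar_feature * DepthWeightEncoderDecoder()(depth_map)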
  • Figure 8 and Figure 10 each provide a possible implementation of the feature encoding network; in application, one of them can be selected based on actual needs, and this application does not restrict the actual application scenarios of the two feature encoding networks.
  • Part B in Figure 10 shows an implementation of visual feature weighting provided by the embodiment of the present application, and part C of Figure 10 shows an implementation of weighted fusion of visual and radar features provided by the embodiment of the present application; for this process, reference may be made to the corresponding content of parts B and C in Figure 8.
  • S906 Input the fused features into the perception task network to complete the corresponding perception task.
  • the parameters of each network involved in the above steps can be obtained through a supervised end-to-end training process based on the perception task.
  • In summary, with the environmental information sensing method provided by the embodiments of the present application, after the radar data and visual images in the driving environment are acquired, the depth map of the radar data is determined; a first weighted fusion is performed on the depth map and the visual image to obtain the first visual feature; the first radar feature and the first visual feature are then fused in a second weighted fusion to obtain the fusion feature; finally, environmental perception is performed based on the fusion feature to obtain the environmental information in the driving environment. In the above method, the computer device achieves full utilization of, and information complementarity between, the data collected by different sensors through multi-level weighted fusion of the visual images and the radar data, thereby improving the accuracy and robustness of the perception task.
  • When the environment sensing method provided by the embodiments of the present application is applied to an autonomous driving scenario, the accuracy and robustness of perceiving the environmental information in the autonomous driving environment can be improved, thereby improving the safety of autonomous driving. At the same time, adaptively adjusting the environmental perception process according to differences in environmental complexity can further ensure the safety and perception efficiency of the autonomous driving system.
  • Figure 11 shows a block diagram of an environment information sensing device provided by an exemplary embodiment of the present application.
  • the environment information sensing device can be used to implement all or part of the steps of the embodiments shown in Figures 2 and 3.
  • The environmental information sensing device includes: a data acquisition module 1110, used to acquire radar data and visual images in the driving environment; a depth map determination module 1120, used to determine the depth map of the radar data; a first weighted fusion module 1130, used to perform a first weighted fusion process on the depth map and the visual image to obtain a first visual feature, wherein the first visual feature is used to characterize the depth information and visual information corresponding to the visual image;
  • a second weighted fusion module 1140, used to perform a second weighted fusion process on the first radar feature of the radar data and the first visual feature to obtain a fusion feature, where the first radar feature is used to characterize the spatial context information of the radar data in three-dimensional space; and an environment perception module 1150, used to perform environmental perception based on the fusion feature to obtain the environmental information in the driving environment.
  • In a possible implementation, the first weighted fusion module 1130 includes: an image concatenation sub-module, used to concatenate the single-channel image information of the depth map with at least one channel of image information of the visual image to obtain a concatenated image; and a first weighting sub-module, used to perform weighted fusion processing on the image information in the concatenated image to obtain the first visual feature.
  • In a possible implementation, the second weighted fusion module 1140 includes: a second weighting sub-module, used to perform weighting processing on the first radar feature based on the detection distance indicated by the depth map to obtain the second radar feature; a third weighting sub-module, used to perform weighting processing on the first visual feature based on the second radar feature to obtain the second visual feature; and a fourth weighting sub-module, used to perform weighted fusion processing on the second radar feature and the second visual feature to obtain the fusion feature.
  • the detection distance is negatively related to the weighting weight used to weight the first radar feature.
  • In a possible implementation, the third weighting sub-module is used to: add the second radar feature and the first visual feature to obtain an added feature, wherein the number of channels of the second radar feature is the same as the number of channels of the first visual feature, and the spatial dimensions of the second radar feature are the same as those of the first visual feature; input the added feature into the weight acquisition network to obtain the weighting weight of each region in the visual image; and weight the first visual feature of each region according to the weighting weight of that region to obtain the second visual feature.
  • In a possible implementation, the visual image contains a target area; the target area is an area of the visual image whose image quality is lower than an image quality threshold; the second visual feature value of the target area in the second visual feature is smaller than the first visual feature value of the target area in the first visual feature.
  • In a possible implementation, the fourth weighting sub-module is used to: concatenate the second radar feature and the second visual feature to obtain a concatenated feature, wherein the concatenated feature has at least two channels; perform global average pooling on the concatenated feature to obtain a pooled feature; input the pooled feature into fully connected layers for processing to obtain a processed feature; perform a nonlinear transformation on the processed feature to obtain the weighting weight of each channel of the concatenated feature; and weight the concatenated feature according to the weighting weight of each of its channels to obtain the fusion feature.
  • In a possible implementation, the radar data is point cloud data, and the device further includes a radar feature acquisition module; the radar feature acquisition module includes a space division sub-module, a feature encoding sub-module and a radar feature determination sub-module. The space division sub-module is used to divide the three-dimensional space of the point cloud data to obtain multiple voxels; the multiple voxels include at least one non-empty voxel, and each non-empty voxel contains at least one piece of the point cloud data. The feature encoding sub-module is used to encode at least one group of sub-radar data in units of voxels to obtain the sub-radar features corresponding to each group of sub-radar data; the sub-radar data includes the radar data in at least one of the non-empty voxels, the sub-radar features contain the spatial context information of a local three-dimensional space, and the local three-dimensional space is the three-dimensional space occupied by the non-empty voxel corresponding to the sub-radar feature. The radar feature determination sub-module is used to determine a feature set composed of at least one of the sub-radar features as the first radar feature.
  • In summary, with the environmental information sensing device provided by the embodiments of the present application, after the radar data and visual images in the driving environment are acquired, the depth map of the radar data is determined; a first weighted fusion is performed on the depth map and the visual image to obtain the first visual feature; the first radar feature and the first visual feature are then fused in a second weighted fusion to obtain the fusion feature; finally, environmental perception is performed based on the fusion feature to obtain the environmental information in the driving environment. In the above method, the computer equipment achieves full utilization of, and information complementarity between, the data collected by different sensors through multi-level weighted fusion of the visual images and the radar data, thereby improving the accuracy and robustness of the perception task and, in turn, the safety of autonomous driving.
  • FIG. 12 shows a structural block diagram of a computer device 1200 according to an exemplary embodiment of the present application.
  • the computer device can be implemented as the environmental information sensing device in the above solution of the present application.
  • The computer device 1200 includes a central processing unit (Central Processing Unit, CPU) 1201, a system memory 1204 including a random access memory (Random Access Memory, RAM) 1202 and a read-only memory (Read-Only Memory, ROM) 1203, and a system bus 1205 that connects the system memory 1204 and the central processing unit 1201.
  • the computer device 1200 also includes a mass storage device 1206 for storing an operating system 1209, clients 1210, and other program modules 1211.
  • Computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer-readable storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer-readable storage medium is not limited to the above; the system memory 1204 and the mass storage device 1206 described above may be collectively referred to as the memory.
  • According to various embodiments of the present application, the computer device 1200 may also run by being connected, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1200 can be connected to the network 1208 through a network interface unit 1207 connected to the system bus 1205; alternatively, the network interface unit 1207 can be used to connect to other types of networks or remote computer systems (not shown).
  • The memory also includes at least one instruction, at least one program, a code set or an instruction set, which is stored in the memory; the central processing unit 1201 executes the at least one instruction, at least one program, code set or instruction set to implement all or part of the steps of the environmental information sensing method shown in the above embodiments.
  • In an exemplary embodiment, a computer-readable storage medium is also provided; at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement all or part of the steps of the above environmental information sensing method. For example, the computer-readable storage medium can be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • In an exemplary embodiment, a computer program product is also provided; the computer program product includes at least one computer program, and the computer program is loaded by a processor to execute all or part of the steps of the environmental information sensing method shown in any one of the embodiments of Figure 2, Figure 3, Figure 7 or Figure 9.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The present application relates to an environmental information sensing method, apparatus and system, a computer device and a storage medium. The method includes: acquiring radar data and a visual image in a driving environment; determining a depth map of the radar data; performing first weighted fusion processing on the depth map and the visual image to obtain a first visual feature, the first visual feature being used to characterize depth information and visual information corresponding to the visual image; performing second weighted fusion processing on a first radar feature of the radar data and the first visual feature to obtain a fusion feature, the first radar feature being used to characterize spatial context information of the radar data in three-dimensional space; and performing environmental perception based on the fusion feature to obtain environmental information in the driving environment.

Description

环境信息感知方法、装置、系统、计算机设备及存储介质
交叉引用
本申请要求在2022年09月02日提交中国专利局、申请号为202211072265.3、发明名称为“环境信息感知方法、装置、系统、计算机设备及存储介质”的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机技术领域,特别涉及一种环境信息感知方法、装置、系统、计算机设备及存储介质。
背景技术
自动驾驶技术,旨在使车辆具备自主判断、自主控制、自动行驶的能力,具有广阔的发展前景。自动驾驶系统由环境感知、规划决策、控制三个子系统组成;其中,感知系统可以对车辆的周围环境进行理解,识别定位目标和障碍物,是自动驾驶中重要的组成部分。
在感知系统中,通常通过不同传感器实现对周围环境的感知。传感器包括光学相机、激光雷达、四维(4-dimentional,4D)毫米波雷达等。然而,上述传感器中,光学相机获取的图像为纯视觉数据,缺乏环境中的目标物的深度信息,且光学相机的成像质量容易受到外界环境的影响,成像质量不稳定。雷达可以更好的获取环境的三维空间信息,但无法获取环境中的目标物的纹理、颜色信息,且探测数据会受到探测距离的影响。因此,基于纯视觉数据(在本申请中称为视觉图像)进行环境感知,或者基于雷达数据进行环境感知均存在不稳定性,导致环境信息感知效果差,自动驾驶安全隐患高。
发明内容
本申请实施例提供了一种环境信息感知方法、装置、系统、计算机设备及存储介质,可以提高环境感知的稳定性和准确性,提高对环境信息的感知效果。该技术方案如下:
一方面,提供了一种环境信息感知方法,所述方法包括:获取驾驶环境 中的雷达数据和视觉图像;确定所述雷达数据的深度图;对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视觉信息;对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
另一方面,提供了一种环境信息感知装置,所述装置包括:数据获取模块,用于获取驾驶环境中的雷达数据和视觉图像;深度图确定模块,用于确定所述雷达数据的深度图;第一加权融合模块,用于对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视觉信息;第二加权融合模块,用于对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;环境感知模块,用于基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
另一方面,提供了一种环境信息感知系统,所述系统包括图像采集设备、雷达设备以及环境感知设备;所述图像采集设备,用于采集驾驶环境中的视觉图像,并将所述视觉图像发送给所述环境感知设备;所述雷达设备,用于采集所述驾驶环境中的雷达数据,并将所述雷达数据发送给所述环境感知设备;所述环境感知设备,用于接收所述图像采集设备发送的所述视觉图像和所述雷达设备发送的所述雷达数据;确定所述雷达数据的深度图;对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视觉信息;对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
另一方面,提供了一种计算机设备,所述计算机设备包含处理器和存储器,所述存储器存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行以实现上述的环境信息感知方法。
另一方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述计算机程序由处理器加载并执行以实现上述的环境信息感知方法。
另一方面,提供了一种计算机程序产品,所述计算机程序产品包括至少一条计算机程序,所述计算机程序由处理器加载并执行以实现上述各种可选实现方式中提供的环境信息感知方法。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本申请。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。
图1示出了本申请一示例性实施例提供的环境信息感知方法对应的环境信息感知系统的示意图;
图2示出了本申请一示例性实施例示出的一种环境信息感知方法的流程图;
图3示出了本申请一示例性实施例示出的另一种环境信息感知方法的流程图;
图4示出了本申请一示例性实施例提供的融合特征的获取过程的示意图;
图5示出了本申请一示例性实施例实处的环境感知模型的示意图;
图6示出了本申请一示例性实施例提供的环境信息感知模组的示意图;
图7示出了本申请一示例性实施例提供的封闭单一场景下的环境信息感知方法实施过程的示意图;
图8示出了本申请一示例性实施例提供的一种融合特征获取过程的示意图;
图9示出了本申请一示例性实施例提供的开放复杂场景下的环境信息感知方法的实施过程的示意图;
图10示出了本申请一示例性实施例提供的另一种融合特征获取过程的示意图;
图11示出了本申请一示例性实施例提供的环境信息感知装置的方框图;
图12是根据一示例性实施例示出的计算机设备的结构框图。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
图1示出了本申请一示例性实施例提供的环境信息感知方法对应的环境信息感知系统的示意图。该系统可以包括图像采集设备110,雷达设备120以及环境信息感知设备130。
该图像采集设备110与雷达设备120可以是装设在目标设备上的传感器;示意性的,该目标设备可以为驾驶场景下的车辆、船只等交通工具。其中,该图像采集设备110是具有图像采集功能的设备,比如光学相机、摄像头以及具有图像采集功能的终端等等,该图像采集设备110用于采集驾驶环境中的视觉图像,并将视觉图像发送给环境感知设备130;该雷达设备120可以是具有向周围环境发射波束并接收反射回波形成雷达数据的设备,该雷达设备120用于采集驾驶环境中的雷达数据,并将雷达数据发送给环境感知设备130;示意性的,该雷达设备120可以是激光雷达以及4D毫米波雷达等等,该雷达数据可以是点云数据。
该环境信息感知设备130可以通过有线网络或者无线网络分别与图像采集设备110和雷达设备120进行通讯,用于接收图像采集设备110发送的视觉图像,以及接收雷达设备120发送的雷达数据,并对接收到的视觉图像和雷达数据进行特征提取和多级加权融合操作后得到融合特征,之后基于融合特征实现对驾驶环境的环境信息感知,该过程包括:接收图像采集设备发送 的视觉图像和雷达设备发送的雷达数据;确定雷达数据的深度图;对深度图与视觉图像进行第一加权融合处理,得到第一视觉特征,其中,该第一视觉特征用于表征视觉图像对应的深度信息以及视觉信息;对雷达数据的第一雷达特征以及第一视觉特征进行第二加权融合处理,得到融合特征,其中,该第一雷达特征用于表征雷达数据在三维空间中的空间上下文信息;基于融合特征进行环境感知,得到驾驶环境中的环境信息。
在本申请实施例中,该环境信息感知设备130可以实现为终端或者服务器,当该环境信息感知设备130为服务器时,该环境信息感知设备130可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器;当该环境信息感知设备130为终端时,该环境信息感知设备130可以实现为智能手机、平板电脑、笔记本电脑、台式计算机、车载终端等等,本申请对环境信息感知设备130的实现形式不进行限制。
图2示出了本申请一示例性实施例示出的一种环境信息感知方法的流程图,该环境信息感知方法可以由计算机设备执行,该计算机设备可以实现为如图1所示的环境信息感知设备130,如图2所示,该环境信息感知方法可以包括以下步骤:
步骤210,获取驾驶环境中的雷达数据和视觉图像。
该驾驶环境可以是自动驾驶场景中的驾驶环境,也可以是辅助驾驶场景中的驾驶环境,本申请对此不进行限制。
其中,雷达数据是通过雷达设备采集的,视觉图像是通过图像采集设备采集的;在驾驶场景中,计算机设备可以接收与计算机设备建立通信关系的图像采集设备以及雷达设备分别实时上传的视觉图像以及雷达数据,从而得到视觉图像以及雷达数据。该雷达设备和视觉图像采集设备可以装设在目标设备上;示意性的,该目标设备可以是装设有光学相机以及雷达的车辆,本申请对目标设备的实现形式不进行限制。
步骤220,确定雷达数据的深度图。
其中,确定雷达数据的深度图的过程可以实现为:将雷达数据映射到图 像平面,得到雷达数据的像素点位置;基于雷达数据的空间位置与雷达设备在水平方向上的距离,确定各个雷达数据在像素点位置上的像素值;基于像素点位置以及各个雷达数据在像素点位置上的像素值,生成深度图。
可选的,在确定雷达数据的深度图之前,该方法还包括:对雷达设备和图像采集设备进行参数标定和时间同步,得到投影变换矩阵;该投影变换矩阵用以实现雷达数据的平面映射,同时使得平面映射获取到的深度图与图像采集设备采集到的视觉图像的坐标相对应。示意性的,雷达设备采集到的三维(3-dimentional,3D)点云数据(即雷达数据)的位置矢量可以表示为u=[x,y,z,1]T,图像采集设备采集到的视觉图像中的各个像素点的位置矢量可以表示为v=[m,n,1]T,投影变换矩阵可以表示为三者之间的关系可以表示为:
v=Pu        (1)
因此,在对雷达数据进行平面映射时,计算机设备可以根据已知的投影变换矩阵将点云数据投影至图像平面,得到点云数据在图像平面中的对应位置;之后,计算点云数据对应的空间位置与雷达设备水平方向的距离值;基于该距离值获取点云数据在图像平面中对应位置的像素点的像素值。示意性的,在获取到点云数据对应的空间位置与雷达设备水平方向的距离值后,可以对该距离值进行线性缩放,使得各个点云数据对应的像素值处于0~255的取值范围之间;基于点云数据在图像平面中的对应位置,以及各个对应位置上的像素点的像素值,生成灰度图形式的深度图。
步骤230,对深度图与视觉图像进行第一加权融合处理,得到第一视觉特征,其中,该第一视觉特征用于表征视觉图像对应的深度信息以及视觉信息。
由于视觉图像可以体现出环境中的目标物的视觉信息,比如纹理信息,色彩信息等,但难以体现环境中的目标物的深度信息,即目标物距离图像采集设备的距离,因此,在本申请实施例中,计算机设备可以通过对雷达数据的深度图与视觉图像进行第一加权融合处理,通过视觉图像中的深度信息增强视觉特征,得到能够体现深度信息和视觉信息的第一视觉特征。
步骤240,对雷达数据的第一雷达特征以及第一视觉特征进行第二加权融合处理,得到融合特征,其中,该第一雷达特征用于表征雷达数据在三维空间中的空间上下文信息。
该空间上下文信息可以包括空间信息以及分布信息。
在本申请实施例中,在获得第一雷达特征以及第一视觉特征后,为进一步实现多模态特征之间的信息互补,提高感知任务精度,计算机设备会对第一雷达特征与第一视觉特征进一步进行加权融合处理,即进行第二加权融合,以得到融合特征。
其中,该融合特征用于表征驾驶环境中的环境信息;需要说明的是,该融合特征指示的环境信息是基于驾驶场景下所需执行的环境感知任务确定的;示意性的,当感知任务为目标检测任务时,该融合特征包含用于确定目标物位置及类别所需的信息;比如,当感知任务为车道线识别任务时,该融合特征用于表征该驾驶环境中的车道线位置信息。
步骤250,基于融合特征进行环境感知,得到驾驶环境中的环境信息。
可选的,环境感知可以包括但不限于目标检测,语义分割等感知任务,计算机设备可以基于感知任务的不同,对融合特征进行不同的环境感知操作,以完成对应的感知任务,得到驾驶环境中的环境信息。
综上所述,本申请实施例提供的环境信息感知方法,在获取到驾驶环境中的雷达数据和视觉图像之后,通过确定雷达数据的深度图;对雷达数据的深度图与视觉图像进行第一次加权融合,得到第一视觉特征;再将第一雷达特征与第一视觉特征进行第二次加权融合,得到融合特征,最后,基于融合特征进行环境感知,得到驾驶环境中的环境信息;在上述方法中,计算机设备通过对视觉图像与雷达数据的多级加权融合,实现了对不同传感器采集的数据的充分利用和信息互补,从而提高了感知任务的精度和鲁棒性,进而提高了驾驶的安全性。
图3示出了本申请一示例性实施例示出的另一种环境信息感知方法的流程图,该环境信息感知方法可以由计算机设备执行,该计算机设备可以实现为如图1所示的环境信息感知设备130,如图3所示,该环境信息感知方法可以包括以下步骤:
步骤310,获取驾驶环境中的雷达数据和视觉图像。
其中,雷达数据是计算机设备通过雷达设备采集的,视觉图像是计算机设备通过图像采集设备采集的。
在本申请实施例中,雷达设备获取的雷达数据为点云数据,分布在长宽 高分别为L,W,H的三维空间内,记点云集合P={pi=[xi,yi,zi,ri]}i=1,…,N,其中,pi表示点云数据中的点,[xi,yi,zi,ri]表示该点在三维空间中的位置和反射强度。图像采集设备获取的视觉图像可以为RGB图像或者灰度图,本申请对此不进行限制。
步骤320,对雷达数据进行特征编码,得到雷达数据的第一雷达特征。
该第一雷达特征用于表征雷达数据在三维空间中的空间上下文信息。
在本申请实施例中,计算机设备可以直接对获得的雷达数据逐个进行特征编码,得到第一雷达特征;或者,为提高编码效率,节省计算机设备的计算量,计算机设备也可以对获得的雷达数据所处的三维空间进行空间划分,对空间划分获得的若干体素中的雷达数据分组进行特征编码,得到增强的雷达特征,即第一雷达特征;示意性的,雷达数据分布在长宽高为L×W×H的空间里,计算机设备可以将该空间以VL×VW×VH的粒度均匀划分为若干个长方体,每个长方体称为一个体素,每个体素中可以包含若干雷达数据,或者,不包含雷达数据;本申请实施例中,将包含有雷达数据的体素称为非空体素;以体素为单位,对各个非空体素中的雷达数据进行特征编码,得到各个非空体素编码后的特征,从而将多个非空体素各自对应的特征组成的特征集合确定为第一雷达特征。
以第一雷达特征为增强的雷达特征为例,对雷达特征进行特征编码的过程可以实现为:对点云数据的三维空间进行划分,得到多个体素;多个体素中包含至少一个非空体素,每个非空体素中包含至少一个点云数据;以体素为单位,对至少一组子雷达数据进行编码,得到至少一组子雷达数据分别对应的子雷达特征;该子雷达数据是指一个非空体素中的点云数据;该子雷达特征中包含局部三维空间的空间上下文信息;该局部三维空间是子雷达特征对应的非空体素占用的三维空间;将至少一个子雷达特征组成的特征集合确定为第一雷达特征。
其中,计算机设备在以体素为单位,对至少一组子雷达数据进行编码时,可以将至少一组子雷达数据输入特征编码网络,得到特征编码网络输出的各个非空体素对应的编码后的特征,即子雷达特征,将子雷达特征组成的特征集合确定为第一雷达特征。
在本申请实施例中,该特征编码网络可以包括多层堆叠的全连接层,以在对点云数据进行处理时,逐步对点云数据的特征维度进行扩充;其中,全 连接层的参数信息以及层数可以根据处理的数据和任务进行训练和更改,本申请对此不进行限制。
步骤330,确定雷达数据的深度图。
计算机设备确定雷达数据的深度图的过程可以参考图2所示实施例的相关内容,此处不再赘述。
进一步的,为了为视觉图像提供更丰富的深度信息,在得到雷达数据的深度图之后,计算机设备还可以对深度图进行深度补全;其中,深度补全是指对稀疏的深度图进行插值处理,以减少深度图中的深度缺失,从而得到较为丰富的深度信息,有益于丰富第一视觉特征所包含的信息量,从而提高整体环境感知任务的精度。
步骤340,将深度图的单通道图像信息与视觉图像的至少一个通道图像信息进行拼接,得到拼接图像。
其中,该拼接图像的通道数是深度图的通道数与视觉图像的通道数之和。
在本申请实施例中,计算机设备在对深度图和视觉图像进行拼接的过程中,可以先将深度图和视觉图像分别转化为深度数据矩阵和视觉数据矩阵,其中深度数据矩阵的数据维度为[height,width,1],当视觉图像为RGB图像时,视觉数据矩阵的数据维度为[height,width,3],其中,高度*宽度(height*width)为图片的像素数量,1和3分别表示图片的通道的数量,进而在对深度图的单通道图像信息与视觉图像的多通道图像信息进行拼接后,获得数据维度为[height,width,4]的拼接图像。本申请对深度图的单通道图像信息与视觉图像的多通道图像信息的拼接顺序不进行限制。
步骤350,对拼接图像中的图像信息进行加权融合处理,得到第一视觉特征。
该拼接图像中的图像信息包括深度图的单通道图像信息与视觉图像的至少一个通道图像信息。
可选的,计算机设备可以通过加权融合网络对拼接图像中的图像信息进行加权融合处理,得到第一视觉特征。
示意性的,该加权融合网络可以实现为堆叠的卷积层,该加权融合网络可以自适应地对拼接图像中的图像信息进行加权融合,得到第一视觉特征。
步骤360,对雷达数据的第一雷达特征以及第一视觉特征进行第二加权融合处理,得到融合特征。
在雷达特征与视觉特征的空间维度相同的前提下,才能实现雷达特征与视觉特征的加权融合。因此,若融合特征是第一雷达特征与第一视觉特征加权融合后获得的特征,在对雷达数据的第一雷达特征以及第一视觉特征进行第二加权融合处理,得到融合特征之前,该方法还包括:对第一雷达特征进行调整,获得空间维度相同的第一雷达特征和第一视觉特征;对空间维度相同的第一雷达特征和第一视觉特征进行第二加权融合,得到融合特征。
由于第一视觉特征是通过加权融合网络获得的,在加权融合网络对拼接图像的处理过程中,可能涉及下采样的处理过程,从而使得获得的第一视觉特征与视觉图像可能存在空间维度上的差异,因此,在不同的情况下可以基于不同的方式获取空间维度相同的第一雷达特征和第一视觉特征;该过程可以实现为:
若第一视觉特征的空间维度大小与视觉图像的空间维度大小相同,按照投影变换矩阵,将第一雷达特征映射到图像平面中,即可获得空间维度相同的第一雷达特征和第一视觉特征;若第一视觉特征的空间维度大小相较于视觉图像的空间维度大小存在差异,则需要调整第一雷达特征,使得调整后的第一雷达特征和第一视觉特征空间的维度一致,即在按照投影变换矩阵,将第一雷达特征映射到图像平面中后,对映射后的第一雷达特征进行空间维度的调整,调整后的第一雷达特征与第一视觉特征具有相同的空间维度,从而获得空间位置对应的雷达特征和视觉特征。
或者,在第一视觉特征的空间维度大小相较于视觉图像的空间维度大小存在差异时,也可以对第一视觉特征进行调整,使得调整后的第一视觉特征和第一雷达特征的空间维度一致,从而获得空间位置对应的雷达特征和视觉特征。
或者,为了进一步提高融合特征包含的信息的准确性,在获得第一雷达特征与第一视觉特征后,计算机设备可以依次对第一雷达特征进行加权处理,对第一视觉特征进行加权处理,对多模态特征进行加权融合处理;也就是说,获得融合特征的过程可以实现为:基于深度图所指示的探测距离,对第一雷达特征进行加权处理,得到第二雷达特征;基于第二雷达特征,对第一视觉特征进行加权处理,得到第二视觉特征;对第二雷达特征以及第二视觉特征进行加权融合处理,得到融合特征。
在此情况下,在加权处理之前,需要对雷达特征和视觉特征进行空间维 度的统一;即保证深度图与第一雷达特征的空间维度一致;保证第二雷达特征与第一视觉特征的空间维度一致;因此,该过程可以实现为:按照投影变换矩阵,将第一雷达特征映射到图像平面中,获得映射的第一雷达特征;映射的第一雷达特征与深度图的空间维度相同,可以基于深度图所指示的探测距离,对映射的第一雷达特征进行加权处理,得到第二雷达特征;若第一视觉特征的空间维度大小与视觉图像的空间维度大小相同,则第二雷达特征的空间维度大小与第一视觉特征的空间维度大小相同,可以直接基于第二雷达特征,对第一视觉特征进行加权处理,得到第二视觉特征。
若第一视觉特征的空间维度大小与视觉图像的空间维度大小不同,则第二雷达特征的空间维度大小与第一视觉特征的空间维度大小不同,此时,需要对第二雷达特征的空间维度进行调整,以使得调整后的第二雷达特征与第一视觉特征的空间维度相同,从而基于调整后的第二雷达特征,对第一视觉特征进行加权处理,得到第二视觉特征。
或者,若第一视觉特征的空间维度大小与视觉图像的空间维度大小不同,也可以对第一视觉特征的空间维度进行调整,以使得调整后的第一视觉特征与第二雷达特征的空间维度相同,从而基于第二雷达特征,对调整后的第一视觉特征进行加权处理,得到第二视觉特征。
以下实施例中涉及的第一雷达特征默认为映射后的第一雷达特征,第二雷达特征默认为调整后的第二雷达特征;即第一雷达特征与深度图的空间维度相同,第二雷达特征与第一视觉特征的空间维度相同,不进行映射前,映射后和调整后的区分。
在对第一雷达特征进行加权处理时,对第一雷达特征进行加权处理时的加权权重,是基于深度图所指示的探测距离确定的;其中,探测距离用以指示目标物在三维空间中的位置与雷达设备之间的距离。雷达数据的深度图直接反映了环境中目标物距离观测点(雷达设备)的距离信息,同时也蕴含了感兴趣区域的物体轮廓信息,因此,计算机设备可以将雷达数据的深度图获取为第一雷达特征的先验信息,并结合该先验信息,对第一雷达特征进行加权处理,得到第二雷达特征。
由于雷达设备在探测较远距离的目标物时,生成的点云数据会较为稀疏,且噪声较大,其置信度会相应降低;而在探测近距离及中距离的目标物时,点云较为致密,具有较好的判别性和较高的置信度;因此,可选地,探测距 离与第一雷达特征进行加权处理的加权权重反相关;也就是说,第一雷达特征中的特征数据对应的加权权重的大小,与该特征数据对应的探测距离的大小反相关;示意性的,计算机设备可以对探测距离较远的区域对应的特征数据赋予较小的加权权重,对探测距离较近区域对应的特征数据赋予较大的加权权重。
可选的,为提高权重赋值的准确性,计算机设备可以将深度图输入到权重赋值网络中,得到权重赋值网络输出的第一雷达特征的加权权重;该权重赋值网络可以充分利用深度图中的距离信息及物体轮廓信息,从而提高对加权权重的赋值精度。
在对第一视觉特征进行加权处理时,由于视觉图像的图像质量容易受到强光或阴影的影响,造成虚检或漏检,因此,应尽可能减少强光或阴影影响的图像区域中的视觉特征的影响;但是由于视觉图像中受到影响的区域无法通过先验得知,因此,本申请实施例中提供一种使用雷达特征进行辅助,结合第一视觉特征的自身特征,共同对第一视觉特征进行的加权的方式。
可选地,计算机设备可以将第二雷达特征与第一视觉特征进行加合,并将加合后的输出(加合特征)使用权重获取网络进行处理,得到视觉图像中每个区域的加权权重,将该加权权重与第一视觉特征进行对应位置相乘,实现对第一视觉特征加权,得到第二视觉特征。
也就是说,对第一视觉特征进行加权处理可以实现为以下过程:将第二雷达特征与第一视觉特征进行加合,得到加合特征;其中,该第二雷达特征的通道数与第一视觉特征的通道数相同;该第二雷达特征的空间维度与第一视觉特征的空间维度相同;将加合特征输入权重获取网络,得到视觉图像中的各个区域的加权权重;根据各个区域的加权权重,对各个区域的第一视觉特征进行加权处理,得到第二视觉特征。
在本申请实施例中,视觉图像中包含目标区域;该目标区域为视觉图像中图像质量低于图像质量阈值的区域;目标区域在第二视觉特征中的第二视觉特征值,小于目标区域在第一视觉特征中的第一视觉特征值。
也就是说,视觉图像中图像质量较低的目标区域的加权权重较低,以减少图像质量较差的目标区域中的视觉特征的影响。
其中,对第二雷达特征与第一视觉特征进行加合可以是对第二雷达特征与第一视觉特征逐通道进行加合,得到的加合特征;因此,需要保证第二雷 达特征与第一视觉特征的通道数和空间维度一致;在已实现第二雷达特征与第一雷达特征的空间维度一致的情况下,若第二雷达特征与第一视觉特征的通道数一致,则可以直接执行加合操作,获得加合特征;若第二雷达特征与第一视觉特征的通道数不一致,在将第二雷达特征与第一视觉特征进行加合之前,可以使用卷积神经网络对第二雷达特征与第一视觉特征分别进行处理,以使第二雷达特征与第一视觉特征具有相同的通道数。
对第二雷达特征与第一视觉特征进行加合获得的加合特征,与第二雷达特征和第一视觉特征也具有相同的通道数。
其中,视觉特征中的各个区域可以与雷达数据的体素映射到图像平面上的区域相对应。该权重获取网络用于对加合特征进行权重提取,获得视觉图像中的各个区域的加权权重。
雷达数据可以较好的保留环境中的目标物的三维几何信息,而视觉图像可以较好的保留环境中的目标物的视觉信息;不同的感知任务对不同传感器的特征的重视程度也不尽相同,因此,在进行多模态特征融合时,可以基于感知任务的需求对多模态特征进行加权融合,以得到与感知任务的需求相适应的融合特征。
本申请实施例提供一种可能的获得融合特征的方式,可选的,获得融合特征的过程可以实现为:将第二雷达特征和第二视觉特征进行拼接,得到拼接特征;其中,该拼接特征具有至少两个通道;对拼接特征进行全局平均池化处理,得到池化后特征;将池化后特征输入全连接层进行处理,得到处理后特征;对处理后特征进行非线性变换,得到拼接特征的各个通道的加权权重;根据拼接特征的各个通道的加权权重,对拼接特征进行加权处理,得到融合特征。
可选的,将第二雷达特征和第二视觉特征进行拼接,得到拼接特征可以是指,将第二雷达特征和第二视觉特征进行通道维度上的拼接,得到拼接特征;该拼接特征的通道数是第二雷达特征的通道数与第二视觉特征的通道数之和。示意性的,若第二雷达特征的通道数与第二视觉特征的通道数分别为64,那么获得的拼接特征的通道数为128。
其中,上述基于拼接特征获得拼接特征的各个通道的加权权重的过程可以实现为,将拼接特征输入到权重提取网络中,获得权重提取网络输出的拼接特征的各个通道的加权权重;该权重提取网络中可以包括池化层、全连接 层以及非线性变换层;基于此,权重提取网络对拼接特征的处理过程可以实现为:通过池化层对拼接特征进行全局平均池化,得到池化后特征;将池化后特征输入全连接层进行处理,将经过多层全连接层处理后得到的处理后特征输入到非线性变换层,通过非线性变换层对处理后特征进行非线性变换,得到拼接图像各个通道的加权权重。
该权重提取网络是基于感知任务需求训练得到的。该权重提取网络可以基于感知任务的不同,对来自不同传感器的特征进行不同的权重调节,以提高相应的感知任务的完成度。
图4示出了本申请一示例性实施例提供的融合特征的获取过程的示意图,如图4所示,计算机设备在获取到第一视觉特征410,第一雷达特征420以及雷达数据的深度图430之后,可以先将深度图430作为第一雷达特征410的先验信息,对第一雷达特征420进行加权处理,得到第二雷达特征440;接着,以第二雷达特征440为辅助特征,对第一视觉特征410进行加权处理,得到第二视觉特征450;之后,对第二雷达特征440与第二视觉特征450进行加权融合,即进行多模态特征加权融合,以得到融合特征460。
步骤370,基于融合特征进行环境感知,得到驾驶环境中的环境信息。
该环境信息可以包括驾驶环境中的目标物的位置信息以及数量信息中的至少一种。
在本申请实施例中,计算机设备可以将融合特征输入到感知网络中进行处理,以实现对环境信息感知。该感知网络可以是基于不同的感知任务对应设计的网络结构,该感知网络用于对融合特征进行处理并输出检测结果,以完成对应的感知任务;其中,该感知任务包括但不限于目标检测、语义分割等。示意性的,在自动驾驶场景中,该感知任务可以实现为对车道线信息的感知,对信号灯信息的感知以及对路上行人的数量和位置的感知等等,本申请对此不进行限制。
可选的,在本申请实施例中,计算机设备可以通过环境感知模型实现对驾驶环境的感知,该环境感知模型中可以包含上述实施例中涉及的特征编码网络、加权融合网络、权重赋值网络,权重提取网络、权重获取网络以及感知网络等网络结构;该环境感知模型用于在接收到驾驶环境的雷达数据以及视觉图像后,对雷达数据与视觉图像进行处理,以完成相应的感知任务。图5示出了本申请一示例性实施例实处的环境感知模型的示意图,如图5所示, 该环境感知模型500包括:特征编码网络510、加权融合网络520、权重赋值网络530,权重提取网络540、权重获取网络550以及感知网络560;示意性的,该环境感知模型500在获取到驾驶环境的雷达数据和视觉图像后,在将雷达数据进行空间划分后,将体素级的雷达数据分别输入到特征编码网络510中,得到特征编码网络510输出的各个体素级的雷达数据(即子雷达数据)的子雷达特征,将子雷达数据的集合获取为第一雷达特征;同时,在对雷达数据进行平面映射得到深度图后,将深度图与视觉图像进行通道维度的拼接,得到拼接图像后,将拼接图像输入到加权融合网络520中,得到加权融合网络520输出的第一视觉特征,该第一视觉特征可以表征深度信息和视觉信息;为实现对第一雷达特征的加权处理,将深度图输入到权重赋值网络530中,得到权重赋值网络530输出的第一雷达特征的加权权重,以基于该加权权重对第一雷达特征进行加权,得到第二雷达特征;为实现对第一视觉特征的加权处理,可以先对第二雷达特征与第一视觉特征进行逐通道加合,得到加合特征,再将加合特征输入到权重获取网络550中,得到权重获取网络550输出的视觉图像中的各个区域的加权权重,以基于该加权权重对第一视觉特征进行加权,得到第二视觉特征;在得到第二雷达特征和第二视觉特征后,对两者进行通道维度的拼接,得到拼接特征,并将该拼接特征输入到权重提取网络540中,得到权重提取网络540输出的各个通道的加权权重,以基于该加权权重实现多模态特征之间的特征融合,得到融合特征;将融合特征输入到感知网络550中,获得感知网络550输出的感知结果,该感知结果是与驾驶环境中的感知任务相对应的结果。
在本申请实施例中,该环境感知模型可以是通过端到端的模型训练方式训练得到的机器模型;即该环境感知模型是基于样本视觉图像,样本雷达数据以及感知结果标签训练得到的,该感知结果标签可以基于感知任务的不同进行设置;示意性的,当感知任务为目标检测时,该感知任务标签可以设置为环境中的目标物所在的位置标签及类别标签,此时训练好的环境感知模型用于检测驾驶环境中的目标物所在的位置及类别,也就是说,驾驶环境的环境信息为驾驶环境中目标物的位置信息及类别信息。
在对环境感知模型的训练过程中,对环境感知模型中包含的各个网络的网络参数进行调整,从而实现对多模态特征的权重调整,以使得训练完成后的环境感知模型可以更好的完成对应的感知任务。
综上所述,本申请实施例提供的环境信息感知方法,在获取到驾驶环境中的雷达数据和视觉图像之后,通过确定雷达数据的深度图;对雷达数据的深度图与视觉图像进行第一次加权融合,得到第一视觉特征;再将第一雷达特征与第一视觉特征进行第二次加权融合,得到融合特征,最后,基于融合特征进行环境感知,得到驾驶环境中的环境信息;在上述方法中,计算机设备通过对视觉图像与雷达数据的多级加权融合,实现了对不同传感器采集的数据的充分利用和信息互补,从而提高了感知任务的精度和鲁棒性,进而提高了驾驶的安全性。
图6示出了本申请一示例性实施例提供的环境信息感知模组的示意图,该环境信息感知模组可以应用于计算机设备(比如环境信息感知系统中的环境信息感知设备130)中,用以实现如图2和图3任一所示实施例的全部或部分步骤,如图6所示,该环境信息感知模组可以包括投影变换部件610,第一加权融合部件620,雷达数据编码部件630,第二加权融合部件640以及感知网络部件650;其中,投影变换部件610用于将雷达数据投影至图像平面,获得点云数据在图像平面的对应位置;获取点云数据对应的空间位置与雷达设备(比如车辆)的水平距离值,并对该水平距离值进行线性缩放,获得点云数据的像素值,从而基于点云数据在图像平面的对应位置和像素值生成灰度图形式的深度图;第一加权融合部件620,用于对雷达数据的深度图与视觉图像进行加权融合和处理,得到经过深度图增强的视觉特征,即第一视觉特征;雷达数据编码部件630,用于将雷达数据按空间位置分布划分为若干体素,通过神经网络结构对各个体素内的雷达数据进行特征编码,得到具有空间上下文信息的增强的雷达特征,即第一雷达特征;第二加权融合部件640,用于将第一雷达特征与第一视觉特征根据自身特性分别进行加权,然后对加权后的雷达特征(第二雷达特征)和加权后的视觉特征(第二视觉特征)进行多模态特征的加权融合,输出最终的融合特征;感知网络部件650用于将最终的融合特征作为输入,采用神经网络结构对融合特征进行处理,执行包括但不限于目标检测、语义分割等感知任务,从而获得对应于感知任务的环境信息。
以在自动驾驶领域中对自动驾驶环境进行环境感知的场景为例,图2和 图3所示实施例提供的环境信息感知方法可以应用于如下场景中,需要说明的是,本申请提供的场景应用仅为示意性的,本申请不对环境信息感知方法的应用场景进行限制。以下实施例基于图5所示的环境信息感知模组实现环境信息感知方法进行说明:
1、封闭单一场景下的自动驾驶感知。
在一些特定封闭环境下,如矿山、港口等区域,自动驾驶设备需识别的目标物、障碍物较单一,同时,这些场景中的目标物或障碍物的体积通常较大,基于这一特点,可适当降低对传感器设备精度的要求,以节省设备部署成本。在环境信息感知系统所使用的传感器中,高精度激光雷达造价较高,因此可采用较低线数的激光雷达或4D毫米波雷达取代高精度激光雷达进行雷达数据采集,结合光学相机采集的视觉图像,进行环境感知。
图7示出了本申请一示例性实施例提供的封闭单一场景下的环境信息感知方法实施过程的示意图,如图7所示,该环境信息感知方法的实施过程可以实现为:
S701,获取经过参数标定和时间同步的雷达数据和视觉图像以及投影变换矩阵。
其中,该雷达数据和视觉图像是处于自动驾驶过程中的自动驾驶设备,通过自身装载的雷达设备和图像采集设备获得的。
点云数据(即雷达数据)与视觉图像中的像素,可通过投影变换矩阵P进行空间位置对应。
S702,将雷达数据输入投影变换部件。
投影变换部件可以将雷达数据根据投影变换矩阵P投影至图像平面,获得点云数据在图像平面的对应位置,并计算点云数据的采集位置与雷达设备之间的水平距离值,使用该水平距离值作为深度图的像素取值,并将该像素取值的范围进行线性缩放,使其处于0-255的区间内,将缩放后的像素取值获取为深度图的像素值,进而基于点云数据在图像平面的对应位置和像素值,生成深度图。
由于低线数激光雷达或4D毫米波雷达得到的点云数据较为稀疏,其投影后得到的深度图是稀疏的。在本申请实施例中,投影变换部件可以采用传统的图像处理方法,包括图像膨胀、形态学闭操作、中值滤波、高斯滤波等,对稀疏的深度图进行补全,形成稠密的深度图。其中,传统的图像处理方法 的参数,如使用的核的大小及形状等可以结合实际进行调整,本申请不进行限制。
S703,将深度图与视觉图像输入第一加权融合部件。
第一加权融合部件首先对视觉图像与深度图进行通道维度的拼接,获得拼接图像;然后使用堆叠的卷积层(即加权融合网络)对拼接图像进行处理,自适应地加权融合深度信息与图像信息,得到增强的视觉特征(即第一视觉特征)。
S704,将雷达数据输入雷达数据编码部件进行特征编码。
在本申请实施例中,设雷达数据分布在长宽高为L×W×H的空间里,将该空间以VL×VW×VH的粒度均匀划分为若干个长方体,每个长方体称为一个体素,每个体素中包含若干点云数据或不包含点云数据。可选的,该雷达数据编码部件可以是基于VoxelNet(体素网络)结构构建的,此时,雷达数据编码部件可以使用多层堆叠的全连接层网络对体素中的非空体素进行编码,得到逐个非空体素的特征表示(即子雷达特征),该特征表示中包含了非空体素的空间上下文信息;将子雷达特征的集合确定为对雷达数据的特征编码结果,即第一雷达特征。
S705,将深度图、第一视觉特征和第一雷达特征输入第二加权融合部件。
在第二加权融合部件中,首先将第一雷达特征根据投影变换矩阵映射至图像平面,该映射可以是以体素为单位进行的映射,每个体素对应于图像平面上的一个区域;在本申请实施例中,映射后的第一雷达特征,与第一视觉特征的空间维度大小相同。对第一视觉特征与第一雷达特征分别根据自身特性进行加权;示意性的,以深度图作为先验信息对第一雷达特征进行加权,获得第二雷达特征;以第二雷达特征为辅助,结合第一视觉特征的自身特性对第一视觉特征进行加权,获得第二视觉特征,然后进行多模态数据的加权融合,即对第二视觉特征和第二雷达特征进行加权融合,得到融合特征。
图8示出了本申请一示例性实施例提供的一种融合特征获取过程的示意图,其中,图8中的A部分示出了本申请实施例提供的雷达特征加权的一种实施方式。如图8中的A部分所示,在本申请实施例中,对第一雷达特征加权时,采用深度图作为先验信息,将深度图的像素值采用以下式(2)进行归一化处理:
其中,dmin为深度图像素取值中的最小值,dmax为深度图像素取值中的最大取值。令加权权重dw=1-dnorm,从而可得到第一雷达特征不同区域的加权权重。
进一步的,为避免采用深度图后过度关注距离近处的点云数据,采用如下式(3)的重加权函数(Re-weighted Function)对加权权重进行处理,获得调整后的加权权重dt
上述重加权函数可以减小点云数据的加权权重差异,同时适当保留了中远距离的点云信息,将通过上述重加权函数得到的加权权重与第一雷达特征(该第一雷达特征可以是指基于投影变换矩阵,映射到图像平面上获得的映射后的第一雷达特征)进行空间上逐像素相乘,则可对雷达数据中距离较远、噪声较大的点进行抑制,获得第二雷达特征。
图8中的B部分示出了本申请实施例提供的视觉特征加权的一种实施方式,如图8中的B部分所示,在本申请实施例中,视觉特征加权的过程包括:1)使用卷积核为1×1的卷积层分别对第二雷达特征和第一视觉特征进行处理,使其具有相同的通道数;2)对第一视觉特征与第二雷达特征进行逐通道加合,获得加合特征,对加合特征使用非线性激活函数进行处理,示意性的,该非线性激活函数可以是ReLU函数,之后对激活函数处理后的输出使用3×3卷积层进行处理,进一步的,对卷积层处理后的输出使用非线性激活函数激活,得到视觉图像中的各个区域的加权权重,示意性的,该非线性激活函数可以是Sigmoid函数;3)将该视觉图像中的各个区域的加权权重与第一视觉特征进行对应位置相乘,即可完成对第一视觉特征的加权,得到第二视觉特征,且目标区域的第二视觉特征值小于第一视觉特征值,该目标区域为视觉图像中图像质量低于图像质量阈值的区域,从而抑制视觉图像中受到强光、阴影部分影响的区域对感知任务的影响。其中,该图像质量可以是对该目标区域内的图像评分,图像质量阈值可以基于实际情况由网络自适应进行设定,本申请对此不进行限制。
图8中的C部分示出了本申请实施例提供的视觉雷达特征加权融合的一种实施方式,如图8中的C部分所示,首先将第二雷达特征与第二视觉特征进行通道维度的拼接,获得拼接特征;之后,可使用权重提取网络进行拼接特征的权重学习;示意性的,权重提取网络可以是SENet(Squeeze-and-Excitation Networks,挤压激励网络),在此情况下,权重提取网络首先对拼接后特征进行全局平均池化,再对池化后的特征使用多个全连接层和非线性激活函数进行处理,得到拼接特征的各个通道的加权权重,该加权权重与拼接特征进行逐通道相乘,得到融合特征。
S706,将融合特征输入到感知任务网络中,完成相应的感知任务。
上述步骤中所涉及的各个网络的参数获取可基于感知任务,采取有监督的端到端训练过程获取。
示意性的,当感知任务为目标检测任务时,对上述网络组成的模型的训练集可以包括样本雷达数据,样本视觉图像以及目标对象标签;在将训练好的模型应用到上述场景中时,计算机设备可以基于获得自动驾驶环境的雷达数据和视觉图像,确定自动驾驶环境中的目标对象。
2、开放复杂场景下的自动驾驶感知。
在开放复杂场景下,如城市道路、十字路口等,自动驾驶设备需识别的目标物、障碍物非常复杂,场景中可能存在大量的行人、自行车等相对较小的目标,对自动驾驶设备的安全性提出了更高的要求。在此类场景中,需采用高精度的激光雷达并结合视觉图像,对周围环境进行精确感知。
图9示出了本申请一示例性实施例提供的开放复杂场景下的环境信息感知方法的实施过程的示意图,如图9所示,该环境信息感知方法的实施过程可以实现为:
S901,获取经过参数标定和时间同步的雷达数据和视觉图像以及投影变换矩阵。
通过投影变换矩阵,实现雷达数据与视觉图像的空间位置对应。
S902,将雷达数据输入投影变换部件,得到雷达数据的深度图。
投影变换部件可以将雷达数据根据投影变换矩阵P投影至图像平面,获得点云数据在图像平面的对应位置,并计算点云数据的采集位置与雷达设备之间的水平距离值,使用该水平距离值作为深度图的像素取值,并将该像素取值的范围进行线性缩放,使其处于0-255的区间内,将缩放后的像素取值 获取为深度图的像素值,进而基于点云数据在图像平面的对应位置和像素值,生成深度图。在本实施例中,由于采用了高精度的激光雷达,可以不进行深度补全,以节省数据处理成本。
S903,将深度图与视觉图像输入第一加权融合部件,得到第一视觉特征。
S903的实现过程可以参考S703的相关内容,此处不再赘述。
S904,将雷达数据输入雷达数据编码部件进行特征编码,得到第一雷达特征。
S904的实现过程可以参考S704的相关内容,此处不再赘述。
S905,将深度图、第一视觉特征和第一雷达特征输入第二加权融合部件,得到融合特征。
图10示出了本申请一示例性实施例提供的另一种融合特征获取过程的示意图;其中,图10中的A部分示出了本申请实施例提供的雷达特征加权的一种实施方式,如图10中的A部分所示,在本申请实施例中,对第一雷达特征加权的过程包括:1)将深度图输入Encoder-Decoder(编码器-解码器)网络结构中,获取Encoder-Decoder网络结构输出的加权权重,该结构由多层卷积层、非线性激活函数、上下采样层构成。其中,Encoder结构用于对深度图中的图像信息进行编码和抽象,在本申请实施例中,则是进行空间维度上的下采样和通道维度扩张;Decoder结构用于对编码后信息进行解码和恢复,在本申请实施例中,则是将抽象后特征恢复至原始维度。深度图输入Encoder-Decoder网络中,可获得相应加权权重,该结构可充分利用深度图中的距离信息以及蕴含的轮廓信息。Encoder-Decoder网络的参数可以通过有监督训练获得;2)将该加权权重与第一雷达特征进行对应位置相乘,即可完成加权,获得第二雷达特征。
需要说明的是,图8和图10中分别提供了一种特征编码网络的可能实现形式,在应用中可以基于实际需求选择其中一种进行应用,本申请不对两种特征编码网络的实际应用场景进行限制。
图10中的B部分示出了本申请实施例提供的视觉特征加权的一种实施方式,图10中的C部分示出了本申请实施例提供的视觉雷达特征加权融合的一种实施方式,该过程可以参考图8中的B部分和C部分对应的实施例的相关内容,此处不再赘述。
S906,将融合特征输入到感知任务网络中,完成相应的感知任务。
上述步骤中所涉及的各个网络的参数获取可基于感知任务,采取有监督的端到端训练过程获取。
综上所述,本申请实施例提供的环境信息感知方法,在获取到驾驶环境中的雷达数据和视觉图像之后,通过确定雷达数据的深度图;对雷达数据的深度图与视觉图像进行第一次加权融合,得到第一视觉特征;再将第一雷达特征与第一视觉特征进行第二次加权融合,得到融合特征,最后,基于融合特征进行环境感知,得到驾驶环境中的环境信息;在上述方法中,计算机设备通过对视觉图像与雷达数据的多级加权融合,实现了对不同传感器采集的数据的充分利用和信息互补,从而提高了感知任务的精度和鲁棒性。
当本申请实施例提供的环境感知方法应用于自动驾驶场景中时,可以提高对自动驾驶环境中的环境信息的感知精度和鲁棒性,从而提高自动驾驶的安全性,同时,在不同环境下基于环境复杂度的不同对环境感知过程进行适应性调整,可以进一步保证自动驾驶系统的安全性和感知效率。
图11示出了本申请一示例性实施例提供的环境信息感知装置的方框图,该环境信息感知装置可以用于实现如图2和图3所示实施例的全部或部分步骤,如图11所示,该环境信息感知装置包括:数据获取模块1110,用于获取驾驶环境中的雷达数据和视觉图像;深度图确定模块1120,用于确定所述雷达数据的深度图;第一加权融合模块1130,用于对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视觉信息;第二加权融合模块1140,用于对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;环境感知模块1150,用于基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
在一种可能的实现方式中,所述第一加权融合模块1130,包括:图像拼接子模块,用于将所述深度图的单通道图像信息与所述视觉图像的至少一个通道图像信息进行拼接,得到拼接图像;第一加权子模块,用于对所述拼接图像中的图像信息进行加权融合处理,得到所述第一视觉特征。
在一种可能的实现方式中,所述第二加权融合模块1140,包括:第二加权子模块,用于基于所述深度图所指示的探测距离,对所述第一雷达特征进 行加权处理,得到第二雷达特征;第三加权子模块,用于基于所述第二雷达特征,对所述第一视觉特征进行加权处理,得到第二视觉特征;第四加权子模块,用于对所述第二雷达特征以及所述第二视觉特征进行加权融合处理,得到所述融合特征。
在一种可能的实现方式中,所述探测距离与对所述第一雷达特征进行加权处理的加权权重负相关。
在一种可能的实现方式中,所述第三加权子模块,用于,将所述第二雷达特征与所述第一视觉特征进行加合,得到加合特征;其中,所述第二雷达特征的通道数与所述第一视觉特征的通道数相同;所述第二雷达特征的空间维度与所述第一视觉特征的空间维度相同;将所述加合特征输入权重获取网络,得到所述视觉图像中的各个区域的加权权重;根据所述各个区域的加权权重,对所述各个区域的第一视觉特征进行加权处理,得到所述第二视觉特征。
在一种可能的实现方式中,所述视觉图像中包含目标区域;所述目标区域为所述视觉图像中图像质量低于图像质量阈值的区域;所述目标区域在所述第二视觉特征中的第二视觉特征值,小于所述目标区域在所述第一视觉特征中的第一视觉特征值。
在一种可能的实现方式中,所述第四加权子模块,用于,将所述第二雷达特征和所述第二视觉特征进行拼接,得到拼接特征;其中,所述拼接特征具有至少两个通道;对所述拼接特征进行全局平均池化处理,得到池化后特征;将所述池化后特征输入全连接层进行处理,得到处理后特征;对所述处理后特征进行非线性变换,得到所述拼接特征的各个通道的加权权重;根据所述拼接特征的各个通道的加权权重,对所述拼接特征进行加权处理,得到所述融合特征。
在一种可能的实现方式中,所述雷达数据为点云数据;所述装置还包括:雷达特征获取模块;所述雷达特征获取模块包括空间划分子模块,特征编码子模块以及雷达特征确定子模块;所述空间划分子模块,用于对所述点云数据的三维空间进行划分,得到多个体素;所述多个体素中包含至少一个非空体素,每个所述非空体素中包含至少一个所述点云数据;所述特征编码子模块,用于以所述体素为单位,对至少一组子雷达数据进行编码,得到至少一组子雷达数据分别对应的子雷达特征;所述子雷达数据包括至少一个所述非 空体素中的雷达数据;所述子雷达特征中包含局部三维空间的所述空间上下文信息;所述局部三维空间是所述子雷达特征对应的非空体素占用的所述三维空间;所述雷达特征确定子模块,用于将至少一个所述子雷达特征组成的特征集合确定为所述第一雷达特征。
综上所述,本申请实施例提供的环境信息感知装置,在获取到驾驶环境中的雷达数据和视觉图像之后,通过确定雷达数据的深度图;对雷达数据的深度图与视觉图像进行第一次加权融合,得到第一视觉特征;再将第一雷达特征与第一视觉特征进行第二次加权融合,得到融合特征,最后,基于融合特征进行环境感知,得到驾驶环境中的环境信息;在上述方法中,计算机设备通过对视觉图像与雷达数据的多级加权融合,实现了对不同传感器采集的数据的充分利用和信息互补,从而提高了感知任务的精度和鲁棒性,进而提高了自动驾驶的安全性。
图12示出了本申请一示例性实施例示出的计算机设备1200的结构框图。该计算机设备可以实现为本申请上述方案中的环境信息感知设备。所述计算机设备1200包括中央处理单元(Central Processing Unit,CPU)1201、包括随机存取存储器(Random Access Memory,RAM)1202和只读存储器(Read-Only Memory,ROM)1203的系统存储器1204,以及连接系统存储器1204和中央处理单元1201的系统总线1205。所述计算机设备1200还包括用于存储操作系统1209、客户端1210和其他程序模块1211的大容量存储设备1206。
计算机可读存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机可读存储介质包括RAM、ROM、可擦除可编程只读寄存器(Erasable Programmable Read Only Memory,EPROM)、电子抹除式可复写只读存储器(Electrically-Erasable Programmable Read-Only Memory,EEPROM)闪存或其他固态存储器技术,只读光盘(Compact Disc-ROM,CD-ROM)、数字多功能光盘(Digital Versatile Disc,DVD)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机可读存储介质不局限于上述几种。上述的系统存储器1204和大容量存储设备1206可以统称为存储器。
根据本申请的各种实施例,所述计算机设备1200还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备1200可以通过连接在所述系统总线1205上的网络接口单元1207连接到网络1208,或者说,也可以使用网络接口单元1207来连接到其他类型的网络或远程计算机系统(未示出)。
所述存储器还包括至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、至少一段程序、代码集或指令集存储于存储器中,中央处理器1201通过执行该至少一条指令、至少一段程序、代码集或指令集来实现上述各个实施例所示的环境信息感知方法中的全部或部分步骤。
在一示例性实施例中,还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条计算机程序,该计算机程序由处理器加载并执行以实现上述环境信息感知方法中的全部或部分步骤。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在一示例性实施例中,还提供了一种计算机程序产品,该计算机程序产品包括至少一条计算机程序,该计算机程序由处理器加载并执行上述图2、图3、图7或图9任一实施例所示的环境信息感知方法的全部或部分步骤。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由权利要求指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (12)

  1. 一种环境信息感知方法,所述方法包括:
    获取驾驶环境中的雷达数据和视觉图像;
    确定所述雷达数据的深度图;
    对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视觉信息;
    对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;
    基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
  2. 根据权利要求1所述的方法,其中,所述对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,包括:
    将所述深度图的单通道图像信息与所述视觉图像的至少一个通道图像信息进行拼接,得到拼接图像;
    对所述拼接图像中的图像信息进行加权融合处理,得到所述第一视觉特征。
  3. 根据权利要求1所述的方法,其中,所述对所述第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,包括:
    基于所述深度图所指示的探测距离,对所述第一雷达特征进行加权处理,得到第二雷达特征;
    基于所述第二雷达特征,对所述第一视觉特征进行加权处理,得到第二视觉特征;
    对所述第二雷达特征以及所述第二视觉特征进行加权融合处理,得到所述融合特征。
  4. 根据权利要求3所述的方法,其中,所述探测距离与对所述第一雷达特征进行加权处理的加权权重反相关。
  5. 根据权利要求3所述的方法,其中,所述基于所述第二雷达特征,对所述第一视觉特征进行加权处理,得到第二视觉特征,包括:
    将所述第二雷达特征与所述第一视觉特征进行加合,得到加合特征;其中,所述第二雷达特征的通道数与所述第一视觉特征的通道数相同;所述第二雷达特征的空间维度与所述第一视觉特征的空间维度相同;
    将所述加合特征输入权重获取网络,得到所述视觉图像中的各个区域的加权权重;
    根据所述各个区域的加权权重,对所述各个区域的第一视觉特征进行加权处理,得到所述第二视觉特征。
  6. 根据权利要求5所述的方法,其中,所述视觉图像中包含目标区域;所述目标区域为所述视觉图像中图像质量低于图像质量阈值的区域;
    所述目标区域在所述第二视觉特征中的第二视觉特征值,小于所述目标区域在所述第一视觉特征中的第一视觉特征值。
  7. 根据权利要求3所述的方法,其中,所述对所述第二雷达特征以及所述第二视觉特征进行加权融合处理,得到所述融合特征,包括:
    将所述第二雷达特征和所述第二视觉特征进行拼接,得到拼接特征;其中,所述拼接特征具有至少两个通道;
    对所述拼接特征进行全局平均池化处理,得到池化后特征;
    将所述池化后特征输入全连接层进行处理,得到处理后特征;
    对所述处理后特征进行非线性变换,得到所述拼接特征的各个通道的加权权重;
    根据所述拼接特征的各个通道的加权权重,对所述拼接特征进行加权处理,得到所述融合特征。
  8. 根据权利要求1所述的方法,其中,所述雷达数据为点云数据;在对所述雷达数据的第一雷达特征以及所述第一视觉特征进行加权融合处理,得到融合特征之前,所述方法还包括:
    对所述点云数据的三维空间进行划分,得到多个体素;所述多个体素中 包含至少一个非空体素,每个所述非空体素中包含至少一个所述点云数据;
    以所述体素为单位,对至少一组子雷达数据进行编码,得到至少一组子雷达数据分别对应的子雷达特征;所述子雷达数据是指一个所述非空体素中的所述点云数据;所述子雷达特征中包含局部三维空间的所述空间上下文信息;所述局部三维空间是所述子雷达特征对应的非空体素占用的所述三维空间;
    将至少一个所述子雷达特征组成的特征集合确定为所述第一雷达特征。
  9. 一种环境信息感知装置,所述装置包括:
    数据获取模块,用于获取驾驶环境中的雷达数据和视觉图像;
    深度图确定模块,用于确定所述雷达数据的深度图;
    第一加权融合模块,用于对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视觉信息;
    第二加权融合模块,用于对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;
    环境感知模块,用于基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
  10. 一种环境信息感知系统,所述系统包括图像采集设备、雷达设备以及环境感知设备;
    所述图像采集设备,用于采集驾驶环境中的视觉图像,并将所述视觉图像发送给所述环境感知设备;
    所述雷达设备,用于采集所述驾驶环境中的雷达数据,并将所述雷达数据发送给所述环境感知设备;
    所述环境感知设备,用于接收所述图像采集设备发送的所述视觉图像和所述雷达设备发送的所述雷达数据;
    确定所述雷达数据的深度图;
    对所述深度图与所述视觉图像进行第一加权融合处理,得到第一视觉特征,其中,所述第一视觉特征用于表征所述视觉图像对应的深度信息以及视 觉信息;
    对所述雷达数据的第一雷达特征以及所述第一视觉特征进行第二加权融合处理,得到融合特征,其中,所述第一雷达特征用于表征所述雷达数据在三维空间中的空间上下文信息;
    基于所述融合特征进行环境感知,得到所述驾驶环境中的环境信息。
  11. 一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载以执行权利要求1至8任一所述的环境信息感知方法。
  12. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述计算机程序由处理器加载以执行权利要求1至8任一所述的环境信息感知方法。
PCT/CN2023/108450 2022-09-02 2023-07-20 环境信息感知方法、装置、系统、计算机设备及存储介质 WO2024045942A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211072265.3 2022-09-02
CN202211072265.3A CN117710931A (zh) 2022-09-02 2022-09-02 环境信息感知方法、装置、系统、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2024045942A1 true WO2024045942A1 (zh) 2024-03-07

Family

ID=90100316

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/108450 WO2024045942A1 (zh) 2022-09-02 2023-07-20 环境信息感知方法、装置、系统、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN117710931A (zh)
WO (1) WO2024045942A1 (zh)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229366A (zh) * 2017-12-28 2018-06-29 北京航空航天大学 基于雷达和图像数据融合的深度学习车载障碍物检测方法
CN109035309A (zh) * 2018-07-20 2018-12-18 清华大学苏州汽车研究院(吴江) 基于立体视觉的双目摄像头与激光雷达间的位姿配准方法
US20200217950A1 (en) * 2019-01-07 2020-07-09 Qualcomm Incorporated Resolution of elevation ambiguity in one-dimensional radar processing
US11275673B1 (en) * 2019-06-24 2022-03-15 Zoox, Inc. Simulated LiDAR data
WO2022039765A1 (en) * 2020-08-17 2022-02-24 Harman International Industries, Incorporated Systems and methods for object detection in autonomous vehicles
CN114022858A (zh) * 2021-10-18 2022-02-08 西南大学 一种针对自动驾驶的语义分割方法、系统、电子设备及介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈明 (CHEN, MING): "基于激光与视觉信息融合的运动目标检测关键技术研究 (Non-official translation: Research on Key Technologies of Moving Object Detection Based on Laser and Vision Information Fusion)", 中国优秀硕士论文全文数据库 (CHINA MASTER'S THESES FULL-TEXT DATABASE), 31 July 2020 (2020-07-31), pages 41 - 56 *

Also Published As

Publication number Publication date
CN117710931A (zh) 2024-03-15

Similar Documents

Publication Publication Date Title
CN110472627B (zh) 一种端到端的sar图像识别方法、装置及存储介质
CN108895981B (zh) 一种三维测量方法、装置、服务器和存储介质
CN113284163B (zh) 基于车载激光雷达点云的三维目标自适应检测方法及系统
CN115082674A (zh) 基于注意力机制的多模态数据融合三维目标检测方法
CN111047630A (zh) 神经网络和基于神经网络的目标检测及深度预测方法
CN115147333A (zh) 一种目标检测方法及装置
CN114089329A (zh) 一种基于长短焦相机与毫米波雷达融合的目标检测方法
CN115082450A (zh) 基于深度学习网络的路面裂缝检测方法和系统
CN113888748A (zh) 一种点云数据处理方法及装置
CN116612468A (zh) 基于多模态融合与深度注意力机制的三维目标检测方法
CN115147328A (zh) 三维目标检测方法及装置
CN114463736A (zh) 一种基于多模态信息融合的多目标检测方法及装置
CN114519772A (zh) 一种基于稀疏点云和代价聚合的三维重建方法及系统
WO2023164845A1 (zh) 三维重建方法、装置、系统及存储介质
CN114998610A (zh) 一种目标检测方法、装置、设备及存储介质
CN113421217A (zh) 可行驶区域检测方法和装置
CN114638996A (zh) 基于对抗学习的模型训练方法、装置、设备和存储介质
WO2024045942A1 (zh) 环境信息感知方法、装置、系统、计算机设备及存储介质
CN116466320A (zh) 目标检测方法及装置
CN115880659A (zh) 用于路侧系统的3d目标检测方法、装置及电子设备
CN116342677A (zh) 一种深度估计方法、装置、车辆及计算机程序产品
CN115861601A (zh) 一种多传感器融合感知方法及装置
CN115497061A (zh) 一种基于双目视觉的道路可行驶区域识别方法及装置
CN115240168A (zh) 感知结果获取方法、装置、计算机设备、存储介质
CN115346184A (zh) 一种车道信息检测方法、终端及计算机存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23858978

Country of ref document: EP

Kind code of ref document: A1