CN116012805A - Object perception method, apparatus, computer device, storage medium, and program product


Publication number
CN116012805A
CN116012805A / CN116012805B (application CN202310295860.1A)
Authority
CN
China
Prior art keywords
image
aerial view
feature map
images
feature
Prior art date
Legal status
Granted
Application number
CN202310295860.1A
Other languages
Chinese (zh)
Other versions
CN116012805B (en)
Inventor
居聪
郑伟
刘国清
杨广
王启程
Current Assignee
Shenzhen Youjia Innovation Technology Co ltd
Original Assignee
Shenzhen Minieye Innovation Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Minieye Innovation Technology Co Ltd
Priority to CN202310295860.1A
Publication of CN116012805A
Application granted
Publication of CN116012805B
Legal status: Active

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02T: Climate change mitigation technologies related to transportation
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to an object perception method, apparatus, computer device, storage medium and program product. The method comprises the following steps: acquiring a plurality of surround-view images of the environment around a vehicle body captured by image acquisition devices; determining a corresponding bird's-eye view image from the plurality of surround-view images; performing feature extraction on each of the surround-view images to obtain a surround-view image feature map corresponding to each image; determining a corresponding bird's-eye view image feature map and a bird's-eye view space three-dimensional grid from the bird's-eye view image; determining corresponding bird's-eye view space feature maps from the bird's-eye view space three-dimensional grid and the surround-view image feature maps of the individual images; and fusing the bird's-eye view space feature maps with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map, and performing object perception according to the bird's-eye view fused feature map. By adopting the method, the accuracy of object perception can be improved.

Description

Object perception method, apparatus, computer device, storage medium, and program product
Technical Field
The present application relates to the field of autonomous driving technology, and in particular to an object perception method, apparatus, computer device, storage medium, and program product.
Background
In the field of autonomous driving, it is particularly important for the vehicle to perceive the environment around the vehicle body in order to ensure driving safety.
In the related art, images are captured by cameras. If object perception is performed directly in the image space, it cannot be computed accurately, because camera imaging introduces perspective distortion and loses depth information, so the accuracy of object perception is low. How to perform object perception directly in the bird's-eye view space, which better captures information such as the orientation and distance of targets, and thereby improve perception accuracy, has therefore become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an object perception method, apparatus, computer device, storage medium, and program product that can perform object perception directly in the bird's-eye view space and improve perception accuracy.
In a first aspect, the present application provides an object perception method. The method comprises the following steps:
acquiring a plurality of surround-view images of the environment around a vehicle body captured by image acquisition devices;
determining a corresponding bird's-eye view image from the plurality of surround-view images;
performing feature extraction on each of the surround-view images to obtain a surround-view image feature map corresponding to each image;
determining a corresponding bird's-eye view image feature map and a bird's-eye view space three-dimensional grid from the bird's-eye view image;
determining a corresponding bird's-eye view space feature map from the bird's-eye view space three-dimensional grid and the surround-view image feature map corresponding to each of the surround-view images; and
fusing the bird's-eye view space feature maps with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map, and performing object perception according to the bird's-eye view fused feature map.
In one embodiment, determining the corresponding bird's-eye view image feature map and bird's-eye view space three-dimensional grid from the bird's-eye view image includes:
determining a perception range of the bird's-eye view image;
determining a space cube based on the perception range, and dividing the space cube to obtain the bird's-eye view space three-dimensional grid;
preprocessing the bird's-eye view image to obtain a bird's-eye view preprocessed image; and
performing feature extraction on the bird's-eye view preprocessed image to obtain the bird's-eye view image feature map.
In one embodiment, determining a corresponding bird's-eye view space feature map from the bird's-eye view space three-dimensional grid and the surround-view image feature map corresponding to each of the surround-view images includes:
determining the three-dimensional coordinates of the center point of each cell in the bird's-eye view space three-dimensional grid;
projecting the three-dimensional coordinates of the cell center points onto each surround-view image feature map based on pre-calibrated calibration parameters of the image acquisition devices and mapping parameters, to obtain corresponding projected pixel coordinates; and
performing feature sampling according to the projected pixel coordinates and the surround-view image feature maps to obtain the corresponding bird's-eye view space feature maps.
In one embodiment, fusing the bird's-eye view space feature maps with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map includes:
averaging the bird's-eye view space feature maps to obtain a processed bird's-eye view space feature map;
straightening and convolution-compressing the processed bird's-eye view space feature map to obtain a compressed bird's-eye view space feature map; and
fusing the compressed bird's-eye view space feature map with the bird's-eye view image feature map to obtain the bird's-eye view fused feature map.
In one embodiment, fusing the compressed bird's-eye view space feature map with the bird's-eye view image feature map to obtain the bird's-eye view fused feature map includes:
concatenating the compressed bird's-eye view space feature map and the bird's-eye view image feature map along the channel dimension to obtain the bird's-eye view fused feature map.
In one embodiment, determining a corresponding bird's-eye view image from the plurality of surround-view images includes:
performing inverse perspective transformation on each of the surround-view images and stitching the results to obtain the bird's-eye view image.
In a second aspect, the present application also provides an object perception device. The device comprises:
an image acquisition module, configured to acquire a plurality of surround-view images of the environment around the vehicle body captured by the image acquisition devices;
a bird's-eye view image determining module, configured to determine a corresponding bird's-eye view image from the plurality of surround-view images;
an extraction module, configured to perform feature extraction on each of the surround-view images to obtain a surround-view image feature map corresponding to each image;
a processing module, configured to determine a corresponding bird's-eye view image feature map and a bird's-eye view space three-dimensional grid from the bird's-eye view image;
a feature determining module, configured to determine a corresponding bird's-eye view space feature map from the bird's-eye view space three-dimensional grid and the surround-view image feature map corresponding to each of the surround-view images; and
a fusion module, configured to fuse the bird's-eye view space feature maps with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map, and to perform object perception according to the bird's-eye view fused feature map.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the above object perception method.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above object perception method.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above object perception method.
With the above object perception method, device, computer equipment, storage medium and program product, the corresponding bird's-eye view space feature maps are determined from the bird's-eye view space three-dimensional grid and the surround-view image feature maps of the individual surround-view images, and the bird's-eye view space feature maps are then fused with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map. The fused feature map therefore contains not only the rich semantic information of the individual images but also the bird's-eye-space target cues carried by the bird's-eye view image (such as the approximate position and category of targets in the bird's-eye view space), so that targets can be located conveniently and accurately, which in turn improves the accuracy of object perception.
Drawings
FIG. 1 is a diagram of an application environment for an object perception method in one embodiment;
FIG. 2 is a flow chart of an object perception method in one embodiment;
FIG. 3 is a schematic diagram of a plurality of surround-view images in one embodiment;
FIG. 4 is a schematic diagram of a stitched bird's-eye view image in one embodiment;
FIG. 5 is a flow chart of the steps for determining a bird's-eye view image feature map and a bird's-eye view space three-dimensional grid in one embodiment;
FIG. 6 is a schematic diagram of a bird's-eye view space three-dimensional grid in one embodiment;
FIG. 7 is a flow chart of the steps for determining a bird's-eye view space feature map in one embodiment;
FIG. 8 is a flow chart of the steps for determining a bird's-eye view fused feature map in one embodiment;
FIG. 9 is a flow chart of an object perception method in another embodiment;
FIG. 10 is a block diagram of an object perception device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The object perception method provided by the embodiments of the present application can be applied to the application environment shown in fig. 1. The smart car 102 is an automobile capable of autonomous driving and includes a plurality of on-board devices, such as a vehicle-mounted terminal, image acquisition devices, vehicle-mounted radar, and the like. The smart car 102 communicates with the server 104 via a network. The object perception method may be executed by the vehicle-mounted terminal of the smart car, or the collected data (such as images) may be uploaded to the server 104 through the network and the server 104 may execute the method; the method is described below using the vehicle-mounted terminal as an example. The data storage system may store data that the server 104 needs to process, such as the plurality of surround-view images. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The vehicle-mounted terminal acquires a plurality of surround-view images of the environment around the vehicle body captured by the image acquisition devices and determines a corresponding bird's-eye view image from them; it performs feature extraction on each of the surround-view images to obtain a surround-view image feature map for each image, determines a corresponding bird's-eye view image feature map and a bird's-eye view space three-dimensional grid from the bird's-eye view image, and determines the corresponding bird's-eye view space feature maps from the bird's-eye view space three-dimensional grid and the surround-view image feature maps; it then fuses the bird's-eye view space feature maps with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map and performs object perception according to the fused feature map. The server 104 may be implemented as a stand-alone server or as a server cluster comprising a plurality of servers.
In one embodiment, as shown in fig. 2, an object perception method is provided. The method is described here as applied to the vehicle-mounted terminal in fig. 1, and includes the following steps:
Step 202: acquire a plurality of surround-view images of the environment around the vehicle body captured by the image acquisition devices.
The image acquisition devices are devices mounted on the vehicle for capturing images, such as fisheye cameras, wide-angle cameras or pinhole cameras. Several image acquisition devices may be installed on the vehicle; for example, devices may be mounted at the front, rear, left and right of the vehicle to capture images of the corresponding directions.
The environment around the vehicle body refers to the environment in which the vehicle is located, such as the current surroundings of the smart car 102 with autonomous driving capability.
The plurality of surround-view images characterize the environment around the vehicle; each image is responsible for perceiving targets in an area at a fixed orientation relative to the vehicle.
For example, if image acquisition devices are installed in the four directions of the front, rear, left and right of the vehicle, they capture four images, one per direction. After capture, each image may be stored in local memory, or the image acquisition device may send the images over the network to the server 104 for storage in its data storage system, so that the images are archived, can be retrieved in real time, and can also be sent synchronously to the vehicle-mounted terminal. The vehicle-mounted terminal can obtain the plurality of surround-view images captured in real time (e.g., four images representing the front, rear, left and right directions) directly from the image acquisition devices through a network or a data transmission line, or can obtain the stored images from the data storage system of the server 104 through the network.
As shown in fig. 3, the four captured images may be: front, the image in front of the vehicle; rear, the image behind the vehicle; left, the image to the left of the vehicle; and right, the image to the right of the vehicle.
Step 204: determine a corresponding bird's-eye view image from the plurality of surround-view images.
The bird's-eye view image is an image of the scene around the vehicle as observed from a top-down viewpoint in the bird's-eye view space. The bird's-eye view image is proportional to the real space (e.g., a parking lot) in which the vehicle is located.
For example, the surround-view images may each be subjected to inverse perspective transformation and then stitched to obtain the corresponding bird's-eye view image.
For example, the image acquisition devices capture four images with a resolution of 800x1280 (800 high, 1280 wide). Taking the ground as the assumed plane, inverse perspective transformation is applied to the four images, which are then stitched into a bird's-eye view image with a resolution of 3360x2464. Each pixel in the bird's-eye view image represents 0.974 cm in real space, denoted by the scale factor scale = 0.974 cm; the 3360x2464 image thus covers about 16.3636 meters in front of and behind the vehicle and about 12 meters to its left and right.
For example, for the four images shown in fig. 3, the corresponding stitched bird's-eye view image may be as shown in fig. 4, where the black frame represents the vehicle.
In some embodiments, step 204 includes, but is not limited to: performing inverse perspective transformation on each of the surround-view images and stitching the results to obtain the bird's-eye view image.
Here, the inverse perspective transformation refers to inverse perspective mapping (IPM), i.e., an inverse of the camera imaging process based on a planar assumption.
Specifically, inverse perspective transformation is applied to each of the surround-view images, and the transformed images are stitched to obtain the bird's-eye view image.
For example, taking the ground as the assumed plane, the IPM algorithm is applied to the four surround-view images, and the results are stitched to obtain the corresponding bird's-eye view image.
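A minimal sketch of this warp for a single camera is given below. The four point correspondences, the homography and the canvas layout are illustrative assumptions only; in practice the homography follows from the calibrated camera parameters, the ground-plane assumption and the chosen scale described above.

```python
import cv2
import numpy as np

# Hypothetical correspondences between four pixels of one surround-view image and
# their ground-plane positions in bird's-eye-view pixels (x, y). Real values would
# be derived from the camera calibration, not hard-coded.
src_pts = np.float32([[320, 500], [960, 500], [120, 790], [1160, 790]])       # image pixels
dst_pts = np.float32([[1000, 400], [1460, 400], [1000, 1200], [1460, 1200]])  # BEV pixels

H = cv2.getPerspectiveTransform(src_pts, dst_pts)  # 3x3 homography for the assumed ground plane

def to_bev(image: np.ndarray) -> np.ndarray:
    """Warp one surround-view image onto the assumed ground plane (inverse perspective mapping)."""
    return cv2.warpPerspective(image, H, (2464, 3360))  # dsize is (width, height) of the BEV canvas

# Each of the four images is warped with its own homography; masking each camera's
# valid ground region and compositing the four warps yields the stitched
# 3360x2464 bird's-eye view image.
```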
Step 206: perform feature extraction on each of the surround-view images to obtain a surround-view image feature map corresponding to each image.
A surround-view image feature map represents the features of the corresponding surround-view image and can be obtained by feature extraction with a neural network model. In the computer, the feature map corresponding to each image is a high-dimensional matrix, for example a feature map of size Cx48x80: its spatial size is 48x80, giving 48x80 feature positions, and the C-dimensional feature vector at each position characterizes that position. A feature position can also be understood as a pixel position of the feature map.
For example, the surround-view images can be fed into a neural network for feature extraction, producing a surround-view image feature map that represents the features of the corresponding image.
In some embodiments, the step of performing feature extraction on each of the surround-view images to obtain the corresponding surround-view image feature map includes, but is not limited to, the following steps: preprocessing each of the surround-view images to obtain preprocessed surround-view images, feeding the preprocessed images into a neural network for a forward pass, and taking the network output as the surround-view image feature map corresponding to each image.
Preprocessing here means processing the surround-view images into suitable inputs before feature extraction. A suitable input balances the computational cost of the algorithm against an effective detection result. Preprocessing includes scale transformations such as image downscaling; data augmentation such as rotation, perspective and color-space transformations; and normalization such as subtracting the mean and dividing by the standard deviation. When the image resolution is too large, downscaling the image reduces the amount of computation required to process it. In the embodiment of the present application, the resolution of the surround-view images is 800x1280, so each image can be downscaled to reduce the computation required per image.
Specifically, each image is downscaled to obtain the preprocessed image, and each preprocessed image is then fed into the neural network for a forward pass to obtain the surround-view image feature map corresponding to that image.
For example, a convolutional neural network (CNN) may be used to extract the surround-view image feature map of each image; the same network structure and parameters are used for all images. If the original resolution of each image is 800x1280, each image is first downscaled by a factor of 2 to obtain a 400x640 image, and the bottom 16 pixel rows of the downscaled image are then cropped off to obtain a preprocessed image with resolution 384x640; this completes the preprocessing of the surround-view images. Each preprocessed image corresponds to a 3x384x640 input feature map, the 3 corresponding to the three channels of an RGB color image as stored in a computer. The input feature map of each image is fed into the convolutional neural network, and the forward pass produces a surround-view image feature map of size Cx48x80 for that image, where C is the number of channels, i.e., the length of the feature vector at each feature position after feature extraction.
Downscaling the surround-view images by a factor of 2 keeps the computation of the subsequent convolutional neural network manageable while ensuring that image targets do not become blurred, so the detection effect is preserved. Cropping the bottom rows of the downscaled image makes the image height divisible by 32 (2 to the 5th power), which guarantees that the network can downsample the image multiple times without misalignment (e.g., one downsampling takes [384, 640] to [192, 320]); the 48x80 surround-view feature map corresponds to an overall downsampling factor of 8. The image coordinate system has its origin at the upper-left corner, with coordinates increasing to the right and downward, so cropping pixels from the bottom of the image leaves the coordinates of the remaining pixels unchanged, which simplifies the subsequent projection computation, removes the uninformative region at the bottom of the image showing the vehicle's own body, and avoids losing information at the top of the image.
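As a concrete illustration, the preprocessing just described can be sketched as follows. The normalization statistics and the backbone(...) call are placeholders and assumptions, not details taken from this application.

```python
import cv2
import numpy as np

def preprocess_surround_image(img: np.ndarray) -> np.ndarray:
    """800x1280x3 surround-view image -> normalized 384x640x3 input."""
    h, w = img.shape[:2]                        # 800, 1280
    img = cv2.resize(img, (w // 2, h // 2))     # cv2 dsize is (width, height) -> 400x640
    img = img[:384, :, :]                       # crop the bottom 16 rows; the top-left
                                                # origin keeps remaining pixel coords unchanged
    img = img.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], np.float32)  # assumed normalization statistics
    std = np.array([0.229, 0.224, 0.225], np.float32)
    return (img - mean) / std

# backbone(...) stands in for the shared convolutional network; for each image its
# output is a C x 48 x 80 feature map (overall stride 8 relative to the 384x640 input).
# feats = [backbone(preprocess_surround_image(img)) for img in surround_images]
```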
Step 208: determine a corresponding bird's-eye view image feature map and a bird's-eye view space three-dimensional grid from the bird's-eye view image.
The bird's-eye view image feature map is a high-dimensional matrix that characterizes the features of the bird's-eye view image.
The bird's-eye view space three-dimensional grid defines the perception range of the bird's-eye view space and consists of many small stereoscopic cells.
For example, the bird's-eye view space corresponding to the bird's-eye view image can be initialized as a grid to obtain the bird's-eye view space three-dimensional grid, and the bird's-eye view image can be fed into a convolutional neural network for feature extraction to obtain the corresponding bird's-eye view image feature map.
Step 210: determine the corresponding bird's-eye view space feature maps from the bird's-eye view space three-dimensional grid and the surround-view image feature map corresponding to each of the surround-view images.
A bird's-eye view space feature map represents, in the bird's-eye view space, the features seen from the camera viewpoint of the corresponding surround-view image. In the bird's-eye view space the vehicle is at the center, and several image acquisition devices (cameras) are arranged around the vehicle body, each with its own field of view, i.e., its own perception area. The surround-view image feature map of an image therefore corresponds to that camera's field of view, and the bird's-eye view space feature map of the image corresponds to the features of the fixed region of the bird's-eye view space covered by that camera. For example, the front-view image can only image the real world in the front-view area, so its surround-view image feature map corresponds to the image-space features of the front-view area, and its bird's-eye view space feature map corresponds to the bird's-eye-space features of the front-view area.
Specifically, the coordinates of each cell are determined in the bird's-eye view space three-dimensional grid, the cell center coordinates are projected into each image, and features are sampled from the surround-view image feature map of each image at the projected coordinates, yielding the bird's-eye view space feature map corresponding to each image.
Step 212: fuse the bird's-eye view space feature maps with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map, and perform object perception according to the bird's-eye view fused feature map.
Feature fusion combines the features carried by different feature vectors or feature maps into a single target feature vector or feature map, so that the fused result carries the features of all of its inputs.
The bird's-eye view fused feature map is obtained by fusing the feature information of the bird's-eye view space feature map of each surround-view image with the feature information of the bird's-eye view image feature map. It therefore contains both the rich semantic information of the individual images and the bird's-eye-space target cues carried by the bird's-eye view image feature map (such as the approximate position and category of targets in the bird's-eye view space).
Specifically, the bird's-eye view space feature maps of the individual images may first be averaged to obtain a processed bird's-eye view space feature map; the processed feature map is then straightened and compressed by convolution to obtain a compressed bird's-eye view space feature map; the compressed feature map is fused with the bird's-eye view image feature map to obtain the corresponding bird's-eye view fused feature map, and object perception is performed on the fused feature map, for example perception tasks such as object detection, map segmentation and motion planning.
For example, object detection may include a training phase and an inference phase, wherein:
The training phase comprises: the bird's-eye view fused feature map is fed into a second-stage convolutional neural network (the networks that process the surround-view images and the bird's-eye view image are collectively called the first-stage networks) for a forward pass; the output of the second-stage network is supervised against the annotations provided for the object detection task, and the corresponding supervision loss is computed. The annotations include the position information and category information of the targets. The parameters of the first-stage and second-stage convolutional neural networks are then updated with the back-propagation algorithm according to the supervision loss, yielding the trained networks.
The inference phase comprises: the surround-view images and the bird's-eye view image stitched from them are fed into the trained convolutional neural networks to obtain the corresponding output. The output is decoded into the position information and category information of targets such as vehicles and pedestrians, where the position information corresponds to positions measured in the real-world BEV space, thereby achieving object detection.
With the above object perception algorithm, the corresponding bird's-eye view space feature maps are determined from the bird's-eye view space three-dimensional grid and the surround-view image feature maps of the individual images, and these are then fused with the bird's-eye view image feature map to obtain the bird's-eye view fused feature map. The fused feature map contains both the rich semantic information of the individual images and the bird's-eye-space target cues carried by the bird's-eye view image (such as the approximate position and category of targets in the bird's-eye view space), so targets can be located conveniently and accurately, which improves the accuracy of object perception.
Referring to fig. 5, in some embodiments, the step of determining the corresponding bird's-eye view image feature map and bird's-eye view space three-dimensional grid from the bird's-eye view image includes, but is not limited to, the following steps:
Step 502: determine the perception range of the bird's-eye view image.
The perception range is the extent of real space covered by the bird's-eye view image at its resolution.
For example, the resolution of the bird's-eye view image is 3360x2464 and each pixel corresponds to 0.974 cm in real space, so the corresponding perception range is 16.3636 m in front of and behind the vehicle and 12 m to its left and right; that is, with the vehicle at the center, the perception range is [-16.3636, 16.3636] longitudinally and [-12, 12] laterally.
Step 504: determine a space cube based on the perception range, and divide the space cube to obtain the bird's-eye view space three-dimensional grid.
Specifically, a three-dimensional vehicle-body coordinate system is established with the vehicle as the origin: the positive y-axis points to the front of the vehicle, the positive x-axis points to the left of the vehicle, the positive z-axis points to the top of the vehicle, and the xy-plane is parallel to the ground on which the vehicle stands. A space cube is then constructed according to the perception range and divided into preset voxel units to obtain the corresponding bird's-eye view space three-dimensional grid.
For example, referring to fig. 6, which is a schematic diagram of a bird's-eye view space three-dimensional grid provided in one embodiment, each small cube in fig. 6 represents one cell. When the perception range is [-16.3636, 16.3636] by [-12, 12], a space cube can be formed, with the vehicle as the origin, over the ranges x in [-12, 12], y in [-16.3636, 16.3636] and z in [-1.5, 2.5]. The cube is then divided into voxel units of size (28 x scale, 28 x scale, 1), with scale = 0.974 cm, giving an approximately 120x88x4 bird's-eye view space three-dimensional grid. The grid consists of cells whose edge length is 28 x scale in the x and y directions and 1 in the z direction, 120x88x4 cells in total.
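The grid construction in this example can be sketched as follows in NumPy. The cell ordering and the handling of the range boundaries are illustrative assumptions; the numeric ranges and cell sizes are those quoted above.

```python
import numpy as np

SCALE = 0.00974          # metres per bird's-eye-view pixel (0.974 cm)
CELL_XY = 28 * SCALE     # ~0.273 m cell edge in x and y
CELL_Z = 1.0             # 1 m cell height

# Perception range in the vehicle-body coordinate system (x left, y forward, z up).
x_rng, y_rng, z_rng = (-12.0, 12.0), (-16.3636, 16.3636), (-1.5, 2.5)

xs = np.arange(x_rng[0] + CELL_XY / 2, x_rng[1], CELL_XY)  # ~88 cell centres along x
ys = np.arange(y_rng[0] + CELL_XY / 2, y_rng[1], CELL_XY)  # ~120 cell centres along y
zs = np.arange(z_rng[0] + CELL_Z / 2, z_rng[1], CELL_Z)    # 4 cell centres along z

# Cell-centre coordinates of the (roughly) 120 x 88 x 4 grid, ordered (y, x, z)
# to match the 120 x 88 x 4 feature-map layout used later.
grid_y, grid_x, grid_z = np.meshgrid(ys, xs, zs, indexing="ij")
cell_centers = np.stack([grid_x, grid_y, grid_z], axis=-1)   # shape (120, 88, 4, 3)
```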
Step 506: preprocess the bird's-eye view image to obtain a bird's-eye view preprocessed image.
Here, preprocessing the bird's-eye view image turns it into a bird's-eye view preprocessed image that balances the computational cost of the algorithm against an effective detection result. Preprocessing includes scale transformations such as downscaling; data augmentation such as rotation, perspective and color-space transformations; and normalization such as subtracting the mean and dividing by the standard deviation. When the image resolution is too large, downscaling reduces the amount of computation required to process the image.
The bird's-eye view preprocessed image is the bird's-eye view image after preprocessing.
In the embodiment of the present application, the resolution of the bird's-eye view image is large, so it needs to be downscaled; this constitutes the preprocessing of the bird's-eye view image and yields the corresponding bird's-eye view preprocessed image.
For example, the original resolution of the bird's-eye view image is 3360x2464; downscaling by a factor of 7 gives a bird's-eye view preprocessed image with a resolution of 480x352.
Step 508: perform feature extraction on the bird's-eye view preprocessed image to obtain the bird's-eye view image feature map.
Feature extraction extracts the features of the bird's-eye view image, including texture and semantic features. Since the bird's-eye view preprocessed image is obtained by preprocessing the bird's-eye view image, the extracted features include abstract features such as the texture and semantics of the bird's-eye view image.
For example, the bird's-eye view preprocessed image may be fed into a convolutional neural network, and a forward pass through the network extracts its features, yielding the corresponding bird's-eye view image feature map.
For example, with a bird's-eye view preprocessed image of resolution 480x352, the input is a 3x480x352 input feature map, the 3 corresponding to the three channels of the color image. The convolutional neural network processes the input feature map to produce a Cx120x88 bird's-eye view image feature map, where C is the number of channels.
The model parameters of the convolutional neural network that processes the bird's-eye view preprocessed image are different from those of the network that processes the surround-view images.
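A minimal sketch of the bird's-eye view branch's preprocessing, assuming simple division-by-255 normalization; bev_backbone(...) is a placeholder for the separate convolutional network mentioned above, whose weights are not shared with the surround-view backbone.

```python
import cv2
import numpy as np

def preprocess_bev_image(bev_img: np.ndarray) -> np.ndarray:
    """3360x2464x3 bird's-eye view image -> 480x352x3 preprocessed input (7x reduction)."""
    h, w = bev_img.shape[:2]                       # 3360, 2464
    small = cv2.resize(bev_img, (w // 7, h // 7))  # cv2 dsize is (width, height) -> 480x352
    return small.astype(np.float32) / 255.0        # normalization scheme is an assumption

# At an overall stride of 4 (480/120 = 352/88 = 4), bev_backbone maps the 3x480x352
# input to a C x 120 x 88 bird's-eye view image feature map.
# bev_img_feat = bev_backbone(preprocess_bev_image(bev_image))
```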
With the above technical solution, the perception range of the bird's-eye view image is determined, a space cube is constructed from it, and the cube is divided to obtain the bird's-eye view space three-dimensional grid; this grid defines the positions at which the image features extracted from the surround-view images are transferred from image space into a bird's-eye view space proportional to the real world, so that object perception can be performed directly in the bird's-eye view space and the automobile can better perceive its surroundings. The bird's-eye view image feature map is obtained by preprocessing the bird's-eye view image and extracting its features, which makes it convenient to fuse it with the bird's-eye view space feature maps of the individual images into the corresponding bird's-eye view fused feature map, improving the accuracy of object perception.
Referring to fig. 7, in some embodiments, the step of determining the corresponding bird's-eye view space feature map from the bird's-eye view space three-dimensional grid and the surround-view image feature map corresponding to each of the surround-view images includes, but is not limited to, the following steps:
Step 702: determine the three-dimensional coordinates of the center points of the cells in the bird's-eye view space three-dimensional grid.
A cell center point is the geometric center of the cell; for example, when the cell is a small cube or cuboid, the cell center point is its body center.
Specifically, the center point of each cell is found first, and the three-dimensional coordinate of each center point is then obtained in the three-dimensional coordinate system established in the previous step.
Step 704: based on pre-calibrated calibration parameters of the image acquisition devices and the mapping parameters, project the three-dimensional coordinates of the cell center points onto each surround-view image feature map to obtain the corresponding projected pixel coordinates.
The mapping parameter represents the scaling relationship between each surround-view image and its feature map; it combines the scaling factor of the image preprocessing and the downsampling factor of the neural network's feature extraction. The mapping parameters may be preset in the terminal or the server, or computed for the situation at hand.
For example, during preprocessing each surround-view image of resolution 800x1280 is downscaled by a factor of 2 to a 400x640 image, and the cropping step, which does not affect pixel coordinates, yields a 384x640 image; the neural network model then downsamples by a factor of 8 to produce a 48x80 feature map, so the corresponding mapping parameter is 1/2 x 1/8 = 1/16.
The calibration parameters comprise the intrinsic and extrinsic parameters of the image acquisition devices, which are calibrated in advance. For example, when the image acquisition device is a camera, the calibration parameters include the camera intrinsics and extrinsics: the intrinsics are used to project three-dimensional coordinates in the camera coordinate system onto the pixel coordinate system of the image, and the extrinsics are used to transform three-dimensional coordinates from the vehicle-body coordinate system into the camera coordinate system.
Specifically, the three-dimensional coordinates of all cell center points are first transformed into each camera's coordinate system according to the calibrated camera extrinsics; the resulting camera-frame coordinates are then projected into the image coordinate system according to the camera intrinsics to obtain pixel coordinates; finally, the pixel coordinates on the corresponding surround-view image feature map are computed from the mapping parameter, and these are taken as the projected pixel coordinates corresponding to each image.
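A sketch of this projection chain under a pinhole-camera assumption (a fisheye camera would require a different projection model); cam_to_body, K and map_param are illustrative names, not identifiers from this application.

```python
import numpy as np

def project_to_feature_map(cell_centers, cam_to_body, K, map_param=1.0 / 16.0):
    """Project BEV cell centres into one camera's surround-view feature map.

    cell_centers : (..., 3) points in the vehicle-body frame
    cam_to_body  : 4x4 extrinsic matrix (camera pose expressed in the body frame)
    K            : 3x3 pinhole intrinsic matrix (pinhole model assumed)
    map_param    : image-pixel to feature-map-pixel scale (1/2 resize x 1/8 stride)
    """
    pts = cell_centers.reshape(-1, 3)
    body_to_cam = np.linalg.inv(cam_to_body)
    pts_cam = (body_to_cam[:3, :3] @ pts.T + body_to_cam[:3, 3:4]).T   # body frame -> camera frame
    in_front = pts_cam[:, 2] > 1e-3                                    # keep only points ahead of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-3, None)                   # perspective divide -> image pixels
    uv_feat = uv * map_param                                           # image pixels -> feature-map pixels
    return uv_feat.reshape(*cell_centers.shape[:-1], 2), in_front.reshape(cell_centers.shape[:-1])
```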
Step 706: perform feature sampling according to the projected pixel coordinates and the surround-view image feature maps to obtain the corresponding bird's-eye view space feature maps.
The number of projected pixel coordinates equals the number of cells of the bird's-eye view space three-dimensional grid, and each projected pixel coordinate is in general a floating-point value. Feature sampling is a way of further selecting and extracting the features carried by a feature map, for example bilinear interpolation or nearest-neighbor interpolation.
For example, consider a single projected pixel coordinate for one image. Using bilinear interpolation, the C-dimensional feature vector at that projected coordinate is computed from the surround-view image feature map of that image, and this vector becomes the feature vector of the grid cell from which the projected coordinate originated. Doing this for all projected pixel coordinates of a single image gives a C-dimensional feature vector for each of the 120x88x4 cells, which together form a Cx120x88x4 bird's-eye view space feature map. The four images yield four such Cx120x88x4 bird's-eye view space feature maps, i.e., data of shape 4xCx120x88x4, where the first 4 is the number of images, the last 4 is the number of height layers of the bird's-eye view space feature map, and C is the number of channels.
It should be understood that projecting the center points of all cells of the bird's-eye view space three-dimensional grid onto the surround-view image feature map of each image to obtain the corresponding feature vectors yields the bird's-eye view space feature map of that image. For a given image, however, not all cell center points project validly onto its feature map. The front-view camera, for instance, only images the real world within a certain field of view of the front area, so only the center points of cells in a certain front region of the bird's-eye view space three-dimensional grid project onto the surround-view image feature map of the front-view image and obtain feature vectors; the remaining, invalid cell center points are initialized with an all-zero C-dimensional feature vector.
Accordingly, of the four Cx120x88x4 bird's-eye view space feature maps, the one corresponding to the front-view image obtains valid feature vectors only in the upper region of the map, the one corresponding to the left-view image only in the left region, the one corresponding to the right-view image only in the right region, and the one corresponding to the rear-view image only in the rear region.
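The sampling and zero-filling just described can be sketched as follows; bilinear_sample and the shapes are illustrative, and in a full pipeline the in_front mask from the projection sketch above would also zero out cells behind the camera.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Sample a C x H x W feature map at floating-point feature-map coordinates uv of shape (N, 2)."""
    C, H, W = feat.shape
    x, y = uv[:, 0], uv[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    valid = (x0 >= 0) & (y0 >= 0) & (x1 < W) & (y1 < H)        # inside this camera's feature map
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    wa = (x1 - x) * (y1 - y)
    wb = (x - x0) * (y1 - y)
    wc = (x1 - x) * (y - y0)
    wd = (x - x0) * (y - y0)
    out = (feat[:, y0c, x0c] * wa + feat[:, y0c, x1c] * wb +
           feat[:, y1c, x0c] * wc + feat[:, y1c, x1c] * wd)
    out[:, ~valid] = 0.0   # cells that project outside this camera's view get all-zero features
    return out             # shape (C, N)

# Per image: sample its 48x80 feature map at the projected coordinates of all 120*88*4
# cell centres, then reshape to one C x 120 x 88 x 4 bird's-eye view space feature map.
```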
With the above technical solution, the projection from three-dimensional coordinates in the bird's-eye view space onto the surround-view image feature maps is determined by the calibration parameters and the mapping parameters, and projection sampling transfers the features in the surround-view image feature maps from image space into the bird's-eye view space, yielding the bird's-eye view space feature maps. Object perception can then be performed directly in the bird's-eye view space using these feature maps, which makes the vehicle's perception of its surroundings more direct and more convenient.
Referring to fig. 8, in some embodiments, the step of fusing the bird's-eye view space feature maps of the individual images with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map, and performing object perception according to the fused feature map, includes, but is not limited to, the following steps:
Step 802: average the bird's-eye view space feature maps to obtain a processed bird's-eye view space feature map.
Specifically, the data dimension representing the number of cameras is identified first, the mean is taken over that dimension, and the result is then rearranged to obtain the processed bird's-eye view space feature map.
For example, the bird's-eye view space feature maps of the individual images form a 4xCx120x88x4 tensor. Taking the mean over the first dimension (the dimension representing the number of cameras) gives a Cx120x88x4 feature map, and a dimension rearrangement then gives a processed bird's-eye view space feature map of shape Cx4x120x88.
Step 804: straighten and convolution-compress the processed bird's-eye view space feature map to obtain a compressed bird's-eye view space feature map.
Specifically, the processed bird's-eye view space feature map is straightened into a high-dimensional feature map and then compressed by convolution to obtain the compressed feature map.
For example, after the Cx4x120x88 processed bird's-eye view space feature map is obtained, it is first straightened into a (Cx4)x120x88 feature map, where Cx4 is a high-dimensional feature; a 1x1 convolution then compresses it into C dimensions, giving a compressed bird's-eye view space feature map of shape Cx120x88.
Step 806: fuse the compressed bird's-eye view space feature map with the bird's-eye view image feature map to obtain the bird's-eye view fused feature map.
Specifically, the compressed bird's-eye view space feature map and the bird's-eye view image feature map may be concatenated along the channel dimension to obtain the bird's-eye view fused feature map.
For example, the compressed bird's-eye view space feature map has size Cx120x88, and the real-space extent of each feature point is (28 x scale, 28 x scale), i.e., the map covers about 16.3636 meters to the front and rear and about 12 meters to the left and right. The original resolution of the bird's-eye view image is 3360x2464 with one pixel representing scale = 0.974 cm of real space, which is consistent with this extent, and the bird's-eye view image feature map also has size Cx120x88, so its scale matches that of the compressed feature map. The two feature maps are therefore aligned and can be directly concatenated along the channel dimension to obtain the bird's-eye view fused feature map, whose size is 2Cx120x88.
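The averaging, straightening, 1x1 compression and channel concatenation of steps 802 to 806 can be sketched as follows with random placeholder tensors; the weight matrix stands in for the learned 1x1 convolution and the channel count C is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 64                                                 # channel count, illustrative

bev_space = rng.standard_normal((4, C, 120, 88, 4)).astype(np.float32)  # one map per camera
bev_img_feat = rng.standard_normal((C, 120, 88)).astype(np.float32)     # BEV image feature map

x = bev_space.mean(axis=0)                 # average over the camera dimension -> C x 120 x 88 x 4
x = np.transpose(x, (0, 3, 1, 2))          # rearrange -> C x 4 x 120 x 88
x = x.reshape(C * 4, 120, 88)              # "straighten" the height layers into channels -> (C*4) x 120 x 88

W1x1 = rng.standard_normal((C, C * 4)).astype(np.float32) * 0.01        # assumed 1x1 conv weights
compressed = np.einsum("oc,chw->ohw", W1x1, x)   # 1x1 convolution = per-pixel linear map -> C x 120 x 88

fused = np.concatenate([compressed, bev_img_feat], axis=0)  # channel concatenation -> 2C x 120 x 88
```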
With the above technical solution, the bird's-eye view space feature maps obtained by sampling features projected from image space are channel-concatenated with the bird's-eye view image feature map to obtain the bird's-eye view fused feature map, so that the fused feature map contains both the rich semantic information of the surround-view images and the target cues in the bird's-eye view image (such as the approximate position and category of targets in the bird's-eye view space), which improves the accuracy of object perception.
In some embodiments, the step of fusing the compressed bird's-eye view space feature map with the bird's-eye view image feature map to obtain the bird's-eye view fused feature map includes, but is not limited to, the following step: concatenating the compressed bird's-eye view space feature map and the bird's-eye view image feature map along the channel dimension to obtain the bird's-eye view fused feature map.
Specifically, the operation used to fuse the compressed bird's-eye view space feature map with the bird's-eye view image feature map is channel concatenation, so the two feature maps are concatenated along the channel dimension to obtain the bird's-eye view fused feature map.
Referring to fig. 9, in some embodiments, an object perception method is provided, including but not limited to the following steps:
Step 902: acquire a plurality of surround-view images of the environment around the vehicle body captured by the image acquisition devices.
Step 904: perform inverse perspective transformation on each of the surround-view images and stitch the results to obtain a bird's-eye view image.
Step 906: determine the perception range of the bird's-eye view image.
Step 908: determine a space cube based on the perception range, and divide the space cube to obtain the bird's-eye view space three-dimensional grid.
Step 910: preprocess the bird's-eye view image to obtain a bird's-eye view preprocessed image.
Step 912: perform feature extraction on the bird's-eye view preprocessed image to obtain a bird's-eye view image feature map.
Step 914: perform feature extraction on each of the surround-view images to obtain a surround-view image feature map corresponding to each image.
Step 916: determine the three-dimensional coordinates of the center points of the cells in the bird's-eye view space three-dimensional grid.
Step 918: based on pre-calibrated calibration parameters of the image acquisition devices and the mapping parameters, project the three-dimensional coordinates of the cell center points onto each surround-view image feature map to obtain the corresponding projected pixel coordinates.
Step 920: perform feature sampling according to the projected pixel coordinates and the surround-view image feature maps to obtain the corresponding bird's-eye view space feature maps.
Step 922: average the bird's-eye view space feature maps to obtain a processed bird's-eye view space feature map.
Step 924: straighten and convolution-compress the processed bird's-eye view space feature map to obtain a compressed bird's-eye view space feature map.
Step 926: concatenate the compressed bird's-eye view space feature map and the bird's-eye view image feature map along the channel dimension to obtain a bird's-eye view fused feature map.
The details of steps 902 to 926 are described in the embodiments of fig. 1 to 8 and are not repeated here.
With the above technical solution, the bird's-eye view image and the surround-view images are preprocessed so that the corresponding preprocessed images allow the algorithm to achieve an adequate detection effect at a reasonable computational cost. The three-dimensional coordinates of the cell center points are projected onto the surround-view image feature map of each image using the pre-calibrated calibration parameters and the mapping parameters of the image acquisition devices to obtain the corresponding projected pixel coordinates, which makes it convenient to transfer the surround-view image features extracted from each image from image space into a bird's-eye view space proportional to the real world according to those projected coordinates; this enables object perception in the bird's-eye view space and makes the vehicle's perception of its surroundings more direct and convenient. The bird's-eye view space feature maps are fused with the bird's-eye view image feature map to obtain a bird's-eye view fused feature map, so that the fused feature map contains both the rich semantic information of the individual images and the bird's-eye-space target cues carried in the bird's-eye view image feature map (such as the approximate position and category of targets in the bird's-eye view space), making it convenient to locate targets accurately and further improving the accuracy of object perception.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in those flowcharts may comprise several sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily executed sequentially but may be performed in turn or in alternation with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application also provides an object perception device for implementing the above object perception method. The implementation of the solution provided by this device is similar to the implementation described for the method above.
In one embodiment, as shown in fig. 10, there is provided an object sensing device comprising: an image acquisition module 1002, a bird's-eye image determination module 1004, an extraction module 1006, a processing module 1008, a feature determination module 1010, and a fusion module 1012, wherein:
The image acquisition module 1002 is configured to acquire a plurality of looking-around images of the surrounding environment of the vehicle body acquired by the image acquisition device.
The bird's-eye view image determining module 1004 is configured to determine a corresponding bird's-eye view image according to the looking-around multiple images.
The extracting module 1006 is configured to perform feature extraction on each image in the looking-around multiple images to obtain a looking-around image feature map corresponding to each image.
And the processing module 1008 is used for determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image.
The feature determining module 1010 is configured to determine a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and an annular view image feature map corresponding to each of the annular view multiple images.
And the fusion module 1012 is used for fusing the aerial view space feature images and the aerial view image feature images to obtain an aerial view fusion feature image, and performing target perception according to the aerial view fusion feature image.
In some embodiments, the processing module 1008 is further configured to determine a perception range of the bird's eye image; determining a space cube based on the perception range, and cutting the space cube to obtain a bird's-eye space three-dimensional grid; preprocessing the aerial view image to obtain an aerial view preprocessed image; and extracting features of the aerial view pretreatment image to obtain an aerial view image feature map.
In some embodiments, the feature determining module 1010 is further configured to determine the three-dimensional coordinates of the center point of each cell in the aerial view space three-dimensional grid; project, based on calibration parameters and mapping parameters calibrated in advance for the image acquisition device, the three-dimensional coordinates of the cell center points onto each looking-around image feature map to obtain corresponding projected pixel coordinates; and perform feature sampling according to the projected pixel coordinates and the looking-around image feature maps to obtain a corresponding aerial view space feature map.
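A minimal sketch of this projection-and-sampling idea is given below, assuming the calibration and mapping parameters can be reduced to a single 3x4 pinhole projection matrix per camera and that the feature map covers the same field of view as the original image; the function name and tensor shapes are illustrative, not taken from this application:

```python
import torch
import torch.nn.functional as F

def sample_bev_features(cell_centers, feat_map, proj_matrix, img_hw):
    """Project 3D cell centers into one camera and bilinearly sample its feature map.
    cell_centers: (N, 3), feat_map: (C, Hf, Wf), proj_matrix: (3, 4), img_hw: (H, W)."""
    n = cell_centers.shape[0]
    homog = torch.cat([cell_centers, torch.ones(n, 1)], dim=1)      # (N, 4) homogeneous points
    uvw = homog @ proj_matrix.T                                      # (N, 3)
    valid = uvw[:, 2] > 1e-6                                         # keep points in front of the camera
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                    # projected pixel coordinates (u, v)
    h, w = img_hw
    # Normalize pixel coordinates to [-1, 1] so grid_sample can look up the feature map.
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,
                        uv[:, 1] / (h - 1) * 2 - 1], dim=-1)         # (N, 2)
    sampled = F.grid_sample(feat_map[None], grid[None, :, None, :],
                            align_corners=True)                      # (1, C, N, 1)
    feats = sampled[0, :, :, 0].T                                    # (N, C)
    feats[~valid] = 0.0                                              # zero out points behind the camera
    return feats

# Toy usage with random data.
pts = torch.rand(100, 3) * 10
fmap = torch.rand(64, 48, 80)
P = torch.tensor([[500., 0., 400., 0.], [0., 500., 300., 0.], [0., 0., 1., 0.]])
print(sample_bev_features(pts, fmap, P, (600, 800)).shape)  # torch.Size([100, 64])
```

In practice one such sampling pass would be run per looking-around camera, and the per-camera results aggregated into the aerial view space feature map.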
In some embodiments, the fusion module 1012 is further configured to perform mean processing on the aerial view space feature map to obtain a processed aerial view space feature map; perform straightening and convolution compression on the processed aerial view space feature map to obtain a compressed aerial view space feature map; and fuse the compressed aerial view space feature map with the aerial view image feature map to obtain the aerial view fused feature map.
In some embodiments, the fusion module 1012 is further configured to perform channel stitching on the compressed aerial view space feature map and the aerial view image feature map to obtain the aerial view fused feature map.
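The following sketch illustrates, under assumed tensor shapes and channel sizes, one possible reading of the mean processing, straightening, convolution compression, and channel stitching described in the two embodiments above; averaging over the cameras and using a 1x1 convolution are assumptions made for the example:

```python
import torch
import torch.nn as nn

# Assumed shapes: BEV spatial features (B, N_cam, C, Z, X, Y) and BEV image features (B, Cb, X, Y).
bev_space = torch.rand(2, 4, 64, 6, 40, 40)   # 4 looking-around cameras, 6 height cells
bev_img = torch.rand(2, 32, 40, 40)

mean_feat = bev_space.mean(dim=1)                 # mean processing over the cameras -> (B, C, Z, X, Y)
b, c, z, x, y = mean_feat.shape
straight = mean_feat.reshape(b, c * z, x, y)      # "straighten" the height dimension into channels
compress = nn.Conv2d(c * z, 64, kernel_size=1)    # convolution compression of the stacked channels
compressed = compress(straight)                   # (B, 64, X, Y)
fused = torch.cat([compressed, bev_img], dim=1)   # channel stitching -> (B, 96, X, Y)
print(fused.shape)  # torch.Size([2, 96, 40, 40])
```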
In some embodiments, the bird's-eye view image determining module 1004 is further configured to perform inverse perspective transformation on each of the plurality of looking-around images and stitch the transformed images to obtain the aerial view image.
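As a hedged illustration of the inverse perspective transformation and stitching, the sketch below warps each looking-around image onto the ground plane with a homography and overlays the results; in practice the homographies would be derived from the pre-calibrated parameters of the image acquisition device, and the identity matrices used here are placeholders only:

```python
import cv2
import numpy as np

def stitch_bev(images, homographies, bev_size):
    """Warp each looking-around image to the ground plane (inverse perspective
    transformation) and overlay the warped images into one bird's-eye view image."""
    bev = np.zeros((bev_size[1], bev_size[0], 3), dtype=np.uint8)
    for img, H in zip(images, homographies):
        warped = cv2.warpPerspective(img, H, bev_size)
        mask = warped.sum(axis=2) > 0          # naive stitching: the last non-empty pixel wins
        bev[mask] = warped[mask]
    return bev

# Toy usage: four random "camera" images and placeholder homographies.
imgs = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
Hs = [np.eye(3) for _ in range(4)]
print(stitch_bev(imgs, Hs, (800, 800)).shape)  # (800, 800, 3)
```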
The modules in the above object sensing device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 11. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a target perception method. The display unit of the computer device is used to form a visual picture, and may be a display screen, a projection device or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad provided on the housing of the computer device, or an external keyboard, touch pad or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, and the processor, when executing the computer program, performs the steps of: acquiring a plurality of looking-around images of the surrounding environment of the vehicle body acquired by image acquisition equipment; determining a corresponding aerial view image according to the plurality of looking around images; feature extraction is carried out on each image in the plurality of looking around images to obtain a looking around image feature map corresponding to each image; determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image; determining a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and the surrounding image feature map corresponding to each image in the surrounding multiple images; and fusing the aerial view space feature images and the aerial view image feature images to obtain an aerial view fused feature image, and performing target perception according to the aerial view fused feature image.
In one embodiment, the processor when executing the computer program further performs the steps of: determining a perception range of the aerial view image; determining a space cube based on the perception range, and cutting the space cube to obtain a bird's-eye space three-dimensional grid; preprocessing the aerial view image to obtain an aerial view preprocessed image; and extracting features of the aerial view preprocessed image to obtain an aerial view image feature map.
In one embodiment, the processor when executing the computer program further performs the steps of: determining three-dimensional coordinates of center points of all cells in the aerial view space three-dimensional grid; based on calibration parameters and mapping parameters calibrated in advance by the image acquisition equipment, respectively projecting three-dimensional coordinates of the central points of all the cells onto the looking-around image feature map to obtain corresponding projected pixel coordinates; and performing feature sampling according to the coordinates of each projection pixel and the surrounding image feature map to obtain a corresponding aerial view space feature map.
In one embodiment, the processor when executing the computer program further performs the steps of: carrying out mean value processing on the aerial view space feature map to obtain a processed aerial view space feature map; straightening and convolution compression processing is carried out on the processed aerial view space feature map, and a compressed aerial view space feature map is obtained; and fusing the compressed aerial view space feature map and the aerial view image feature map to obtain an aerial view fused feature map.
In one embodiment, the processor when executing the computer program further performs the steps of: and performing channel stitching on the compressed aerial view space feature map and the aerial view image feature map to obtain an aerial view fusion feature map.
In one embodiment, the processor when executing the computer program further performs the steps of: and respectively carrying out inverse perspective transformation on the plurality of surrounding images, and splicing to obtain a bird's-eye view image.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, performs the steps of: acquiring a plurality of looking-around images of the surrounding environment of the vehicle body acquired by image acquisition equipment; determining a corresponding aerial view image according to the plurality of looking around images; feature extraction is carried out on each image in the plurality of looking around images to obtain a looking around image feature map corresponding to each image; determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image; determining a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and the surrounding image feature map corresponding to each image in the surrounding images; and fusing the aerial view space feature images and the aerial view image feature images to obtain an aerial view fused feature image, and performing target perception according to the aerial view fused feature image.
In one embodiment, the computer program when executed by the processor further performs the steps of: determining a perception range of the aerial view image; determining a space cube based on the perception range, and cutting the space cube to obtain a bird's-eye space three-dimensional grid; preprocessing the aerial view image to obtain an aerial view preprocessed image; and extracting features of the aerial view preprocessed image to obtain an aerial view image feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: determining three-dimensional coordinates of center points of all cells in the aerial view space three-dimensional grid; based on calibration parameters and mapping parameters calibrated in advance by the image acquisition equipment, respectively projecting three-dimensional coordinates of the central points of all the cells onto the looking-around image feature map to obtain corresponding projected pixel coordinates; and performing feature sampling according to the coordinates of each projection pixel and the surrounding image feature map to obtain a corresponding aerial view space feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: carrying out mean value processing on the aerial view space feature map to obtain a processed aerial view space feature map; straightening and convolution compression processing is carried out on the processed aerial view space feature map, and a compressed aerial view space feature map is obtained; and fusing the compressed aerial view space feature map and the aerial view image feature map to obtain an aerial view fused feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: and performing channel stitching on the compressed aerial view space feature map and the aerial view image feature map to obtain an aerial view fusion feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: and respectively carrying out inverse perspective transformation on the plurality of surrounding images, and splicing to obtain a bird's-eye view image.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of: acquiring a plurality of looking-around images of the surrounding environment of the vehicle body acquired by image acquisition equipment; determining a corresponding aerial view image according to the plurality of looking around images; feature extraction is carried out on each image in the plurality of looking around images to obtain a looking around image feature map corresponding to each image; determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image; determining a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and the surrounding image feature map corresponding to each image in the surrounding multiple images; and fusing the aerial view space feature images and the aerial view image feature images to obtain an aerial view fused feature image, and performing target perception according to the aerial view fused feature image.
In one embodiment, the computer program when executed by the processor further performs the steps of: determining a perception range of the aerial view image; determining a space cube based on the perception range, and cutting the space cube to obtain a bird's-eye space three-dimensional grid; preprocessing the aerial view image to obtain an aerial view preprocessed image; and extracting features of the aerial view preprocessed image to obtain an aerial view image feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: determining three-dimensional coordinates of center points of all cells in the aerial view space three-dimensional grid; based on calibration parameters and mapping parameters calibrated in advance by the image acquisition equipment, respectively projecting three-dimensional coordinates of the central points of all the cells onto the looking-around image feature map to obtain corresponding projected pixel coordinates; and performing feature sampling according to the coordinates of each projection pixel and the surrounding image feature map to obtain a corresponding aerial view space feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: carrying out mean value processing on the aerial view space feature map to obtain a processed aerial view space feature map; straightening and convolution compression processing is carried out on the processed aerial view space feature map, and a compressed aerial view space feature map is obtained; and fusing the compressed aerial view space feature map and the aerial view image feature map to obtain an aerial view fused feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: and performing channel stitching on the compressed aerial view space feature map and the aerial view image feature map to obtain an aerial view fusion feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of: and respectively carrying out inverse perspective transformation on the plurality of surrounding images, and splicing to obtain a bird's-eye view image.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, the combinations should be considered to be within the scope of this specification.
The above embodiments merely represent several implementations of the present application, and although they are described in relative detail, they should not be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and such modifications and improvements all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of target perception, the method comprising:
acquiring a plurality of looking-around images of the surrounding environment of the vehicle body acquired by image acquisition equipment;
determining a corresponding aerial view image according to the plurality of looking around images;
performing feature extraction on each image in the plurality of looking-around images to obtain a looking-around image feature map corresponding to each image;
determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image;
determining a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and the looking-around image feature map corresponding to each image in the plurality of looking-around images;
and fusing the aerial view space feature map with the aerial view image feature map to obtain an aerial view fused feature map, and performing target perception according to the aerial view fused feature map.
2. The method according to claim 1, wherein the determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image comprises:
determining a perception range of the aerial view image;
determining a space cube based on the perception range, and cutting the space cube to obtain the aerial view space three-dimensional grid;
preprocessing the aerial view image to obtain an aerial view preprocessed image;
and extracting features of the aerial view preprocessed image to obtain the aerial view image feature map.
3. The method according to claim 1, wherein the determining a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and the looking-around image feature map corresponding to each image in the plurality of looking-around images comprises:
determining three-dimensional coordinates of center points of all cells in the aerial view space three-dimensional grid;
based on calibration parameters and mapping parameters calibrated in advance by the image acquisition equipment, respectively projecting the three-dimensional coordinates of the center points of the cells onto the looking-around image feature map corresponding to each image to obtain corresponding projected pixel coordinates;
and performing feature sampling according to the projected pixel coordinates and the looking-around image feature map to obtain a corresponding aerial view space feature map.
4. The method according to any one of claims 1 to 3, wherein the fusing the aerial view space feature map with the aerial view image feature map to obtain an aerial view fused feature map comprises:
performing mean value processing on the aerial view space feature map to obtain a processed aerial view space feature map;
performing straightening and convolution compression processing on the processed aerial view space feature map to obtain a compressed aerial view space feature map;
and fusing the compressed aerial view space feature map with the aerial view image feature map to obtain the aerial view fused feature map.
5. The method according to claim 4, wherein the fusing the compressed aerial view space feature map and the aerial view image feature map to obtain the aerial view fused feature map comprises:
performing channel stitching on the compressed aerial view space feature map and the aerial view image feature map to obtain the aerial view fused feature map.
6. The method according to any one of claims 1 to 3, wherein the determining a corresponding aerial view image according to the plurality of looking-around images comprises:
and respectively carrying out inverse perspective transformation on the plurality of looking-around images, and splicing to obtain the aerial view image.
7. An object sensing device, the device comprising:
the image acquisition module is used for acquiring a plurality of looking-around images of the surrounding environment of the vehicle body acquired by the image acquisition equipment;
the bird's-eye view image determining module is used for determining a corresponding bird's-eye view image according to the plurality of looking around images;
the extraction module is used for performing feature extraction on each image in the plurality of looking-around images to obtain a looking-around image feature map corresponding to each image;
the processing module is used for determining a corresponding aerial view image feature map and an aerial view space three-dimensional grid according to the aerial view image;
the feature determining module is used for determining a corresponding aerial view space feature map according to the aerial view space three-dimensional grid and the looking-around image feature map corresponding to each image in the plurality of looking-around images;
and the fusion module is used for fusing the aerial view space feature map with the aerial view image feature map to obtain an aerial view fused feature map, and performing target perception according to the aerial view fused feature map.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202310295860.1A 2023-03-24 2023-03-24 Target perception method, device, computer equipment and storage medium Active CN116012805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310295860.1A CN116012805B (en) 2023-03-24 2023-03-24 Target perception method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310295860.1A CN116012805B (en) 2023-03-24 2023-03-24 Target perception method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116012805A true CN116012805A (en) 2023-04-25
CN116012805B CN116012805B (en) 2023-08-29

Family

ID=86037668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310295860.1A Active CN116012805B (en) 2023-03-24 2023-03-24 Target perception method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116012805B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230023046A1 (en) * 2019-12-16 2023-01-26 Changsha Intelligent Driving Institute Corp., Ltd Method and device for generating vehicle panoramic surround view image
JP2021163401A (en) * 2020-04-03 2021-10-11 戸田建設株式会社 Person detection system, person detection program, leaned model generation program and learned model
US20220153315A1 (en) * 2020-11-17 2022-05-19 Uatc, Llc Systems and Methods for Actor Motion Forecasting within a Surrounding Environment of an Autonomous Vehicle
CN115470884A (en) * 2021-06-11 2022-12-13 哲晰公司 Platform for perception system development of an autopilot system
CN114723955A (en) * 2022-03-30 2022-07-08 上海人工智能创新中心 Image processing method, device, equipment and computer readable storage medium
CN114913506A (en) * 2022-05-18 2022-08-16 北京地平线机器人技术研发有限公司 3D target detection method and device based on multi-view fusion
CN115100616A (en) * 2022-06-23 2022-09-23 重庆长安汽车股份有限公司 Point cloud target detection method and device, electronic equipment and storage medium
CN115144868A (en) * 2022-06-27 2022-10-04 广州巨湾技研有限公司 Perception and navigation positioning fusion method suitable for end-to-end automatic driving
CN115063768A (en) * 2022-07-11 2022-09-16 阿里巴巴达摩院(杭州)科技有限公司 Three-dimensional target detection method, encoder and decoder
CN115346183A (en) * 2022-08-11 2022-11-15 合众新能源汽车有限公司 Lane line detection method, terminal and storage medium
CN115116049A (en) * 2022-08-29 2022-09-27 苏州魔视智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN115565154A (en) * 2022-09-19 2023-01-03 九识(苏州)智能科技有限公司 Feasible region prediction method, device, system and storage medium
CN115578705A (en) * 2022-10-21 2023-01-06 北京易航远智科技有限公司 Aerial view feature generation method based on multi-modal fusion
CN115588175A (en) * 2022-10-21 2023-01-10 北京易航远智科技有限公司 Aerial view characteristic generation method based on vehicle-mounted all-around image

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740681A (en) * 2023-08-10 2023-09-12 小米汽车科技有限公司 Target detection method, device, vehicle and storage medium
CN116740681B (en) * 2023-08-10 2023-11-21 小米汽车科技有限公司 Target detection method, device, vehicle and storage medium

Also Published As

Publication number Publication date
CN116012805B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
CN113819890B (en) Distance measuring method, distance measuring device, electronic equipment and storage medium
WO2022165809A1 (en) Method and apparatus for training deep learning model
CN111640180B (en) Three-dimensional reconstruction method and device and terminal equipment
Panek et al. Meshloc: Mesh-based visual localization
US11544898B2 (en) Method, computer device and storage medium for real-time urban scene reconstruction
CN116012805B (en) Target perception method, device, computer equipment and storage medium
CN116740668B (en) Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN115410167A (en) Target detection and semantic segmentation method, device, equipment and storage medium
CN114170290A (en) Image processing method and related equipment
CN116740669B (en) Multi-view image detection method, device, computer equipment and storage medium
CN114648639B (en) Target vehicle detection method, system and device
CN116469101A (en) Data labeling method, device, electronic equipment and storage medium
JP2024521816A (en) Unrestricted image stabilization
CN115240168A (en) Perception result obtaining method and device, computer equipment and storage medium
CN115861316B (en) Training method and device for pedestrian detection model and pedestrian detection method
CN116758517B (en) Three-dimensional target detection method and device based on multi-view image and computer equipment
CN116740681B (en) Target detection method, device, vehicle and storage medium
CN116597074A (en) Method, system, device and medium for multi-sensor information fusion
CN117456493A (en) Target detection method, device, computer equipment and storage medium
CN118314542A (en) Object detection model processing method, device, vehicle-mounted controller, readable storage medium and program product
CN118279855A (en) Road drivable region detection method, apparatus, computer device, and storage medium
CN116912791A (en) Target detection method, device, computer equipment and storage medium
CN117994614A (en) Target detection method and device
CN118247420A (en) Green plant and live-action fusion reconstruction method and device, electronic equipment and storage medium
CN117636283A (en) Method for estimating occupancy rate of time sequence-containing grid based on surrounding view angle representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Floor 25, Block A, Zhongzhou Binhai Commercial Center Phase II, No. 9285, Binhe Boulevard, Shangsha Community, Shatou Street, Futian District, Shenzhen, Guangdong 518000

Patentee after: Shenzhen Youjia Innovation Technology Co.,Ltd.

Country or region after: China

Address before: 518048 401, building 1, Shenzhen new generation industrial park, No. 136, Zhongkang Road, Meidu community, Meilin street, Futian District, Shenzhen, Guangdong Province

Patentee before: SHENZHEN MINIEYE INNOVATION TECHNOLOGY Co.,Ltd.

Country or region before: China