CN115641581A - Target detection method, electronic device, and storage medium


Info

Publication number
CN115641581A
Authority
CN
China
Prior art keywords
dimensional
feature map
depth
matrix
semantic feature
Prior art date
Legal status
Pending
Application number
CN202211214510.XA
Other languages
Chinese (zh)
Inventor
周鸿宇
葛政
Current Assignee
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202211214510.XA
Publication of CN115641581A

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a target detection method, an electronic device, and a storage medium. The method comprises: extracting semantic features of images in an image group to obtain a two-dimensional semantic feature map, wherein the image group comprises two-dimensional images acquired by a plurality of cameras; performing depth information estimation on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map; performing feature compression on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and performing feature compression on the two-dimensional depth feature map to obtain a one-dimensional depth feature map; generating a two-dimensional bird's-eye view feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map; and performing target detection based on the two-dimensional bird's-eye view feature map.

Description

Target detection method, electronic device and storage medium
Technical Field
The present disclosure relates to the field of machine vision technologies, and in particular, to a target detection method, an electronic device, and a storage medium.
Background
With the continuous development of machine vision technology, 3D target detection is widely applied in the fields of autonomous driving and robotics. Taking autonomous driving as an example, an autonomous vehicle not only needs to identify the type of an obstacle, but also needs to identify its precise position and orientation so that the planning and control module can plan a reasonable route. 3D target detection aims to detect objects such as vehicles, pedestrians, and obstacles from multi-sensor data such as cameras, radars, and lidars, so that the autonomous vehicle can drive safely.
At present, 3D target detection under the multi-camera surround-view bird's-eye view has developed rapidly owing to its high performance and its support for fusing multiple tasks such as target detection, target segmentation, and lane line detection. Compared with traditional detectors represented by FCOS3D, its main improvement is that image features are converted from the camera view to the bird's-eye view, and the fusion of multi-view features is carried out simultaneously in this step.
In the related art, when image features are converted from the camera view to the bird's-eye view, a data-driven conversion scheme represented by BEVFormer implements implicit feature conversion by training a neural network model such as a Transformer model. However, because a huge neural network model is used, this kind of method requires a large amount of training data and occupies a large amount of video memory, so the deployment cost on a vehicle-mounted chip is high.
Disclosure of Invention
The embodiment of the application provides a target detection method, electronic equipment and a storage medium, so as to solve the technical problem that the deployment cost of a 3D target detection technology on a vehicle-mounted chip is high.
According to a first aspect of the present application, a method of object detection is disclosed, the method comprising:
extracting semantic features of images in an image group to obtain a two-dimensional semantic feature map, wherein the image group comprises two-dimensional images collected by a plurality of cameras;
carrying out depth information estimation on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map;
performing feature compression on the two-dimensional semantic feature map on the height dimension of the feature map to obtain a one-dimensional semantic feature map, and performing feature compression on the two-dimensional depth feature map to obtain a one-dimensional depth feature map;
generating a two-dimensional bird's-eye view feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map;
and performing target detection based on the two-dimensional bird's-eye view feature map.
According to a second aspect of the present application, there is disclosed an object detection apparatus, the apparatus comprising:
the extraction module is used for extracting semantic features of images in the image group to obtain a two-dimensional semantic feature map, wherein the image group comprises two-dimensional images acquired by a plurality of cameras;
the estimation module is used for carrying out depth information estimation on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map;
the compression module is used for performing feature compression on the two-dimensional semantic feature map on the height dimension of the feature map to obtain a one-dimensional semantic feature map, and performing feature compression on the two-dimensional depth feature map to obtain a one-dimensional depth feature map;
the generating module is used for generating a two-dimensional bird's-eye view feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map;
and the detection module is used for performing target detection based on the two-dimensional bird's-eye view feature map.
According to a third aspect of the present application, an electronic device is disclosed, comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the object detection method as in the first aspect.
According to a fourth aspect of the present application, a computer-readable storage medium is disclosed, having stored thereon a computer program/instructions which, when executed by a processor, implement the object detection method as in the first aspect.
According to a fifth aspect of the present application, a computer program product is disclosed, comprising computer programs/instructions which, when executed by a processor, implement the object detection method as in the first aspect.
In the embodiments of the present application, semantic features of the images in an image group are extracted to obtain a two-dimensional semantic feature map; depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map; feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map; a two-dimensional bird's-eye view feature map is generated based on the one-dimensional semantic feature map and the one-dimensional depth feature map, and target detection is performed based on the two-dimensional bird's-eye view feature map.
Compared with the prior art, in the embodiments of the present application, in the process of converting image features from the camera view to the bird's-eye view, considering that a two-dimensional image has information redundancy in the height dimension, the two-dimensional image features and the two-dimensional depth information under the camera view can be compressed into one-dimensional image features and one-dimensional depth information along the height dimension, and the two-dimensional bird's-eye view feature map for 3D target detection is generated based on the one-dimensional image features and the one-dimensional depth information. This greatly reduces the amount of data in the conversion process, and thus reduces video memory occupation and the deployment cost on vehicle-mounted chips.
Drawings
FIG. 1 is a flowchart of a target detection method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a feature compression process provided by an embodiment of the present application;
FIG. 3 is a second flowchart of a target detection method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a process of generating a two-dimensional bird's-eye view feature map provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a target detection system provided by an embodiment of the present application;
FIG. 6 is a third flowchart of a target detection method provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has been advanced significantly. Artificial Intelligence (AI) is an emerging scientific technology for studying and developing theories, methods, techniques and application systems for simulating and extending human Intelligence. The artificial intelligence subject is a comprehensive subject and relates to various technical categories such as chips, big data, cloud computing, internet of things, distributed storage, deep learning, machine learning and neural networks. Computer vision is used as an important branch of artificial intelligence, specifically, a machine is used for identifying the world, and computer vision technologies generally comprise technologies such as face identification, living body detection, fingerprint identification and anti-counterfeiting verification, biological feature identification, face detection, pedestrian detection, target detection, pedestrian identification, image processing, image identification, image semantic understanding, image retrieval, character identification, video processing, video content identification, three-dimensional reconstruction, virtual reality, augmented reality, synchronous positioning and map construction (SLAM), computational photography, robot navigation and positioning and the like. With the research and progress of artificial intelligence technology, the technology is applied to many fields, such as safety precaution, city management, traffic management, building management, park management, face passage, face attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone images, cloud services, smart homes, wearable equipment, unmanned driving, automatic driving, intelligent medical treatment, face payment, face unlocking, fingerprint unlocking, personal identification verification, smart screens, smart televisions, cameras, mobile internet, live webcasts, beauty cosmetics, medical beauty treatment, intelligent temperature measurement and the like.
Taking autonomous driving as an example, 3D target detection technology is widely applied in this field. Among the various sensor solutions, camera-based 3D target detection has gradually attracted a lot of attention due to its low cost. Within the camera solutions, detection under the multi-view (that is, multi-camera) surround-view Bird's Eye View (BEV), referred to as surround-view BEV detection for short, has recently developed at a high speed due to its high performance and its support for fusing multiple tasks such as target detection, target segmentation, and lane line detection. Compared with traditional detectors represented by FCOS3D, its main improvement is that image features are converted from the camera view to the bird's-eye view, and the fusion of multi-view features is carried out simultaneously in this step.
In the related art, the look-around BEV detection technology mainly has two types of feature conversion schemes:
one type of method is a data-driven conversion scheme represented by BEVFormer, and implicit feature conversion is realized by training a neural network model such as a transform model, and because a huge neural network model is used, a large amount of training data is required and a large amount of video memory space is occupied, the deployment cost on a vehicle-mounted chip is high.
The other method projects two-dimensional semantic features to a three-dimensional space by using depth information obtained by prediction, and then integrates and extracts the three-dimensional features in a sampling or pooling mode to obtain two-dimensional aerial view angle features, which represent BEVDet. Although the calculation amount is reduced by the method, the operation adopted in the three-dimensional feature extraction and integration process is usually complex, and more video memory space is occupied, so that the deployment cost on the vehicle-mounted chip is high.
Therefore, in the prior art, the existing looking-around 3D target detection algorithm either introduces a complex model or introduces an operator which is not favorable for deployment in the visual angle conversion process, and occupies more video memory space, so that the mass production deployment on various vehicle-mounted chips cannot be realized.
In order to solve the above technical problems, embodiments of the present application provide a target detection method, an electronic device, and a storage medium, in which view conversion and multi-view feature fusion are implemented in a simple and efficient manner, so that surround-view BEV detection becomes efficient and easy to deploy without reducing performance; here, the multi-view feature fusion is completed within the view conversion process.
For ease of understanding, some concepts referred to in the embodiments of the present application will be first introduced.
Bird's Eye View (BEV), also known as the "God's-eye view", is a viewing angle or coordinate system used to describe the perceived world. BEV is also used to refer to an end-to-end technique in the computer vision field that converts visual information from image space to BEV space through a neural network.
The internal reference (intrinsic) information of a camera is a matrix used for converting from the camera coordinate system to the pixel coordinate system. Generally speaking, the internal reference of a camera is fixed when the camera leaves the factory and can be calculated by a camera calibration method.
The external reference (extrinsic) information of a camera is determined by its pose (that is, its mounting position and orientation); different poses correspond to different external reference information, which is used for converting from the world coordinate system to the camera coordinate system.
Taking a driving scene as an example, the plurality of cameras in the embodiments of the present application are generally arranged around the vehicle body, for example, 6 cameras arranged around the vehicle body, and may therefore also be referred to as surround-view cameras. The external reference information of each of the plurality of cameras is different, while the internal reference information may be the same or different.
Depth information estimation refers to obtaining the distance (also referred to as depth) from each point of the scene shown in an image to the camera.
Next, a target detection method provided in an embodiment of the present application is described.
Fig. 1 is a flowchart of an object detection method provided in an embodiment of the present application, and as shown in fig. 1, the method may include the following steps: step 101, step 102, step 103, step 104 and step 105;
in step 101, semantic features of images in an image group are extracted to obtain a two-dimensional semantic feature map, wherein the image group includes two-dimensional images acquired by a plurality of cameras.
In the embodiments of the present application, one of the common image classification networks such as AlexNet, ResNet, VGG, and MobileNet may be used to extract the semantic features of the images in the image group, where the semantic features may be high-level semantic features. The two-dimensional semantic feature map may be represented in tensor form.
In the embodiment of the present application, each image in the image group may be an image in RGB format, or may also be an image in another format, for example, an image in YUV format, which is not limited in this application.
In one example, the image group includes 6 images respectively collected by 6 cameras, and for convenience of description, the shape of the image group is expressed as 6 × 3 × 256 × 704 in a tensor form, where 6 represents the number of images, 3 represents the number of color channels, 256 represents the height of the images, and 704 represents the width of the images.
Semantic features of the images in the image group are then extracted to obtain a two-dimensional semantic feature map with shape 6 × 256 × 16 × 44, where 6 represents the number of images, 256 represents the number of feature channels, 16 represents the height of the feature map (the original image height downsampled by a factor of 16), and 44 represents the width of the feature map (the original image width downsampled by a factor of 16).
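For illustration, the following is a minimal sketch (in PyTorch, which the patent does not prescribe) of how such a two-dimensional semantic feature map could be extracted with a ResNet-50 backbone truncated at its stride-16 stage; the 1 × 1 projection to 256 channels and all layer choices are assumptions, not the patent's exact network.

```python
# Minimal sketch: a stride-16 ResNet-50 backbone producing a 6 x 256 x 16 x 44 semantic feature map.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
# Keep layers up to the stride-16 stage (conv1 .. layer3, which outputs 1024 channels at 1/16 scale).
stem = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,
)
reduce_dim = torch.nn.Conv2d(1024, 256, kernel_size=1)  # project to 256 feature channels

images = torch.randn(6, 3, 256, 704)        # image group: 6 cameras, RGB, 256 x 704
semantic_2d = reduce_dim(stem(images))      # -> (6, 256, 16, 44)
print(semantic_2d.shape)
```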
In step 102, depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map.
In the embodiment of the application, the size of the two-dimensional semantic feature map may be the same as that of the two-dimensional depth feature map, and at this time, depth information estimation may be performed on each point in the two-dimensional semantic feature map to obtain the two-dimensional depth feature map. Alternatively, the size of the two-dimensional semantic feature map may be different from the size of the two-dimensional depth feature map.
In some embodiments, in order to increase the speed of depth estimation and reduce its computation amount, depth information estimation may be performed only according to the two-dimensional semantic feature map, so as to obtain the two-dimensional depth feature map.
In some embodiments, in order to improve the accuracy of depth estimation, physical depth information of a corresponding position may be estimated from a two-dimensional semantic feature map by using camera internal reference information, and accordingly, the step 102 includes the following steps: step 1021;
in step 1021, the internal reference information of the multiple cameras is used as auxiliary reference information, and the corresponding depth information is estimated from the two-dimensional semantic feature map to obtain a two-dimensional depth feature map.
In the embodiments of the present application, a depth estimation network may be constructed from one or more convolution layers and fully-connected operators, and the processing procedure of step 1021 is implemented by the constructed depth estimation network.
Therefore, in the embodiment of the application, the depth information of the two-dimensional semantic feature map can be estimated in various ways to obtain the two-dimensional depth feature map. The user can select which depth information estimation mode to adopt according to actual conditions and self requirements, and diversified requirements of the user are met.
In one example, the shape of the two-dimensional semantic feature map is 6 × 256 × 16 × 44, and depth information estimation is performed on it to obtain a two-dimensional depth feature map with shape 6 × 112 × 16 × 44, where 6 represents the number of images, 112 represents the number of depth channels, 16 represents the height of the feature map, and 44 represents the width of the feature map.
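A possible form of such a depth estimation network is sketched below; the number of layers, the 112 depth bins, and the way the flattened 3 × 3 intrinsic matrix is injected through a fully-connected layer are illustrative assumptions rather than the patent's exact design.

```python
# Minimal sketch: a convolutional depth head that uses the camera intrinsics as auxiliary information.
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    def __init__(self, in_channels=256, depth_bins=112):
        super().__init__()
        self.intrin_fc = nn.Linear(9, in_channels)   # flattened 3x3 intrinsics -> per-channel bias
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, depth_bins, 1),
        )

    def forward(self, feat, intrinsics):
        # feat: (N, 256, 16, 44); intrinsics: (N, 3, 3), one matrix per camera
        bias = self.intrin_fc(intrinsics.flatten(1))[:, :, None, None]
        logits = self.conv(feat + bias)              # (N, 112, 16, 44)
        return logits.softmax(dim=1)                 # per-pixel depth distribution

depth_2d = DepthHead()(torch.randn(6, 256, 16, 44), torch.eye(3).repeat(6, 1, 1))
print(depth_2d.shape)  # torch.Size([6, 112, 16, 44])
```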
In step 103, feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map.
In the embodiments of the present application, compared with the two-dimensional semantic feature map, the height dimension of the one-dimensional semantic feature map is collapsed to 1; similarly, the height dimension of the one-dimensional depth feature map is collapsed to 1 compared with the two-dimensional depth feature map.
Considering that in an autonomous driving scenario it is unlikely for multiple objects to appear at different heights in the same direction in one image (for example, a vehicle ahead on the left with another vehicle stacked on top of it), the image can be regarded as containing redundant information in the height dimension, and the height dimension can therefore be reduced from two dimensions to one dimension.
In some embodiments, a feature compression process is provided as shown in FIG. 2, where "H" in FIG. 2 represents the height dimension, "W" represents the width dimension, and "C" and "D" represent the feature channel dimension and the depth dimension, respectively. Accordingly, the above step 103 includes the following steps: step 1031 and step 1032;
in step 1031, maximum pooling processing is performed on the two-dimensional semantic feature map in the height dimension to obtain a one-dimensional semantic feature map.
In the embodiment of the application, a one-dimensional semantic feature map can be obtained by adopting a convolution operation mode.
In step 1032, a weight is predicted for the depth information corresponding to each point of the two-dimensional semantic feature map; a dot product operation is performed between the depth information of each point in the two-dimensional depth feature map and the weight at the corresponding position, and the results are summed over the height dimension to obtain the one-dimensional depth feature map.
In the embodiment of the application, when the weight of the depth information is predicted, the two-dimensional semantic feature map can be convolved through the existing convolution network to reduce the feature dimension to 1, so that the weight of the depth information is obtained.
Considering that the core of 3D target detection is detecting the target objects in an image rather than the background, the weight of the depth at target-object positions in the two-dimensional semantic feature map is high and the weight at background positions is low. Therefore, in the embodiments of the present application, the depth information of each point in the two-dimensional depth feature map can be multiplied point-wise by the weight at the corresponding position and summed over the height dimension to obtain the one-dimensional depth feature map; the resulting one-dimensional depth feature map then mainly retains the depth of target objects rather than that of the background, so the dimension reduction does not cause the loss of key information.
In one example, the shape of the two-dimensional semantic feature map is 6 × 256 × 16 × 44 and the shape of the two-dimensional depth feature map is 6 × 112 × 16 × 44. The two-dimensional semantic feature map is convolved to reduce the feature dimension to 1, yielding depth information weights of shape 6 × 1 × 16 × 44; the two-dimensional depth feature map is multiplied point-wise by these weights and summed over the height dimension to obtain the one-dimensional depth feature map of shape 6 × 112 × 44, in which the height dimension no longer exists. Maximum pooling of the two-dimensional semantic feature map over the height dimension gives the one-dimensional semantic feature map of shape 6 × 256 × 44, in which the height dimension likewise no longer exists.
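The compression of this example can be sketched as follows; the single 1 × 1 convolution used to predict the weights and the softmax normalization over height are assumptions, since the patent only requires that a weight be predicted per point and that the depth be weighted and summed over the height dimension.

```python
# Minimal sketch of steps 1031/1032: max-pool the semantic map over height, and collapse the
# depth map with predicted per-pixel weights.
import torch
import torch.nn as nn

semantic_2d = torch.randn(6, 256, 16, 44)   # two-dimensional semantic feature map
depth_2d = torch.randn(6, 112, 16, 44)      # two-dimensional depth feature map

# Step 1031: maximum pooling over the height dimension.
semantic_1d = semantic_2d.max(dim=2).values             # (6, 256, 44)

# Step 1032: predict a weight for each point, then weighted-sum the depth over height.
weight_net = nn.Conv2d(256, 1, kernel_size=1)
weights = weight_net(semantic_2d).softmax(dim=2)        # (6, 1, 16, 44), normalized over height
depth_1d = (depth_2d * weights).sum(dim=2)              # (6, 112, 44)

print(semantic_1d.shape, depth_1d.shape)
```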
In the embodiment of the application, the two-dimensional semantic feature map is compressed into the one-dimensional semantic feature map, and the two-dimensional depth feature map is compressed into the one-dimensional depth feature map, so that the data volume in the conversion process can be greatly reduced, and the occupation of a video memory is reduced.
In step 104, a two-dimensional bird's-eye view angle feature map is generated based on the one-dimensional semantic feature map and the one-dimensional depth feature map.
In the embodiments of the present application, in order to reduce the amount of calculation and the video memory occupation, the two-dimensional bird's-eye view feature map can be generated from the one-dimensional semantic feature map and the one-dimensional depth feature map by matrix operations. Alternatively, in order to reduce research and development cost, shorten the development cycle, and improve the reuse rate of existing technology, the one-dimensional semantic feature map and the one-dimensional depth feature map may be processed by an existing computing network or module to generate the two-dimensional bird's-eye view feature map.
Therefore, in the embodiments of the present application, the two-dimensional bird's-eye view feature map can be generated in various ways based on the one-dimensional semantic feature map and the one-dimensional depth feature map. The user can select which conversion mode to adopt according to the actual situation and their own needs, which satisfies diversified user requirements.
In step 105, target detection is performed based on the two-dimensional bird's eye view angle feature map.
In the embodiments of the present application, the two-dimensional bird's-eye view feature map generated in step 104 may be used to directly predict the 3D target detection result. For example, the two-dimensional bird's-eye view feature map is convolved by an existing target detection network, such as Fbev, to obtain the 3D target detection result, where the target detection network may consist of one or more convolution layers.
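A minimal sketch of such a convolutional detection head is given below; the patent does not specify the structure of "Fbev", so the shared convolution, the class branch, and the 9-value box branch are placeholder assumptions.

```python
# Minimal sketch: a purely convolutional head that predicts 3D detections from the BEV feature map.
import torch.nn as nn

class BEVDetectionHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=10, box_dims=9):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True))
        self.cls = nn.Conv2d(in_channels, num_classes, 1)   # per-cell class scores
        self.box = nn.Conv2d(in_channels, box_dims, 1)      # e.g. x, y, z, w, l, h, yaw, vx, vy

    def forward(self, bev):                                 # bev: (B, 256, 128, 128)
        f = self.shared(bev)
        return self.cls(f), self.box(f)
```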
As can be seen from the above, in this embodiment, semantic features of the images in an image group are extracted to obtain a two-dimensional semantic feature map; depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map; feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map; a two-dimensional bird's-eye view feature map is generated based on the one-dimensional semantic feature map and the one-dimensional depth feature map, and target detection is performed based on the two-dimensional bird's-eye view feature map.
Compared with the prior art, in this embodiment, in the process of converting image features from the camera view to the bird's-eye view, considering that a two-dimensional image has information redundancy in the height dimension, the two-dimensional image features and the two-dimensional depth information under the camera view can be compressed into one-dimensional image features and one-dimensional depth information along the height dimension, and the two-dimensional bird's-eye view feature map for 3D target detection is generated based on the one-dimensional image features and the one-dimensional depth information. This greatly reduces the amount of data in the conversion process, and thus reduces video memory occupation and the deployment cost on vehicle-mounted chips.
Fig. 3 is a second flowchart of the target detection method provided in the embodiment of the present application, a two-dimensional bird's-eye view angle feature map may be generated based on a one-dimensional semantic feature map and a one-dimensional depth feature map in a matrix operation manner, and as shown in fig. 3, the method may include the following steps: step 301, step 302, step 303, step 304, and step 305;
in step 301, semantic features of images in the image group are extracted to obtain a two-dimensional semantic feature map, where the image group includes two-dimensional images acquired by a plurality of cameras.
In step 302, depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map.
In step 303, feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map.
Steps 301 to 303 in the embodiment of the present application are similar to steps 101 to 103 in the embodiment shown in fig. 1, and are not repeated here, and details refer to corresponding contents in the embodiment shown in fig. 1.
When feature conversion is performed by matrix operations on two-dimensional features, the matrices in the intermediate conversion process are very large and occupy a large amount of video memory. Therefore, in the embodiments of the present application, in order to reduce the amount of matrix data, the two-dimensional features can be reduced to one-dimensional features; using one-dimensional features keeps the intermediate data very small, which makes the matrix-based feature conversion method feasible.
In step 304, based on the static transformation matrix and taking the depth information in the one-dimensional depth feature map as guidance, the features in the one-dimensional semantic feature map are projected into the two-dimensional bird's-eye view feature map.
Since the image coordinate system, combined with depth, can be regarded as a polar coordinate system with a distance dimension and a direction dimension, while the bird's-eye view feature map is in a Cartesian coordinate system, a conversion between coordinate systems is involved when the one-dimensional features are converted into the two-dimensional bird's-eye view features.
In the embodiment of the application, a static transformation matrix is generated based on internal reference information and external reference information of a plurality of cameras, the static transformation matrix comprises coordinate transformation information between a polar coordinate system and a Cartesian coordinate system, and the coordinate transformation information is used for transforming semantic features and depth information under the polar coordinate system into two-dimensional bird's-eye view angle features under the Cartesian coordinate system.
In the embodiment of the present application, the size of the two-dimensional bird's-eye view angle feature map and the number of feature channels may be predefined, so that the features in the one-dimensional semantic feature map are projected onto the two-dimensional bird's-eye view angle feature map.
In this embodiment, the static transformation matrix may include: the system comprises a ring matrix and a ray matrix, wherein the ring matrix comprises depth information of a plurality of cameras, and the ray matrix comprises direction information of the plurality of cameras. The ring matrix has no direction information, and the ray matrix has direction information, so that the ring matrix and the ray matrix are required to be matched for use.
The two matrixes are generated only during the first calculation, and in the subsequent calculation, if the internal and external parameters of the camera are not changed, the matrixes do not need to be generated again.
In the embodiment of the application, in an automatic driving scene, the ring matrix can be regarded as a group of concentric rings with the vehicle as the center, and the ray matrix can be regarded as a group of rays with directions, which are emitted to the concentric rings with the vehicle as the center.
In some embodiments, the generation process of the ring matrix may include the steps of: step 401, step 402 and step 403;
in step 401, a first initial matrix with a size of L × L is constructed, and M depth values d are set, where L is the size of the two-dimensional bird's-eye view angle feature map, M is the number of channels of depth information in the one-dimensional semantic feature map, and both L and M are integers greater than 1.
In step 402, for each depth value d, at each position of the first initial matrix, if the depth from the actual coordinate corresponding to the position to the camera is equal to d, setting the value of the position to 1, otherwise, setting the value to 0, and obtaining M matrices with the size of L × L;
in step 403, M matrices with size L × L are superimposed to obtain a ring matrix.
In one example, the shape of the one-dimensional semantic feature map is 6 × 256 × 44, the shape of the one-dimensional depth feature map is 6 × 112 × 44, and the shape of the ring matrix is 112 × 16384. It is generated as follows: for each of the 112 depth values d, at each position of a 128 × 128 matrix, if the depth from the actual coordinate corresponding to that position to the camera is equal to d, the value at that position is set to 1, otherwise it is set to 0. In this way, 112 matrices of size 128 × 128 are generated, which are stacked and reshaped to 112 × 16384.
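A minimal sketch of this ring-matrix construction is given below; the 0.8 m cell size, the depth-bin spacing, and the half-bin tolerance are placeholder assumptions, since the example in the patent only fixes the 128 × 128 grid and the 112 depth bins.

```python
# Minimal sketch of steps 401-403: binary rings of equal depth on a BEV grid centred on the ego vehicle.
import torch

def build_ring_matrix(L=128, num_depth=112, cell_size=0.8, depth_start=1.0, depth_step=0.5):
    # Metric (x, y) coordinate of every BEV cell, ego vehicle at the grid centre.
    coords = (torch.arange(L, dtype=torch.float32) - (L - 1) / 2) * cell_size
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    dist = torch.sqrt(xx ** 2 + yy ** 2)                         # (L, L) distance to the ego centre

    depths = depth_start + depth_step * torch.arange(num_depth)  # 112 depth bin centres
    # A cell belongs to bin d if its distance is within half a bin width of the bin centre.
    ring = (dist[None] - depths[:, None, None]).abs() < depth_step / 2
    return ring.float().reshape(num_depth, L * L)                # (112, 16384)

ring_matrix = build_ring_matrix()
print(ring_matrix.shape)  # torch.Size([112, 16384])
```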
In some embodiments, the process of generating the ray matrix may include the steps of: step 501, step 502 and step 503;
in step 501, a second initial matrix of size L x L is constructed.
In step 502, for each view direction of each camera, at each position of the second initial matrix, if the position is in the view direction, setting the value of the position to 1, otherwise, setting the value to 0, and obtaining N matrices with the size of L × L, where N = S × T, S is the number of cameras, T is the number of view directions of a single camera, and S and T are both integers greater than 1.
In step 503, N matrices with size L × L are superimposed to obtain a matrix of rays.
In one example, the shape of the one-dimensional semantic feature map is 6 × 256 × 44, the shape of the one-dimensional depth feature map is 6 × 112 × 44, and the shape of the ray matrix is 264 × 16384. It is generated as follows: for each column vector c of the one-dimensional semantic feature maps of the 6 cameras (each camera has 44 column vectors, and each column vector corresponds to one viewing direction), at each position of a 128 × 128 matrix, if the position lies on the line of sight corresponding to column vector c, the value at that position is set to 1, otherwise it is set to 0. In this way, 44 × 6 = 264 matrices of size 128 × 128 are generated, which are stacked and reshaped to 264 × 16384.
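A minimal sketch of this ray-matrix construction is given below; the evenly spaced camera yaws, the 70° field of view, and the angular tolerance used to decide whether a cell lies on a line of sight are placeholder assumptions, since a real system would derive the rays from the camera internal and external reference information.

```python
# Minimal sketch of steps 501-503: one binary line-of-sight mask per (camera, image column) pair.
import math
import torch

def build_ray_matrix(L=128, cell_size=0.8, num_cams=6, num_cols=44, fov_deg=70.0):
    coords = (torch.arange(L, dtype=torch.float32) - (L - 1) / 2) * cell_size
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    cell_azimuth = torch.atan2(yy, xx)                            # (L, L) azimuth of each BEV cell

    cam_yaws = [i * 2 * math.pi / num_cams for i in range(num_cams)]   # placeholder camera yaws
    half_fov = math.radians(fov_deg) / 2
    col_offsets = torch.linspace(-half_fov, half_fov, num_cols)        # one azimuth per image column

    masks = []
    for yaw in cam_yaws:
        for offset in col_offsets:
            ray_azimuth = yaw + offset
            diff = torch.atan2(torch.sin(cell_azimuth - ray_azimuth),
                               torch.cos(cell_azimuth - ray_azimuth)).abs()
            masks.append((diff < half_fov * 2 / num_cols).float())     # cells on this line of sight
    return torch.stack(masks).reshape(num_cams * num_cols, L * L)      # (264, 16384)

ray_matrix = build_ray_matrix()
print(ray_matrix.shape)  # torch.Size([264, 16384])
```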
It should be noted that, besides the way of generating the ring matrix and the ray matrix in the above description, a scheme of generating the static conversion matrix by using other methods and using the static conversion matrix for the same or equivalent matrix multiplication operation is also within the protection scope of the present application.
In some embodiments, feature conversion is implemented by using the ring matrix and the ray matrix together. Specifically, the ring matrix is cross-multiplied with the one-dimensional depth feature map (in order to embed the depth information of the images in the image group into the ring matrix); the result is then dot-multiplied with the ray matrix (in order to add direction, because the ring matrix contains only depth information while the ray matrix contains direction information); and finally the result is cross-multiplied with the one-dimensional semantic feature map to obtain the two-dimensional bird's-eye view feature map. Accordingly, step 304 includes the following steps: step 3041, step 3042, and step 3043;
in step 3041, a matrix cross-product operation is performed on the ring matrix and the depth information in the one-dimensional depth feature map to obtain an intermediate matrix.
In the embodiment of the application, matrix cross multiplication operation is performed on the ring matrix and the depth information in the one-dimensional depth feature map, which is essentially a weighted summation process of the ring matrix and the depth information in the one-dimensional depth feature map, and an intermediate matrix obtained after weighted summation is a ring matrix carrying the depth information of the image in the image group.
In step 3042, a matrix dot product operation is performed on the intermediate matrix and the ray matrix to obtain a projection matrix.
In the embodiment of the application, the intermediate matrix and the ray matrix are subjected to matrix dot product operation, which is essentially to perform a mask operation on the intermediate matrix and add direction information to the intermediate matrix, and the obtained projection matrix is a ring matrix carrying depth information and direction information.
In step 3043, a matrix cross-product operation is performed on the projection matrix and the features in the one-dimensional semantic feature map to obtain a two-dimensional bird's-eye view feature map.
In one example, the one-dimensional depth feature map is transposed to 6 × 44 × 112 and reshaped to 264 × 112, and the one-dimensional semantic feature map is transposed to 256 × 6 × 44 and reshaped to 256 × 264. The reshaped depth feature map is cross-multiplied with the ring matrix to obtain the intermediate matrix, whose shape is 264 × 16384; the intermediate matrix is dot-multiplied with the ray matrix to obtain the projection matrix, whose shape is 264 × 16384; and the reshaped semantic feature map is cross-multiplied with the projection matrix to obtain the two-dimensional bird's-eye view feature map, whose shape is 256 × 16384 and which is reshaped to 256 × 128 × 128, that is, it can be represented as a two-dimensional feature.
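The whole pipeline of this example can be written as three matrix operations; the sketch below uses random tensors in place of real features and assumes the ring and ray matrices have been precomputed as described above.

```python
# Minimal sketch of steps 3041-3043 with the shapes of this example
# (6 cameras, 44 columns, 112 depth bins, 256 channels, 128 x 128 BEV grid).
import torch

semantic_1d = torch.randn(6, 256, 44)        # one-dimensional semantic feature map
depth_1d = torch.randn(6, 112, 44)           # one-dimensional depth feature map
ring_matrix = torch.rand(112, 16384)         # (D, L*L), precomputed
ray_matrix = torch.rand(264, 16384)          # (W, L*L), W = 6 cameras x 44 columns

depth_flat = depth_1d.permute(0, 2, 1).reshape(264, 112)        # (W, D)
semantic_flat = semantic_1d.permute(1, 0, 2).reshape(256, 264)  # (C, W)

intermediate = depth_flat @ ring_matrix      # (264, 16384): rings weighted by depth
projection = intermediate * ray_matrix       # (264, 16384): keep only each column's line of sight
bev = semantic_flat @ projection             # (256, 16384): features splatted onto the BEV grid
bev_map = bev.reshape(1, 256, 128, 128)      # two-dimensional bird's-eye view feature map

print(bev_map.shape)
```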
For ease of understanding, the description of the feature conversion process is made in conjunction with the schematic diagram shown in fig. 4.
As shown in FIG. 4, (1) is the ring matrix, whose shape is D × (L × L); (2) is the one-dimensional depth feature map, whose shape is W × D; (3) is the ray matrix, whose shape is W × (L × L); and (4) is the one-dimensional semantic feature map, whose shape is W × C. First, the ring matrix (1) and the one-dimensional depth feature map (2) are cross-multiplied to obtain the intermediate matrix (5), whose shape is W × (L × L). The intermediate matrix (5) and the ray matrix (3) are dot-multiplied to obtain the projection matrix (6), whose shape is W × (L × L). The projection matrix (6) and the one-dimensional semantic feature map (4) are then cross-multiplied to obtain the two-dimensional bird's-eye view feature (7), whose shape is C × (L × L), which is reshaped to obtain the two-dimensional bird's-eye view feature map (8).
In step 305, object detection is performed based on the two-dimensional bird's-eye view angle feature map.
On the premise of using ResNet-50 as a backbone network, the performance of the technical scheme of the present application and the prior art scheme on a NuScenes data set is compared, as shown in the following Table 1.
                     mAP     NDS     mATE    mASE    mAOE    mAVE    mAAE    Video memory occupation
Existing solutions   0.336   0.416   0.635   0.271   0.562   0.837   0.220   734 MB
This scheme          0.336   0.415   0.653   0.271   0.473   0.903   0.231   478 MB

TABLE 1
As can be seen from the comparison, the technical scheme of the present application is comparable to the state-of-the-art BEVDepth algorithm on the two key indicators mAP and NDS, while significantly reducing video memory occupation and being easier to deploy on chip devices.
As can be seen from the above, in this embodiment, semantic features of the images in an image group are extracted to obtain a two-dimensional semantic feature map; depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map; feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map; based on the static transformation matrix, and taking the depth information in the one-dimensional depth feature map as guidance, the features in the one-dimensional semantic feature map are projected into the two-dimensional bird's-eye view feature map; and target detection is performed based on the two-dimensional bird's-eye view feature map.
Compared with the prior art, in this embodiment, in the process of converting image features from the camera view to the bird's-eye view, considering that a two-dimensional image has information redundancy in the height dimension, the two-dimensional image features and the two-dimensional depth information under the camera view can be compressed into one-dimensional image features and one-dimensional depth information along the height dimension, and the two-dimensional bird's-eye view feature map for 3D target detection is generated based on the static transformation matrix, the one-dimensional image features, and the one-dimensional depth information, which greatly reduces the amount of data in the conversion process and thus reduces video memory occupation. In addition, only basic matrix multiplication and feature transposition operations are needed in the conversion process, and operators that are difficult to deploy or accelerate, such as feature sampling or pooling, are not used, so the scheme is easier to deploy on a vehicle-mounted chip.
Corresponding to the target detection method shown in fig. 3, an embodiment of the present application may further provide a target detection system, as shown in fig. 5, the target detection system may include the following modules: the system comprises a feature extraction module, a depth estimation module, a feature and depth compression module, a feature projection module and a target detection module; wherein the content of the first and second substances,
the feature extraction module is used for extracting high-level semantic features in an input image group, and the high-level semantic features are usually one of common image classification architectures such as AlexNet, resNet, VGG or MobileNet, the input of the feature extraction module is the image group, and the output of the feature extraction module is a two-dimensional semantic feature map.
The depth estimation module is used for estimating the physical depth of a corresponding position from the image characteristics by means of camera internal reference information, and consists of one or more convolution layers and a full-link operator, wherein the input of the module is a two-dimensional semantic characteristic map, and the output of the module is a two-dimensional depth characteristic map.
And the feature and depth compression module is used for performing convolution or height dimension pooling operation on the image feature information, predicting weights from the corresponding feature information on the depth information, performing weighted summation on the heights, inputting a two-dimensional semantic feature map and a two-dimensional depth feature map, and outputting a one-dimensional semantic feature map and a one-dimensional depth feature map.
And the feature projection module is used for projecting the features in the one-dimensional semantic feature map into the two-dimensional bird's-eye view angle feature map by using the pre-generated static transformation matrix and taking the depth in the one-dimensional depth feature map as a guide so as to form two-dimensional features under a bird's-eye view angle. Specifically, a static ring matrix and a one-dimensional depth feature map are subjected to matrix cross multiplication, the result of the matrix cross multiplication is subjected to matrix point multiplication with a static ray matrix, and finally the matrix cross multiplication is performed with the one-dimensional semantic feature map, so that the two-dimensional aerial view feature is obtained. The input of the two-dimensional bird's-eye view angle feature map is a one-dimensional semantic feature map, a one-dimensional depth feature map and a static transformation matrix, and the output of the two-dimensional bird's-eye view angle feature map is a two-dimensional bird's-eye view angle feature map.
And the target detection module is used for extracting the two-dimensional aerial view visual angle characteristics by one or more convolutions, inputting the two-dimensional aerial view visual angle characteristics as a two-dimensional aerial view visual angle characteristic map, and outputting a 3D detection result.
It can be seen that, in the embodiment of the present application, the following advantages may exist: the method has the advantages that the method is high in performance and easy to deploy, wherein the high performance is realized by utilizing one-dimensional semantic features and one-dimensional depth features to perform visual angle conversion, and compared with the existing method, the method obviously reduces video memory occupation and lowers deployment cost; the easy deployment is embodied in the absence of operators which are difficult to deploy or accelerate, such as feature sampling or pooling, and all modules are composed of basic matrix multiplication and feature transposition operations. In conclusion, the technical scheme of the application has wide application value in the fields of automatic driving and robots.
Fig. 6 is a third flowchart of the target detection method provided in the embodiment of the present application, the one-dimensional semantic feature map and the one-dimensional depth feature map may be matched with other existing computing networks or modules to implement significant reduction of video memory occupation, and as shown in fig. 6, the method may include the following steps: step 601, step 602, step 603, step 604 and step 605;
in step 601, semantic features of images in an image group are extracted to obtain a two-dimensional semantic feature map, where the image group includes two-dimensional images collected by multiple cameras.
In step 602, depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map.
In step 603, feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map.
Steps 601 to 603 in this embodiment are similar to steps 101 to 103 in the embodiment shown in FIG. 1 and are not described here again; for details, refer to the corresponding content in the embodiment shown in FIG. 1.
In step 604, a matrix outer product operation is performed on the features in the one-dimensional semantic feature map and the depth information in the one-dimensional depth feature map to obtain a one-dimensional bird's-eye view feature map, and feature sampling or voxel pooling is performed on the one-dimensional bird's-eye view feature map to generate the two-dimensional bird's-eye view feature map.
In the embodiments of the present application, the matrix outer product of the features in the one-dimensional semantic feature map and the depth information in the one-dimensional depth feature map yields a one-dimensional bird's-eye view feature map with a perspective effect, which is then processed by feature sampling (GridSample) or voxel pooling (VoxelPool) to obtain the two-dimensional bird's-eye view feature map.
In one example, the shape of the one-dimensional semantic feature map is 6 × 256 × 44 and the shape of the one-dimensional depth feature map is 6 × 112 × 44; their matrix outer product gives a one-dimensional bird's-eye view feature map of shape 6 × 256 × 112 × 44. Feature sampling or voxel pooling is performed on this feature map to obtain the two-dimensional bird's-eye view feature map, whose shape is 256 × 128 × 128, where 256 represents the number of feature channels and 128 represents the height and width of the feature map.
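A minimal sketch of the outer-product step of this example is given below; torch.einsum stands in for the matrix outer product, the softmax over depth bins is an assumption, and the final pooling step is only indicated in a comment because the patent leaves the choice of operator (GridSample or VoxelPool) open.

```python
# Minimal sketch of step 604: outer product of the one-dimensional semantic and depth features.
import torch

semantic_1d = torch.randn(6, 256, 44)                 # (N, C, W)
depth_1d = torch.randn(6, 112, 44).softmax(dim=1)     # (N, D, W), treated as depth distributions

# Outer product over the channel and depth dimensions for every camera/column.
frustum_1d = torch.einsum("ncw,ndw->ncdw", semantic_1d, depth_1d)   # (6, 256, 112, 44)
print(frustum_1d.shape)

# A real pipeline would now scatter each (camera, depth, column) cell to its BEV location,
# e.g. with a voxel-pooling or grid-sampling operator, producing a (256, 128, 128) BEV feature map.
```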
In step 605, object detection is performed based on the two-dimensional bird's eye view angle feature map.
As can be seen from the above, in this embodiment, semantic features of the images in an image group are extracted to obtain a two-dimensional semantic feature map; depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map; feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map; a matrix outer product operation is performed on the features in the one-dimensional semantic feature map and the depth information in the one-dimensional depth feature map to obtain a one-dimensional bird's-eye view feature map; feature sampling or voxel pooling is performed on the one-dimensional bird's-eye view feature map to generate a two-dimensional bird's-eye view feature map; and target detection is performed based on the two-dimensional bird's-eye view feature map.
Compared with the prior art, in this embodiment, in the process of converting image features from the camera view to the bird's-eye view, considering that the two-dimensional image features and depth information under the camera view are redundant in the height dimension, they can be compressed into one-dimensional image features and one-dimensional depth information along the height dimension, and the two-dimensional bird's-eye view feature map for 3D target detection is generated from the one-dimensional image features and one-dimensional depth information using operators such as feature sampling or voxel pooling. Since the conversion only compresses redundant information in the height dimension, no width-dimension information is lost, and converting two-dimensional information into one-dimensional information greatly reduces the amount of data in the conversion process; therefore, video memory occupation can be reduced while the accuracy of the feature conversion result is guaranteed, which lowers the deployment cost on vehicle-mounted chips. In addition, this conversion process can work with other existing computing networks or modules, which can reduce research and development cost, shorten the development cycle, and improve the reuse rate of the technology.
Fig. 7 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application, and as shown in fig. 7, an object detection apparatus 700 may include: an extraction module 701, an estimation module 702, a compression module 703, a generation module 704 and a detection module 705;
the extraction module 701 is used for extracting semantic features of images in an image group to obtain a two-dimensional semantic feature map, wherein the image group comprises two-dimensional images acquired by a plurality of cameras;
an estimation module 702, configured to perform depth information estimation on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map;
a compression module 703, configured to perform feature compression on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and perform feature compression on the two-dimensional depth feature map to obtain a one-dimensional depth feature map;
a generating module 704, configured to generate a two-dimensional bird's-eye view feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map;
and a detection module 705, configured to perform target detection based on the two-dimensional bird's-eye view feature map.
As can be seen from the above, in this embodiment, semantic features of the images in an image group are extracted to obtain a two-dimensional semantic feature map; depth information estimation is performed on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map; feature compression is performed on the two-dimensional semantic feature map in the height dimension of the feature map to obtain a one-dimensional semantic feature map, and feature compression is performed on the two-dimensional depth feature map to obtain a one-dimensional depth feature map; a two-dimensional bird's-eye view angle feature map is generated based on the one-dimensional semantic feature map and the one-dimensional depth feature map; and target detection is performed based on the two-dimensional bird's-eye view angle feature map.
Compared with the prior art, in the embodiments of the present application, when the image features are converted from the camera view angle to the bird's-eye view angle, considering that the two-dimensional image has information redundancy in the height dimension, the two-dimensional image features and the two-dimensional depth information under the camera view angle are compressed into one-dimensional image features and one-dimensional depth information along the height dimension, and the two-dimensional bird's-eye view angle feature map used for 3D target detection is generated based on the one-dimensional image features and the one-dimensional depth information.
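For illustration only, the data flow through the five modules above can be sketched as a thin wrapper that chains the stages in order. The class name, the use of PyTorch, and the decision to pass each stage in as a callable are assumptions made for this sketch and are not specified by the present application.

import torch.nn as nn

class TargetDetector(nn.Module):
    # Skeleton mirroring the apparatus of Fig. 7; the internal layers of each
    # stage are deliberately left abstract because they are not fixed here.
    def __init__(self, extraction, estimation, compression, generation, detection):
        super().__init__()
        self.extraction = extraction    # module 701: images -> 2D semantic feature map
        self.estimation = estimation    # module 702: 2D semantic map -> 2D depth map
        self.compression = compression  # module 703: 2D maps -> 1D maps (height collapsed)
        self.generation = generation    # module 704: 1D maps -> 2D bird's-eye view map
        self.detection = detection      # module 705: bird's-eye view map -> 3D detections

    def forward(self, images):
        feat_2d = self.extraction(images)
        depth_2d = self.estimation(feat_2d)
        feat_1d, depth_1d = self.compression(feat_2d, depth_2d)
        bev = self.generation(feat_1d, depth_1d)
        return self.detection(bev)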
Optionally, as an embodiment, the generation module 704 may include:
the projection submodule is used for projecting the features in the one-dimensional semantic feature map into the two-dimensional bird's-eye view angle feature map based on the static transformation matrix, with the depth information in the one-dimensional depth feature map as guidance;
wherein the static transformation matrix is generated based on intrinsic parameter information and extrinsic parameter information of the cameras, the static transformation matrix includes coordinate transformation information between a polar coordinate system and a Cartesian coordinate system, and the coordinate transformation information is used for transforming the semantic features and the depth information in the polar coordinate system into two-dimensional bird's-eye view angle features in the Cartesian coordinate system.
Optionally, as an embodiment, the static transformation matrix includes a ring matrix and a ray matrix, wherein the ring matrix contains depth information of the cameras, and the ray matrix contains direction information of the cameras.
Optionally, as an embodiment, the projection submodule may include:
the first operation unit is used for performing a matrix cross-multiplication operation on the ring matrix and the depth information in the one-dimensional depth feature map to obtain an intermediate matrix;
the second operation unit is used for performing a matrix dot-product operation on the intermediate matrix and the ray matrix to obtain a projection matrix;
and the third operation unit is used for performing a matrix cross-multiplication operation on the projection matrix and the features in the one-dimensional semantic feature map to obtain the two-dimensional bird's-eye view angle feature map.
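For illustration only, the three operation units can be sketched as tensor contractions. The shapes below, the assumption that each width position of the one-dimensional maps corresponds to one line of sight, and the einsum formulation are all assumptions made for this sketch rather than details taken from the present application.

import torch

S, C, W = 6, 64, 44        # assumed: number of cameras, feature channels, width positions
M, L = 60, 128             # assumed: number of depth bins, side length of the BEV grid

feat_1d  = torch.randn(S, C, W)                   # one-dimensional semantic feature map
depth_1d = torch.randn(S, M, W).softmax(dim=1)    # one-dimensional depth feature map
ring     = torch.rand(M, L, L)                    # ring matrix: one L x L mask per depth value
ray      = torch.rand(S, W, L, L)                 # ray matrix: one L x L mask per camera column

# First operation unit: cross-multiply the ring matrix with the depth information,
# spreading each column's depth distribution over the matching rings of the BEV grid.
intermediate = torch.einsum('mxy,smw->swxy', ring, depth_1d)   # (S, W, L, L)

# Second operation unit: dot (element-wise) product with the ray matrix keeps only
# the cells lying on each column's line of sight, giving the projection matrix.
projection = intermediate * ray                                # (S, W, L, L)

# Third operation unit: cross-multiply the projection matrix with the one-dimensional
# semantic features to accumulate the two-dimensional bird's-eye view feature map.
bev = torch.einsum('swxy,scw->cxy', projection, feat_1d)       # (C, L, L)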
Optionally, as an embodiment, the generating process of the ring matrix may include:
constructing a first initial matrix with a size of L x L, and setting M depth values d, wherein L is the size of the two-dimensional bird's-eye view angle feature map, M is the number of channels of depth information in the one-dimensional semantic feature map, and L and M are integers greater than 1;
for each depth value d, at each position of the first initial matrix, if the depth from the actual coordinate corresponding to the position to the camera is equal to d, setting the value of the position to 1, otherwise setting it to 0, to obtain M matrices with a size of L x L;
and superposing the M matrices with a size of L x L to obtain the ring matrix.
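A minimal sketch of this construction, under the assumption that the camera sits at the centre of an L x L grid with a fixed metric cell size, that the M depth values d are evenly spaced, and that "equal to d" is taken as equal within half a cell; none of these choices are stated in the present application.

import torch

L, M = 128, 60                       # assumed grid side length and number of depth values
cell_size = 0.8                      # assumed metres per BEV cell
cam_x, cam_y = L / 2, L / 2          # assumed camera position at the grid centre
depth_values = torch.arange(1, M + 1) * cell_size        # the M depth values d

ys, xs = torch.meshgrid(torch.arange(L).float(), torch.arange(L).float(), indexing='ij')
# Depth from the actual coordinate of every grid position to the camera.
dist = torch.hypot(xs - cam_x, ys - cam_y) * cell_size

# For each depth value d: 1 where the position's depth equals d (within half a cell), else 0.
masks = [((dist - d).abs() < cell_size / 2).float() for d in depth_values]
ring = torch.stack(masks)            # (M, L, L): the M masks superposed into the ring matrix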
Optionally, as an embodiment, the generating process of the ray matrix may include:
constructing a second initial matrix with a size of L x L;
for each sight line direction of each camera, at each position of the second initial matrix, if the position lies in the sight line direction, setting the value of the position to 1, otherwise setting it to 0, to obtain N matrices with a size of L x L, wherein N = S x T, S is the number of cameras, T is the number of sight line directions of a single camera, and S and T are integers greater than 1;
and superposing the N matrices with a size of L x L to obtain the ray matrix.
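A sketch under the same assumed L x L grid; here the S x T sight line directions are taken to be evenly spaced bearings around a single shared origin, and a position counts as lying in a sight line when its bearing is within half of the angular step. Both simplifications are assumptions for illustration, not details from the present application.

import math
import torch

L = 128
S, T = 6, 44                          # assumed: number of cameras, sight line directions per camera
cam_x, cam_y = L / 2, L / 2           # assumed shared origin of all sight lines

ys, xs = torch.meshgrid(torch.arange(L).float(), torch.arange(L).float(), indexing='ij')
bearing = torch.atan2(ys - cam_y, xs - cam_x)       # bearing of every grid position

# N = S * T evenly spaced sight line directions and the angular half-width of each.
angles = torch.linspace(-math.pi, math.pi, S * T + 1)[:-1]
half_step = math.pi / (S * T)

# For each sight line direction: 1 where the position lies in that direction, else 0.
masks = [((((bearing - a) + math.pi) % (2 * math.pi) - math.pi).abs() <= half_step).float()
         for a in angles]
ray = torch.stack(masks)              # (N, L, L): the N masks superposed into the ray matrix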
Optionally, as an embodiment, the generation module 704 may include:
the first generation submodule is used for performing a matrix outer product operation on the features in the one-dimensional semantic feature map and the depth information in the one-dimensional depth feature map to obtain a one-dimensional bird's-eye view angle feature map;
and the second generation submodule is used for performing feature sampling or voxel pooling on the one-dimensional bird's-eye view angle feature map to generate the two-dimensional bird's-eye view angle feature map.
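For this alternative generation path, the sketch below takes the outer product along the channel and depth axes and then performs a simple voxel pooling with index_add_. The shapes are assumptions, and the mapping from a (depth bin, column) pair to a BEV cell is shown as a random placeholder because in practice it would be precomputed from the camera geometry.

import torch

S, C, M, W, L = 6, 64, 60, 44, 128                 # assumed sizes
feat_1d  = torch.randn(S, C, W)                    # one-dimensional semantic feature map
depth_1d = torch.randn(S, M, W).softmax(dim=1)     # one-dimensional depth feature map

# Matrix outer product per column: one feature vector for every (depth bin, column) pair,
# i.e. a polar, "one-dimensional bird's-eye view" feature map of shape (S, C, M, W).
polar = torch.einsum('scw,smw->scmw', feat_1d, depth_1d)

# Voxel pooling: accumulate each (depth bin, column) cell into its BEV grid cell.
cell_id = torch.randint(0, L * L, (S, M, W))       # placeholder (depth bin, column) -> cell index
bev = torch.zeros(C, L * L)
bev.index_add_(1, cell_id.reshape(-1), polar.permute(1, 0, 2, 3).reshape(C, -1))
bev = bev.view(C, L, L)                            # two-dimensional bird's-eye view feature map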
Optionally, as an embodiment, the compression module 703 may include:
and the first compression submodule is used for performing maximum pooling processing on the two-dimensional semantic feature map in the height dimension to obtain a one-dimensional semantic feature map.
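A one-line sketch of this compression, assuming the two-dimensional semantic feature map is laid out as (cameras, channels, height, width); the sizes are placeholders.

import torch

feat_2d = torch.randn(6, 64, 16, 44)     # assumed layout: (cameras, channels, height, width)

# Maximum pooling over the height dimension collapses it, leaving one feature per column.
feat_1d = feat_2d.max(dim=2).values      # (6, 64, 44): one-dimensional semantic feature map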
Optionally, as an embodiment, the compression module 703 may include:
the prediction submodule is used for predicting the weight of the depth information corresponding to each point in the two-dimensional semantic feature map;
and the second compression submodule is used for performing a dot product operation on the depth information of each point in the two-dimensional depth feature map and the weight at the corresponding position, and summing over the height dimension to obtain the one-dimensional depth feature map.
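A sketch of this weighted compression under the same assumed layouts. The 1x1 convolution used to predict a weight for each point and the softmax over the height dimension that normalises the weights are assumptions about one reasonable realisation of the prediction submodule, not details given in the present application.

import torch
import torch.nn as nn

S, C, M, H, W = 6, 64, 60, 16, 44                    # assumed sizes
feat_2d  = torch.randn(S, C, H, W)                   # two-dimensional semantic feature map
depth_2d = torch.randn(S, M, H, W).softmax(dim=1)    # two-dimensional depth feature map

# Predict one weight per point of the semantic map, normalised over the height dimension.
weight_head = nn.Conv2d(C, 1, kernel_size=1)
weights = weight_head(feat_2d).softmax(dim=2)        # (S, 1, H, W)

# Dot-multiply the depth of each point by the weight at the same position, sum over height.
depth_1d = (depth_2d * weights).sum(dim=2)           # (S, M, W): one-dimensional depth feature map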
Optionally, as an embodiment, the estimation module 702 may include:
the depth estimation submodule is used for estimating corresponding depth information from the two-dimensional semantic feature map by taking the intrinsic parameter information of the cameras as auxiliary reference information, to obtain a two-dimensional depth feature map.
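One possible way (an assumption, not the method fixed by the present application) to let the camera intrinsic parameters assist depth estimation is to embed the flattened intrinsic matrix, broadcast it over the feature map, and predict a per-pixel depth distribution. Layer choices and sizes are illustrative only.

import torch
import torch.nn as nn

S, C, H, W, M = 6, 64, 16, 44, 60                    # assumed sizes
feat_2d    = torch.randn(S, C, H, W)                 # two-dimensional semantic feature map
intrinsics = torch.randn(S, 3, 3)                    # per-camera intrinsic parameter matrices

embed      = nn.Linear(9, C)                         # embed the flattened 3 x 3 intrinsics
depth_head = nn.Conv2d(2 * C, M, kernel_size=1)      # predict M depth bins per pixel

k = embed(intrinsics.flatten(1))[:, :, None, None].expand(-1, -1, H, W)   # (S, C, H, W)
depth_2d = depth_head(torch.cat([feat_2d, k], dim=1)).softmax(dim=1)      # (S, M, H, W)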
For any step performed by each module in the target detection apparatus provided by the present application, reference may be made to the corresponding operation described in the embodiments of the target detection method; details are not repeated here.
For the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present application. The electronic device includes a processing component 822, which further includes one or more processors, and memory resources represented by a memory 832 for storing instructions, such as application programs, executable by the processing component 822. The application programs stored in the memory 832 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 822 is configured to execute the instructions to perform the above-described method.
The electronic device may also include a power component 826 configured to perform power management of the electronic device, a wired or wireless network interface 850 configured to connect the electronic device to a network, and an input/output (I/O) interface 858. The electronic device may operate based on an operating system stored in the memory 832, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
According to yet another embodiment of the present application, there is also provided a computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, implement the steps in the object detection method as described in any one of the above embodiments.
According to yet another embodiment of the present application, there is also provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in the object detection method according to any one of the above embodiments.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between such entities or operations. Moreover, the terms "include", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The target detection method, the electronic device, and the storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, changes may be made to the specific implementation and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (13)

1. A target detection method, the method comprising:
extracting semantic features of images in an image group to obtain a two-dimensional semantic feature map, wherein the image group comprises two-dimensional images collected by a plurality of cameras;
carrying out depth information estimation on the two-dimensional semantic feature map to obtain a two-dimensional depth feature map;
performing feature compression on the two-dimensional semantic feature map on the height dimension of the feature map to obtain a one-dimensional semantic feature map, and performing feature compression on the two-dimensional depth feature map to obtain a one-dimensional depth feature map;
generating a two-dimensional bird's eye view angle feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map;
and performing target detection based on the two-dimensional bird's eye view angle feature map.
2. The method of claim 1, wherein generating a two-dimensional bird's eye view angle feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map comprises:
projecting, based on a static transformation matrix and with the depth information in the one-dimensional depth feature map as guidance, features in the one-dimensional semantic feature map into the two-dimensional bird's eye view angle feature map;
wherein the static transformation matrix is generated based on intrinsic parameter information and extrinsic parameter information of the cameras, the static transformation matrix comprises coordinate transformation information between a polar coordinate system and a Cartesian coordinate system, and the coordinate transformation information is used for transforming semantic features and depth information in the polar coordinate system into two-dimensional bird's eye view angle features in the Cartesian coordinate system.
3. The method of claim 2, wherein the static transformation matrix comprises a ring matrix and a ray matrix, wherein the ring matrix comprises depth information of the cameras, and the ray matrix comprises direction information of the cameras.
4. The method according to claim 3, wherein the projecting the features in the one-dimensional semantic feature map into the two-dimensional bird's eye view angle feature map with the depth information in the one-dimensional depth feature map as a guide based on the static transformation matrix comprises:
performing a matrix cross-multiplication operation on the ring matrix and the depth information in the one-dimensional depth feature map to obtain an intermediate matrix;
performing a matrix dot-product operation on the intermediate matrix and the ray matrix to obtain a projection matrix;
and performing a matrix cross-multiplication operation on the projection matrix and the features in the one-dimensional semantic feature map to obtain the two-dimensional bird's eye view angle feature map.
5. The method of claim 3, wherein the generation of the ring matrix comprises:
constructing a first initial matrix with a size of L x L, and setting M depth values d, wherein L is the size of the two-dimensional bird's eye view angle feature map, M is the number of channels of depth information in the one-dimensional semantic feature map, and L and M are integers greater than 1;
for each depth value d, at each position of the first initial matrix, if the depth from the actual coordinate corresponding to the position to the camera is equal to d, setting the value of the position to be 1, otherwise, setting the value to be 0, and obtaining M matrixes with the size of L x L;
and superposing the M matrixes with the size of L x L to obtain the ring matrix.
6. The method of claim 3, wherein the generating of the ray matrix comprises:
constructing a second initial matrix with a size of L x L;
for each sight line direction of each camera, at each position of the second initial matrix, if the position is in the sight line direction, setting the value of the position to be 1, otherwise, setting the value to be 0, and obtaining N matrixes with the size of L x L, wherein N = S x T, S is the number of the cameras, T is the number of the sight line directions of a single camera, and S and T are integers more than 1;
and superposing the N matrixes with the size of L x L to obtain the ray matrix.
7. The method of claim 1, wherein generating a two-dimensional bird's eye view angle feature map based on the one-dimensional semantic feature map and the one-dimensional depth feature map comprises:
performing a matrix outer product operation on the features in the one-dimensional semantic feature map and the depth information in the one-dimensional depth feature map to obtain a one-dimensional bird's eye view angle feature map;
and performing feature sampling or voxel pooling on the one-dimensional bird's eye view angle feature map to generate the two-dimensional bird's eye view angle feature map.
8. The method according to claim 1, wherein the performing feature compression on the two-dimensional semantic feature map to obtain a one-dimensional semantic feature map comprises:
and performing maximum pooling on the two-dimensional semantic feature map in a height dimension to obtain a one-dimensional semantic feature map.
9. The method of claim 1, wherein the feature compressing the two-dimensional depth feature map to obtain a one-dimensional depth feature map comprises:
predicting the weight of the depth information corresponding to each point in the two-dimensional semantic feature map;
and performing a dot product operation on the depth information of each point in the two-dimensional depth feature map and the weight at the corresponding position, and summing over the height dimension to obtain the one-dimensional depth feature map.
10. The method according to claim 1, wherein the estimating depth information of the two-dimensional semantic feature map to obtain a two-dimensional depth feature map comprises:
and estimating corresponding depth information from the two-dimensional semantic feature map by taking the intrinsic parameter information of the cameras as auxiliary reference information, to obtain the two-dimensional depth feature map.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method of any of claims 1-10.
12. A computer-readable storage medium, on which a computer program/instructions is stored, characterized in that the computer program/instructions, when executed by a processor, implements the method of any of claims 1-10.
13. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the method of any of claims 1-10.
CN202211214510.XA 2022-09-30 2022-09-30 Target detection method, electronic device, and storage medium Pending CN115641581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211214510.XA CN115641581A (en) 2022-09-30 2022-09-30 Target detection method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211214510.XA CN115641581A (en) 2022-09-30 2022-09-30 Target detection method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN115641581A true CN115641581A (en) 2023-01-24

Family

ID=84942589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211214510.XA Pending CN115641581A (en) 2022-09-30 2022-09-30 Target detection method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN115641581A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704472A (en) * 2023-05-15 2023-09-05 小米汽车科技有限公司 Image processing method, device, apparatus, medium, and program product
CN116704472B (en) * 2023-05-15 2024-04-02 小米汽车科技有限公司 Image processing method, device, apparatus, medium, and program product

Similar Documents

Publication Publication Date Title
Xu et al. Cobevt: Cooperative bird's eye view semantic segmentation with sparse transformers
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
Fischer et al. Flownet: Learning optical flow with convolutional networks
US11836884B2 (en) Real-time generation of functional road maps
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN114387265A (en) Anchor-frame-free detection and tracking unified method based on attention module addition
CN113256699B (en) Image processing method, image processing device, computer equipment and storage medium
CN112132770A (en) Image restoration method and device, computer readable medium and electronic equipment
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN116597260A (en) Image processing method, electronic device, storage medium, and computer program product
CN115641581A (en) Target detection method, electronic device, and storage medium
CN116597336A (en) Video processing method, electronic device, storage medium, and computer program product
Sun et al. Road and car extraction using UAV images via efficient dual contextual parsing network
CN115620122A (en) Training method of neural network model, image re-recognition method and related equipment
CN116399360A (en) Vehicle path planning method
CN116977804A (en) Image fusion method, electronic device, storage medium and computer program product
Mehta et al. Identifying most walkable direction for navigation in an outdoor environment
CN116012609A (en) Multi-target tracking method, device, electronic equipment and medium for looking around fish eyes
CN113065637B (en) Sensing network and data processing method
CN115620403A (en) Living body detection method, electronic device, and storage medium
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN114419338A (en) Image processing method, image processing device, computer equipment and storage medium
CN112463936B (en) Visual question-answering method and system based on three-dimensional information
CN116977235A (en) Image fusion method, electronic device and storage medium
CN116523990A (en) Three-dimensional semantic scene completion method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination