CN116630912A

CN116630912A - Three-dimensional semantic occupation prediction method, system, equipment, medium and product

Info

Publication number: CN116630912A
Application number: CN202310316950.4A
Authority: CN
Inventors: 张云鹏; 朱政; 都大龙
Original assignee: Beijing Jianzhi Technology Co ltd
Current assignee: Beijing Jianzhi Technology Co ltd
Priority date: 2023-03-24
Filing date: 2023-03-24
Publication date: 2023-08-22

Abstract

The disclosure relates to a three-dimensional semantic occupancy prediction method, system, device, medium and product. The method comprises: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing the initial three-dimensional grid features into local ranges and extracting features within each local range to obtain three-dimensional features, pooling the initial three-dimensional grid features to obtain planar features, and obtaining two-dimensional features representing semantic layout information from the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupancy result of the looking-around image according to the multi-scale three-dimensional grid features. The method can reconstruct the whole three-dimensional environment fairly completely and markedly improves the accuracy of three-dimensional semantic occupancy prediction.

Description

Three-dimensional semantic occupation prediction method, system, equipment, medium and product

Technical Field

The disclosure relates to the field of automatic driving, and in particular relates to a three-dimensional semantic occupation prediction method, a system, equipment, a medium and a product.

Background

Three-dimensional semantic occupancy can fully represent the key information in a three-dimensional environment and provides perception results for autonomous-driving decisions. The prior art often uses the tri-perspective view method (TPVFormer) as an intermediate representation from looking-around (surround-view) images to three-dimensional semantic occupancy. This method uses two-dimensional plane features under three viewing angles to represent the whole three-dimensional environment. However, because of the intrinsic difference in dimensionality, this representation loses significant information when visual features are encoded from the multi-view images, and the loss cannot be recovered during decoding and prediction, so the method performs poorly on the three-dimensional semantic occupancy prediction task. For example, assume that the positive directions of the three-dimensional coordinate axes X, Y and Z correspond to the right, forward and upward directions of the vehicle body, respectively. A common driving scenario may contain a series of vehicles travelling in both directions along the X axis. In that case the X-Y plane perceives the vehicles from a top view, whereas on the X-Z plane vehicles travelling in the same direction overlap one another, and on the Y-Z plane vehicles travelling in different directions also overlap one another. The resulting three-dimensional feature therefore contains information mixed together along the three projection rays, which makes it difficult to generate an accurate semantic occupancy prediction.

In addition, after obtaining the three-dimensional grid features, the prior art often uses a simple multi-layer perceptron for the final semantic occupancy prediction. Because predicting semantic categories requires higher-level visual information, such local prediction can hardly exploit background information from a wider range, which hinders more accurate semantic occupancy prediction.
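The information loss of a plane-based representation can be illustrated with a toy projection. The sketch below is only an illustration under assumed data: the three voxel positions and the helper names `project` and `collisions` are hypothetical and do not come from the disclosure. It shows that any two occupied voxels that differ only in the coordinate a plane discards land in the same cell of that plane.

```python
# Hypothetical toy scene: three occupied voxels given as (x, y, z) indices.
objects = {
    "a": (2, 5, 0),
    "b": (7, 5, 0),  # differs from "a" only in x
    "c": (2, 9, 0),  # differs from "a" only in y
}

AXES = {"x": 0, "y": 1, "z": 2}


def project(points, drop_axis):
    """Project voxel indices onto a plane by discarding one coordinate."""
    cells = {}
    for name, point in points.items():
        key = tuple(v for i, v in enumerate(point) if i != AXES[drop_axis])
        cells.setdefault(key, []).append(name)
    return cells


def collisions(cells):
    """Return the groups of objects that fall into the same plane cell."""
    return sorted(sorted(group) for group in cells.values() if len(group) > 1)


xy_plane = project(objects, drop_axis="z")  # top view keeps x and y
xz_plane = project(objects, drop_axis="y")  # discards y
yz_plane = project(objects, drop_axis="x")  # discards x

print(collisions(xy_plane))  # no collisions in the top view
print(collisions(xz_plane))  # "a" and "c" share a cell
print(collisions(yz_plane))  # "a" and "b" share a cell
```

A full three-dimensional grid keeps all three voxels distinct, which is the motivation for operating on three-dimensional grid features rather than on three planes.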

Disclosure of Invention

In order to overcome the problems in the related art, the present disclosure provides a three-dimensional semantic occupation prediction method, system, device, medium, and product.

According to a first aspect of embodiments of the present disclosure, there is provided a three-dimensional semantic occupancy prediction method, including: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles;

extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features;

dividing a local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features;

fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;

and predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a three-dimensional decoder based on a converter.

In some embodiments, extracting the three-dimensional grid feature from the looking-around image to obtain an initial three-dimensional grid feature includes:

inputting the looking-around image into an image encoder to obtain a plurality of visual feature images;

inputting the visual feature images into a visual angle converter to obtain a context feature and a depth distribution feature;

and obtaining the initial three-dimensional grid feature by calculating an outer product of the context feature and the depth distribution feature and further carrying out voxel pooling.

In some embodiments, the dividing the local range of the initial three-dimensional grid feature, extracting features in the local range to obtain a three-dimensional feature, performing pooling processing on the initial three-dimensional grid feature to obtain a planar feature, and obtaining a two-dimensional feature representing semantic layout information according to the planar feature, including:

dividing the initial three-dimensional grid features into grids of a preset size in the horizontal direction, processing the grids based on a shared window attention mechanism, and extracting features within the local range to obtain three-dimensional features;

and average-pooling the initial three-dimensional grid features along the height direction to obtain planar features, and processing the planar features through the shared window attention mechanism and an atrous spatial pyramid pooling (ASPP) module to obtain two-dimensional features representing semantic layout information.

In some embodiments, the fusing the three-dimensional feature and the two-dimensional feature results in a multi-scale three-dimensional grid feature, comprising:

and fusing the three-dimensional features and the two-dimensional features by weighted summation to obtain multi-scale three-dimensional grid features, wherein the weighted summation is guided by weights along the height direction that are predicted from the local (three-dimensional) features.

In some embodiments, predicting a three-dimensional semantic occupancy result of the look-around image based on the multi-scale three-dimensional mesh feature and a converter-based three-dimensional decoder comprises:

based on a multi-scale variable (deformable) attention mechanism, making query features interact with the multi-scale three-dimensional grid features and updating the query features, so that the query features focus increasingly on the semantic information of the current scene as the number of iterations grows, and outputting the three-dimensional semantic occupancy result of the looking-around image, wherein the query features are a series of learnable feature parameters.

Further, the multi-scale variable attention mechanism is an attention mechanism with a mask such that each of the query features interacts only with the corresponding multi-scale three-dimensional grid feature.

According to a second aspect of embodiments of the present disclosure, there is provided a three-dimensional semantic occupancy prediction system comprising:

an acquisition module, configured to acquire a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles;

the first extraction module is used for extracting three-dimensional grid features of the looking-around image to obtain initial three-dimensional grid features;

the second extraction module is used for dividing the local range of the initial three-dimensional grid characteristics, extracting the characteristics in the local range to obtain three-dimensional characteristics, carrying out pooling treatment on the initial three-dimensional grid characteristics to obtain planar characteristics, and obtaining two-dimensional characteristics representing semantic layout information according to the planar characteristics;

the fusion module fuses the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;

and the prediction module predicts a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a three-dimensional decoder based on a converter.

An embodiment of the third aspect of the present application provides an electronic device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, where the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the steps of the three-dimensional semantic occupancy prediction method provided by the embodiment of the first aspect of the present application.

An embodiment of the fourth aspect of the present application provides a non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of a mobile terminal, cause the mobile terminal to perform the steps of the three-dimensional semantic occupancy prediction method provided by the above-described embodiment of the first aspect of the present application.

An embodiment of the fifth aspect of the present application provides a computer program product, wherein instructions in the computer program product, when executed by a processor of a mobile terminal, cause the mobile terminal to perform the steps of the three-dimensional semantic occupancy prediction method provided by the embodiment of the first aspect of the present application.

The technical solution provided by the embodiments of the disclosure can have the following beneficial effects. In the encoding part, the application performs two-dimensional window attention on both the local branch and the global branch, so that the three-dimensional grid features are processed effectively with a small amount of computation to obtain the multi-scale three-dimensional grid features. In the decoding part, by repeatedly applying the deformable attention mechanism, the query features interact with the multi-scale three-dimensional grid features produced by the encoder and are thereby updated, so that the final query features can be decoded into the three-dimensional semantic occupancy result. The resulting three-dimensional semantic occupancy reconstructs the whole three-dimensional environment fairly completely, and the accuracy of three-dimensional semantic occupancy prediction is markedly improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a flow chart illustrating a three-dimensional semantic occupancy prediction method according to an example embodiment.

FIG. 2 is a block diagram illustrating a three-dimensional semantic occupancy prediction system according to an example embodiment.

FIG. 3 is a visual diagram illustrating a predicted three-dimensional semantic occupancy, according to an example embodiment.

FIG. 4 is an internal structure diagram of an electronic device according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of systems and methods that are consistent with aspects of the application as detailed in the accompanying claims.

FIG. 1 is a flow chart of a three-dimensional semantic occupancy prediction method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:

in step S101, a looking-around image is obtained, wherein the looking-around image includes color images at a plurality of viewing angles.

Specifically, surround-view cameras are generally mounted on a vehicle. By photographing the environment around the vehicle with these cameras, color images at a plurality of viewing angles are obtained, that is, the looking-around image of the vehicle's surroundings.

In step S102, three-dimensional grid feature extraction is performed on the looking-around image, so as to obtain an initial three-dimensional grid feature.

Specifically, in order to obtain a three-dimensional semantic occupation result of the looking-around image, three-dimensional grid feature extraction is required to be performed on the looking-around image, so that the looking-around image can be converted into three-dimensional semantic occupation.

Specifically, the input looking-around image is passed through an image encoder to obtain visual feature maps, and the visual feature maps are input into a view angle converter (view transformer). From the visual feature maps, a context feature (Context Feature) and a depth distribution (Depth Distribution) are predicted respectively. The outer product of the context feature and the depth distribution is computed and then voxel-pooled (Voxel Pooling) to obtain a three-dimensional grid feature (3D Feature Volume) based on multi-view visual features, namely the initial three-dimensional grid feature, which establishes a preliminary description of the three-dimensional environment.
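The outer product and voxel pooling described above can be sketched for a single image pixel as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the depth distribution is assumed to come from a softmax over depth logits, voxel pooling is assumed to be a sum over all frustum points that fall into the same voxel, and the mapping `voxel_index` from depth bins to voxels is a hypothetical precomputed input (in practice it would follow from the camera calibration).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(context, depth_logits):
    """Outer product of a depth distribution (D bins) and a context vector (C
    channels): returns D frustum features of C channels each."""
    depth = softmax(depth_logits)  # assumption: depth distribution via softmax
    return [[p * c for c in context] for p in depth]


def voxel_pool(frustum_feats, voxel_index, num_voxels):
    """Sum-pool frustum features into voxels (assumed pooling rule).

    voxel_index[i] is the voxel hit by frustum point i, or None if the point
    lies outside the grid.
    """
    channels = len(frustum_feats[0])
    grid = [[0.0] * channels for _ in range(num_voxels)]
    for feat, voxel in zip(frustum_feats, voxel_index):
        if voxel is None:
            continue
        for k in range(channels):
            grid[voxel][k] += feat[k]
    return grid


# One pixel with a 2-channel context feature and 3 equally likely depth bins.
frustum = lift_pixel(context=[1.0, 2.0], depth_logits=[0.0, 0.0, 0.0])
volume = voxel_pool(frustum, voxel_index=[0, 0, 1], num_voxels=2)
print(volume)
```

In a full pipeline the same two steps would run over every pixel of every camera, so that each voxel accumulates evidence from all views that observe it.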

In step S103, the local range of the initial three-dimensional grid feature is divided, feature extraction is performed in the local range to obtain a three-dimensional feature, pooling processing is performed on the initial three-dimensional grid feature to obtain a planar feature, and a two-dimensional feature representing semantic layout information is obtained according to the planar feature.

Specifically, the initial three-dimensional grid features are processed by a dual-path converter (Transformer) three-dimensional encoder, which handles a local branch and a global branch separately to obtain local three-dimensional features and global two-dimensional features, in preparation for subsequently obtaining dynamic, efficient, multi-scale three-dimensional features. The encoder consists of a plurality of consecutive dual-path converter modules (Dual-path Transformer Block) and convolution modules (Conv Block).

Specifically, in the local branch, the input initial three-dimensional grid feature is divided in the horizontal direction into grids of size 7×7 (the preset size); the grid size may be chosen differently for different situations. The grids are processed with a shared window attention mechanism (Windowed Attention), so that features are extracted within a local range and the local three-dimensional features are obtained. In the global branch, the initial three-dimensional features are average-pooled along the height direction to obtain planar features under the bird's-eye view (BEV features). The planar features are processed by the parameter-sharing window attention module and then by atrous spatial pyramid pooling (ASPP), which is commonly used in the field of semantic segmentation, so that semantic layout information is extracted over the receptive field of the whole scene and the global two-dimensional features are obtained.
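The two data-layout operations of the dual-path encoder, horizontal window partitioning for the local branch and height-wise average pooling for the global branch, can be sketched as follows. The sketch assumes a scalar feature per voxel stored as `volume[x][y][z]` and omits the attention and ASPP computations themselves; the function names are hypothetical.

```python
def partition_windows(size_x, size_y, window=7):
    """Split a horizontal size_x by size_y grid into non-overlapping windows.

    Returns a list of windows, each a list of (x, y) cells. Edge windows are
    simply smaller here; padding the grid is another option.
    """
    windows = []
    for x0 in range(0, size_x, window):
        for y0 in range(0, size_y, window):
            cells = [
                (x, y)
                for x in range(x0, min(x0 + window, size_x))
                for y in range(y0, min(y0 + window, size_y))
            ]
            windows.append(cells)
    return windows


def height_average_pool(volume):
    """Average volume[x][y][z] over z to obtain a BEV plane[x][y]."""
    return [[sum(column) / len(column) for column in row] for row in volume]


windows = partition_windows(14, 14, window=7)
print(len(windows), len(windows[0]))

bev = height_average_pool([[[1.0, 3.0], [2.0, 4.0]]])
print(bev)
```

Attention in the local branch would then be computed independently inside each window, while the global branch would run on the pooled plane, which has no height dimension.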

In step S104, the three-dimensional feature and the two-dimensional feature are fused to obtain a multi-scale three-dimensional grid feature.

Specifically, by fusing the local three-dimensional features and the global two-dimensional features, the multi-scale three-dimensional grid features are obtained, and the problem that the three-dimensional features in the prior art contain mixed information on three rays is avoided, so that a more accurate semantic occupation prediction result is generated.

Specifically, the features extracted by the local branch and the global branch are fused by weighted summation. Because the local branch and the global branch produce three-dimensional and two-dimensional features respectively, the features of the global branch need to be replicated along the height dimension so that features of the same shape can be added. To fuse the local and global information better, weights along the height direction are predicted from the three-dimensional features of the local branch and used to guide the addition.
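One possible reading of this fusion is sketched below. The exact formula is not given in the text, so the form used here, local feature plus a sigmoid-gated copy of the global feature, is an assumption, as are the scalar `scale` and `bias` that stand in for a learned weight predictor. Features are again scalars stored as `local[x][y][z]` and `global_bev[x][y]`.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def fuse(local, global_bev, scale=1.0, bias=0.0):
    """Fuse a 3D local feature with a 2D global feature.

    The global value at (x, y) is reused at every height z, and a per-voxel
    weight predicted from the local feature gates how much of it is added.
    Assumed form: fused = local + sigmoid(scale * local + bias) * global.
    """
    fused = []
    for x, plane in enumerate(local):
        fused_plane = []
        for y, column in enumerate(plane):
            g = global_bev[x][y]  # replicated along the height dimension
            fused_plane.append(
                [v + sigmoid(scale * v + bias) * g for v in column]
            )
        fused.append(fused_plane)
    return fused


# One horizontal cell with two height levels.
out = fuse(local=[[[0.0, 2.0]]], global_bev=[[4.0]])
print(out)
```

Because the weight varies with height, the same bird's-eye-view value can contribute strongly at one height and weakly at another, which is what lets a two-dimensional layout feature inform a three-dimensional grid.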

In step S105, a three-dimensional semantic occupation result of the look-around image is predicted based on the multi-scale three-dimensional grid feature and a three-dimensional decoder based on a converter.

Specifically, after the multi-scale three-dimensional grid features are obtained in the mode, final three-dimensional semantic occupation result prediction is achieved on the multi-scale three-dimensional grid features through a decoder.

Specifically, the obtained multi-scale three-dimensional grid features are input into a converter-based (Transformer-based) three-dimensional decoder for decoding. The decoder mainly comprises two parts: (1) a multi-scale variable attention mechanism (Multi-scale 3D Deformable Attention) for fusing the input three-dimensional features (Voxel Features); and (2) a converter structure for iteratively updating the query features (Query Features). In the first part, interaction and fusion among the input multi-scale three-dimensional grid features is necessary because features at different scales emphasize local details and overall semantics to different degrees (for example, lower-level features focus more on details and higher-level features focus more on semantics). The application uses the variable (deformable) attention mechanism to realize this interaction: each input multi-scale three-dimensional grid feature predicts a series of sampling points and sampling weights, samples feature information from the corresponding positions of the other three-dimensional features, and updates its own features accordingly. Through this process, the three-dimensional features at each scale can contain both local and global information. In the second part, the final three-dimensional semantic occupancy prediction is implemented based on query features and attention mechanisms. The query features are a series of learnable feature parameters. During decoding and prediction, the query features are updated over multiple iterations, and in each iteration the following operations are performed in sequence:

(1) The query features interact with the previously obtained multi-scale three-dimensional grid features through the multi-scale variable attention mechanism.

(2) The different query features interact with one another through an attention mechanism. Because different query features usually focus on different categories of information, this interaction allows background information from different semantics to be exchanged adequately.

(3) Each query feature updates its own features through a multi-layer perceptron, which can be understood as a general feature computation step. After each iteration is completed, category scores and the corresponding three-dimensional semantic occupancy results are obtained from the query features through linear-layer operations, and are used either for the next iteration or as the final output.
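The three operations of one iteration, followed by the per-iteration outputs, can be sketched in a simplified form. This is a toy under several assumptions: plain scaled dot-product attention stands in for the multi-scale deformable attention of operation (1), the feed-forward step has no learned weights, L2 normalization stands in for layer normalization, and the occupancy result of a query is taken to be the sigmoid of its dot product with each voxel feature. All function names and the parameter `class_weights` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(vector):
    """Scale to unit L2 norm (stand-in for layer normalization)."""
    norm = math.sqrt(dot(vector, vector)) or 1.0
    return [v / norm for v in vector]


def attend(queries, keys, values):
    """Scaled dot-product attention with a residual connection."""
    scale = math.sqrt(len(keys[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(q))
        ]
        updated.append(normalize([qc + mc for qc, mc in zip(q, mixed)]))
    return updated


def feed_forward(queries):
    """Stand-in for the multi-layer perceptron: residual ReLU, no weights."""
    return [normalize([qc + max(qc, 0.0) for qc in q]) for q in queries]


def decode(queries, voxels, class_weights, num_iters=3):
    """Run the iterative query update and collect per-iteration outputs."""
    outputs = []
    for _ in range(num_iters):
        queries = attend(queries, voxels, voxels)    # (1) query-to-voxel
        queries = attend(queries, queries, queries)  # (2) query-to-query
        queries = feed_forward(queries)              # (3) per-query update
        # Linear layer: one score per class for every query.
        scores = [[dot(q, w) for w in class_weights] for q in queries]
        # Assumed occupancy head: sigmoid of query-voxel similarity.
        masks = [
            [1.0 / (1.0 + math.exp(-dot(q, v))) for v in voxels]
            for q in queries
        ]
        outputs.append((scores, masks))
    return outputs


outputs = decode(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    voxels=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    class_weights=[[1.0, 0.0], [0.0, 1.0]],
    num_iters=2,
)
final_scores, final_masks = outputs[-1]
print(final_scores)
print(final_masks)
```

Each entry of `outputs` corresponds to one iteration, so an intermediate result can either feed the next iteration or be taken as the final prediction.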

FIG. 3 visualizes the three-dimensional semantic occupancy predicted by the present application from six looking-around images of the nuScenes dataset. The upper left of the figure shows the input images, the upper right shows the semantic occupancy rendered from each camera's viewing angle, and the lower part shows the semantic occupancy result from two global viewing angles. It can be seen that the prediction reconstructs the whole three-dimensional environment well and recognizes vehicles, pedestrians, drivable areas, trees, buildings, road barriers and the like well.

In some embodiments, the multi-scale variable attention mechanism is an attention mechanism with a mask such that each of the query features interacts only with the corresponding multi-scale three-dimensional grid feature.

Specifically, the interaction through the multi-scale variable attention mechanism in step (1) uses an attention mechanism with a mask, that is, each query feature interacts only with its corresponding foreground region. For example, if a query feature tends to identify vehicles, that query interacts only with the three-dimensional feature region predicted to be a vehicle, where the prediction usually comes from the previous iteration.
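The masking described above can be sketched for a single query. The sketch assumes that the previous iteration supplies one occupancy probability per voxel for this query, that the foreground is obtained by thresholding at 0.5, and that an empty foreground falls back to attending to all voxels; the threshold, the fallback and the function names are assumptions, and plain dot-product attention again stands in for deformable attention.

```python
import math


def foreground(prev_mask_probs, threshold=0.5):
    """Binarize the previous iteration's occupancy prediction for one query."""
    return [p > threshold for p in prev_mask_probs]


def masked_attention(query, keys, values, allowed):
    """Attention for one query restricted to positions where allowed is True."""
    indices = [j for j, ok in enumerate(allowed) if ok]
    if not indices:  # assumed fallback: empty mask means attend everywhere
        indices = list(range(len(keys)))
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, keys[j])) / scale for j in indices
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    out = [0.0] * len(values[0])
    for e, j in zip(exps, indices):
        for c in range(len(out)):
            out[c] += (e / total) * values[j][c]
    return out


voxels = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
allowed = foreground([0.9, 0.8, 0.1])  # third voxel was predicted background
result = masked_attention([1.0, 0.0], voxels, voxels, allowed)
print(allowed, result)
```

In this toy the third voxel has large feature values but contributes nothing, because the previous prediction placed it outside the query's foreground.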

FIG. 2 is a block diagram of a three-dimensional semantic occupancy prediction system, shown according to an example embodiment. Referring to fig. 2, the system includes an acquisition module 201, a first extraction module 202, a second extraction module 203, a fusion module 204, and a prediction module 205.

An acquisition module 201, configured to acquire a looking-around image, where the looking-around image includes color images under multiple viewing angles;

the first extraction module 202 performs three-dimensional grid feature extraction on the looking-around image to obtain initial three-dimensional grid features;

the second extraction module 203 divides the local range of the initial three-dimensional grid feature, performs feature extraction in the local range to obtain a three-dimensional feature, performs pooling processing on the initial three-dimensional grid feature to obtain a planar feature, and obtains a two-dimensional feature representing semantic layout information according to the planar feature;

a fusion module 204 for fusing the three-dimensional feature and the two-dimensional feature to obtain a multi-scale three-dimensional grid feature;

a prediction module 205 predicts a three-dimensional semantic occupation result of the look-around image based on the multi-scale three-dimensional grid features and a converter-based three-dimensional decoder.

With regard to the system of the above embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
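The way the five modules chain together can be sketched as a simple composition. The class name `OccupancyPipeline` and its field names are hypothetical, and each module is represented only by a callable so that the data flow from module 201 to module 205 is visible; nothing here reflects how the modules are actually implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class OccupancyPipeline:
    """Chains stand-ins for modules 201 to 205 in the order described above."""

    acquire: Callable[[], Any]             # acquisition module 201
    extract_initial: Callable[[Any], Any]  # first extraction module 202
    extract_dual: Callable[[Any], tuple]   # second extraction module 203
    fuse: Callable[[Any, Any], Any]        # fusion module 204
    predict: Callable[[Any], Any]          # prediction module 205

    def run(self) -> Any:
        images = self.acquire()
        volume = self.extract_initial(images)
        feat_3d, feat_2d = self.extract_dual(volume)
        fused = self.fuse(feat_3d, feat_2d)
        return self.predict(fused)


# String stubs make the order of operations visible in the output.
pipeline = OccupancyPipeline(
    acquire=lambda: "img",
    extract_initial=lambda images: images + ">vol",
    extract_dual=lambda volume: (volume + ">3d", volume + ">2d"),
    fuse=lambda a, b: a + "+" + b,
    predict=lambda fused: "occ(" + fused + ")",
)
print(pipeline.run())
```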

In one embodiment, an electronic device is provided. The electronic device may be a terminal, and its internal structure may be as shown in FIG. 4. The electronic device includes a processor, a memory, a communication interface, a display screen and an input device connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the electronic device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through Wi-Fi, a carrier network, Near Field Communication (NFC) or other technologies. The computer program, when executed by the processor, implements the three-dimensional semantic occupancy prediction method. The display screen of the electronic device may be a liquid crystal display or an electronic ink display. The input device of the electronic device may be a touch layer covering the display screen, a key, trackball or touchpad arranged on the housing of the electronic device, or an external keyboard, touchpad or mouse.

It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, the three-dimensional semantic occupancy prediction system provided by the present application may be implemented in the form of a computer program that is executable on an electronic device as shown in fig. 4. The memory of the electronic device may store the various program modules that make up the three-dimensional semantic occupancy prediction system.

At least one instruction, at least one program, a code set, or an instruction set is stored in a memory of the electronic device, and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the three-dimensional semantic occupancy prediction method according to any one of the embodiments. For example, implementing a three-dimensional semantic occupancy prediction method, including: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing a local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a three-dimensional decoder based on a converter.

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing a local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a three-dimensional decoder based on a converter.

In one embodiment, a computer program product is provided, which when executed by a processor of a mobile terminal, causes the mobile terminal to perform the steps of: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing a local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a three-dimensional decoder based on a converter.

Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program, which may be stored on a non-transitory computer readable storage medium, that when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application. Although they are described in relative detail, they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A three-dimensional semantic occupancy prediction method, comprising:

obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles;

2. The three-dimensional semantic occupancy prediction method of claim 1, wherein performing three-dimensional mesh feature extraction on the look-around image to obtain an initial three-dimensional mesh feature comprises:

3. The method for predicting three-dimensional semantic occupation according to claim 1, wherein the steps of dividing the local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features comprise:

4. The method of claim 1, wherein the fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features comprises:

5. The three-dimensional semantic occupancy prediction method of claim 1, wherein predicting the three-dimensional semantic occupancy result of the look-around image based on the multi-scale three-dimensional mesh features and a converter-based three-dimensional decoder comprises:

6. The three-dimensional semantic occupancy prediction method of claim 5, wherein the multi-scale variable attention mechanism is an attention mechanism with a mask such that each of the query features interacts only with the corresponding multi-scale three-dimensional grid feature.

7. A three-dimensional semantic occupancy prediction system, comprising:

8. An electronic device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, code set, or instruction set, the instruction, program, code set, or instruction set being loaded and executed by the processor to implement the three-dimensional semantic occupancy prediction method of any one of claims 1-6.

9. A non-transitory computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the three-dimensional semantic occupancy prediction method according to any one of claims 1-6.

10. A computer program product, characterized in that instructions in the computer program product, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the three-dimensional semantic occupancy prediction method according to any one of claims 1-6.