CN116630912A - Three-dimensional semantic occupation prediction method, system, equipment, medium and product - Google Patents
Publication number: CN116630912A; Application number: CN202310316950.4A; Authority: CN (China); Legal status: Pending (the legal status is an assumption and is not a legal conclusion).
Classifications
- G06V20/56 — Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/64 — Three-dimensional objects
- Y04S10/50 — Systems or methods supporting power network operation or management
Abstract
The disclosure relates to a three-dimensional semantic occupation prediction method, system, device, medium, and product, comprising: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing the initial three-dimensional grid features into local ranges and extracting features within each local range to obtain three-dimensional features; pooling the initial three-dimensional grid features to obtain planar features, and deriving two-dimensional features representing semantic layout information from the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupation result of the looking-around image from the multi-scale three-dimensional grid features. The method can completely restore the entire three-dimensional environment and significantly improves the accuracy of three-dimensional semantic occupation prediction.
Description
Technical Field
The disclosure relates to the field of automatic driving, and in particular relates to a three-dimensional semantic occupation prediction method, a system, equipment, a medium and a product.
Background
Three-dimensional semantic occupation can completely represent the key information in a three-dimensional environment and provides a perception result for automatic driving decisions. The prior art often adopts a tri-perspective view method (denoted TPVFormer in English) as an intermediate representation from surrounding images to three-dimensional semantic occupation. The tri-perspective view method uses two-dimensional planar features under three viewing angles as the feature representation of the entire three-dimensional environment. However, due to the intrinsic difference in dimensionality, this feature representation suffers obvious information loss when encoding visual features from the multi-view images, and the loss cannot be recovered in the decoding and prediction process, so the method performs poorly on the three-dimensional semantic occupation prediction task. For example, assume the positive directions of the three-dimensional coordinate axes X, Y, Z correspond to the right, forward, and upward directions of the vehicle body, respectively. A common driving scenario may contain a series of vehicles traveling bidirectionally along the X-axis. In this case the X-Y plane can perceive the vehicles from a top-down view, but on the X-Z plane vehicles traveling in the same direction overlap one another, and on the Y-Z plane vehicles traveling in different directions overlap one another, so the resulting three-dimensional features contain mixed information from features along the three projection rays, making it difficult to produce an accurate semantic occupation prediction result. In addition, after obtaining three-dimensional grid features, the prior art often adopts a simple multi-layer perceptron to perform the final semantic occupation prediction.
Because predicting semantic categories requires higher-level visual information, such local prediction struggles to exploit wider-range background information, which is unfavorable for more accurate semantic occupation prediction.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a three-dimensional semantic occupation prediction method, system, device, medium, and product.
According to a first aspect of embodiments of the present disclosure, there is provided a three-dimensional semantic occupancy prediction method, including: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles;
extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features;
dividing a local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features;
fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;
and predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
In some embodiments, extracting the three-dimensional grid feature from the looking-around image to obtain an initial three-dimensional grid feature includes:
inputting the looking-around image into an image encoder to obtain a plurality of visual feature maps;
inputting the visual feature maps into a view transformer to obtain context features and depth distribution features;
and obtaining the initial three-dimensional grid features by computing an outer product of the context features and the depth distribution features and then performing voxel pooling.
In some embodiments, the dividing the local range of the initial three-dimensional grid feature, extracting features in the local range to obtain a three-dimensional feature, performing pooling processing on the initial three-dimensional grid feature to obtain a planar feature, and obtaining a two-dimensional feature representing semantic layout information according to the planar feature, including:
dividing the initial three-dimensional grid features in the horizontal direction into grids of a preset size, processing the grids based on a shared window attention mechanism, and extracting features within the local range to obtain three-dimensional features;
and average-pooling the initial three-dimensional grid features in the height direction to obtain planar features, processing the planar features with the shared window attention mechanism and an atrous spatial pyramid pooling (ASPP) module, and obtaining two-dimensional features representing semantic layout information from the planar features.
In some embodiments, the fusing the three-dimensional feature and the two-dimensional feature results in a multi-scale three-dimensional grid feature, comprising:
and fusing the three-dimensional features and the two-dimensional features by weighted summation to obtain multi-scale three-dimensional grid features, wherein the weighted summation is guided by height-direction weights predicted from the local features.
In some embodiments, predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder comprises:
based on a multi-scale deformable attention mechanism, having query features interact with the multi-scale three-dimensional grid features and updating the query features, so that as the number of iterations increases the query features attend more closely to the semantic information in the current scene, in order to output a three-dimensional semantic occupation result of the looking-around image, wherein the query features are a series of learnable feature parameters.
Further, the multi-scale deformable attention mechanism is an attention mechanism with a mask, such that each query feature interacts only with its corresponding multi-scale three-dimensional grid features.
According to a second aspect of embodiments of the present disclosure, there is provided a three-dimensional semantic occupancy prediction system comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring a looking-around image, and the looking-around image comprises color images under a plurality of visual angles;
the first extraction module is used for extracting three-dimensional grid features of the looking-around image to obtain initial three-dimensional grid features;
the second extraction module is used for dividing the local range of the initial three-dimensional grid characteristics, extracting the characteristics in the local range to obtain three-dimensional characteristics, carrying out pooling treatment on the initial three-dimensional grid characteristics to obtain planar characteristics, and obtaining two-dimensional characteristics representing semantic layout information according to the planar characteristics;
the fusion module fuses the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;
and the prediction module is used for predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
An embodiment of the third aspect of the present application provides an electronic device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, where the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the steps of the three-dimensional semantic occupancy prediction method provided by the embodiment of the first aspect of the present application.
An embodiment of the fourth aspect of the present application provides a non-transitory computer-readable storage medium storing instructions which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform the steps of the three-dimensional semantic occupation prediction method provided by the embodiment of the first aspect of the present application.
An embodiment of the fifth aspect of the present application provides a computer program product which, when executed by a processor of a mobile terminal, enables the mobile terminal to perform the steps of the three-dimensional semantic occupation prediction method provided by the embodiment of the first aspect of the present application.
The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects. In the encoding part, the application performs two-dimensional window attention computation both locally and globally, processing the three-dimensional grid features effectively with a small amount of computation to obtain multi-scale three-dimensional grid features. In the decoding part, by repeatedly applying the deformable attention mechanism, the query features interact with the multi-scale three-dimensional grid features produced by the encoder, so that the query features are updated and the final query features can be decoded into a three-dimensional semantic occupation result. The obtained result can completely restore the entire three-dimensional environment, and the accuracy of three-dimensional semantic occupation prediction is significantly improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart illustrating a three-dimensional semantic occupancy prediction method according to an example embodiment.
FIG. 2 is a block diagram illustrating a three-dimensional semantic occupancy prediction system according to an example embodiment.
FIG. 3 is a visual diagram illustrating a predicted three-dimensional semantic occupancy, according to an example embodiment.
Fig. 4 is an internal structural diagram of an electronic device, which is shown according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of systems and methods consistent with aspects of the application as detailed in the appended claims.
FIG. 1 is a flow chart illustrating a three-dimensional semantic occupancy prediction method, as shown in FIG. 1, according to an exemplary embodiment, including the steps of:
in step S101, a looking-around image is obtained, wherein the looking-around image includes color images at a plurality of viewing angles.
Specifically, a surround of cameras is generally arranged on the vehicle; by photographing the environment around the vehicle with these cameras, color images at a plurality of viewing angles can be obtained, that is, the looking-around image of the vehicle's surroundings is obtained.
In step S102, three-dimensional grid feature extraction is performed on the looking-around image, so as to obtain an initial three-dimensional grid feature.
Specifically, in order to obtain a three-dimensional semantic occupation result of the looking-around image, three-dimensional grid feature extraction is required to be performed on the looking-around image, so that the looking-around image can be converted into three-dimensional semantic occupation.
In some embodiments, extracting the three-dimensional grid feature from the looking-around image to obtain an initial three-dimensional grid feature includes:
inputting the looking-around image into an image encoder to obtain a plurality of visual feature maps;
inputting the visual feature maps into a view transformer to obtain context features and depth distribution features;
and obtaining the initial three-dimensional grid features by computing an outer product of the context features and the depth distribution features and then performing voxel pooling.
Specifically, the input looking-around image is passed through an image encoder to obtain visual feature maps, and the visual feature maps are input into the view transformer, which predicts context features (Context Feature) and a depth distribution (Depth Distribution). The outer product of the context features and the depth distribution, followed by voxel pooling (Voxel Pooling), yields three-dimensional grid features (3D Feature Volume) based on the multi-view visual features, namely the initial three-dimensional grid features, establishing a preliminary description of the three-dimensional environment.
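The lift step above can be sketched in a few lines. This is an illustrative numpy sketch, not the patent's actual code; all function names and tensor shapes are assumptions (a lift-splat-style layout with context features of shape (N_cams, H, W, C) and a per-pixel depth distribution of shape (N_cams, H, W, D)):

```python
import numpy as np

def lift(context, depth):
    # Outer product per pixel: place each pixel's context feature at every
    # candidate depth, weighted by its predicted depth probability.
    # context: (N, H, W, C), depth: (N, H, W, D) -> frustum: (N, H, W, D, C)
    return np.einsum('nhwc,nhwd->nhwdc', context, depth)

def voxel_pool(frustum, voxel_idx, n_voxels):
    # Simplified voxel pooling: scatter-add frustum features into a flat grid
    # using precomputed voxel indices, then average per cell.
    C = frustum.shape[-1]
    feats = frustum.reshape(-1, C)
    idx = voxel_idx.reshape(-1)
    grid = np.zeros((n_voxels, C))
    count = np.zeros((n_voxels, 1))
    np.add.at(grid, idx, feats)      # unbuffered scatter-add
    np.add.at(count, idx, 1.0)
    return grid / np.maximum(count, 1.0)
```

A real implementation would additionally map each frustum point to a voxel index via camera intrinsics/extrinsics; here `voxel_idx` is assumed precomputed.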
In step S103, the local range of the initial three-dimensional grid feature is divided, feature extraction is performed in the local range to obtain a three-dimensional feature, pooling processing is performed on the initial three-dimensional grid feature to obtain a planar feature, and a two-dimensional feature representing semantic layout information is obtained according to the planar feature.
Specifically, the initial three-dimensional grid features are processed by a dual-path Transformer three-dimensional encoder along local and global branches respectively, so as to obtain local three-dimensional features and global two-dimensional features, preparing for the subsequent acquisition of dynamic, efficient, multi-scale three-dimensional features. The dual-path Transformer three-dimensional encoder consists of several consecutive dual-path Transformer modules (Dual-path Transformer Block) and convolution modules (Conv Block).
In some embodiments, the dividing the local range of the initial three-dimensional grid feature, extracting features in the local range to obtain a three-dimensional feature, performing pooling processing on the initial three-dimensional grid feature to obtain a planar feature, and obtaining a two-dimensional feature representing semantic layout information according to the planar feature, including:
dividing the initial three-dimensional grid features in the horizontal direction into grids of a preset size, processing the grids based on a shared window attention mechanism, and extracting features within the local range to obtain three-dimensional features;
and average-pooling the initial three-dimensional grid features in the height direction to obtain planar features, processing the planar features with the shared window attention mechanism and an atrous spatial pyramid pooling (ASPP) module, and obtaining two-dimensional features representing semantic layout information from the planar features.
Specifically, in the local branch, the input initial three-dimensional grid features are divided in the horizontal direction into grids of size 7×7 (grids of a preset size; grids of different sizes can be chosen for different situations), which are processed with a shared window attention mechanism (Windowed Attention), and features are extracted within each local range to obtain the local three-dimensional features. In the global branch, the initial three-dimensional features are average-pooled in the height direction to obtain planar features under the bird's-eye view (BEV features); the planar features are processed by the parameter-shared window attention mechanism module and further by an atrous spatial pyramid pooling (ASPP) module commonly used in semantic segmentation, so that semantic layout information is extracted over the receptive field of the whole scene, yielding the global two-dimensional features.
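The two pre-processing operations of the branches can be sketched as follows. This is a hedged numpy illustration (names and shapes are assumptions; voxel features are taken as (X, Y, Z, C) with Z the height axis), showing only the height-pooling and window-partition steps, not the attention itself:

```python
import numpy as np

def bev_pool(voxel):
    # Global branch: average-pool along the height axis to get BEV plane features
    # voxel: (X, Y, Z, C) -> (X, Y, C)
    return voxel.mean(axis=2)

def window_partition(voxel, win=7):
    # Local branch: split the horizontal plane into win x win cells; the tokens
    # inside each window (all heights included) attend only to one another.
    X, Y, Z, C = voxel.shape
    assert X % win == 0 and Y % win == 0, "grid must be divisible by window size"
    v = voxel.reshape(X // win, win, Y // win, win, Z, C)
    v = v.transpose(0, 2, 1, 3, 4, 5)
    return v.reshape(-1, win * win * Z, C)   # (n_windows, tokens_per_window, C)
```

Windowed attention would then run per row of the partitioned output, which keeps the attention cost linear in the number of windows rather than quadratic in the full grid.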
In step S104, the three-dimensional feature and the two-dimensional feature are fused to obtain a multi-scale three-dimensional grid feature.
Specifically, by fusing the local three-dimensional features and the global two-dimensional features, multi-scale three-dimensional grid features are obtained, avoiding the prior-art problem that the three-dimensional features contain mixed information along three projection rays, so that a more accurate semantic occupation prediction result is generated.
In some embodiments, the fusing the three-dimensional feature and the two-dimensional feature results in a multi-scale three-dimensional grid feature, comprising:
and fusing the three-dimensional features and the two-dimensional features in a weighted summation mode to obtain multi-scale three-dimensional grid features, wherein the weighted summation process is guided by the weight in the height direction of the local features.
Specifically, the features extracted by the local branch and the global branch are fused by weighted summation. Since the local and global branches produce three-dimensional and two-dimensional features respectively, the global branch's features need to be replicated along the height dimension so that features of the same shape can be added. To fuse the local and global information better, weights along the height direction are predicted from the local branch's three-dimensional features to guide the summation.
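The height-guided fusion described above can be sketched as follows; this is a minimal numpy illustration under assumed shapes (local features (X, Y, Z, C), BEV features (X, Y, C), height logits (X, Y, Z) predicted from the local branch), not the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(local_3d, global_bev, height_logits):
    # Normalize the predicted height logits into per-column weights over Z
    w = softmax(height_logits, axis=2)[..., None]            # (X, Y, Z, 1)
    # Replicate the 2D BEV features along the height dimension ...
    bev_3d = np.broadcast_to(global_bev[:, :, None, :], local_3d.shape)
    # ... then add them in, gated by the height-direction weights
    return local_3d + w * bev_3d
```

The softmax guarantees the weights along each vertical column sum to one, so each voxel receives a share of the global BEV context proportional to its predicted relevance.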
In step S105, a three-dimensional semantic occupation result of the look-around image is predicted based on the multi-scale three-dimensional grid feature and a three-dimensional decoder based on a converter.
Specifically, after the multi-scale three-dimensional grid features are obtained in the mode, final three-dimensional semantic occupation result prediction is achieved on the multi-scale three-dimensional grid features through a decoder.
In some embodiments, predicting a three-dimensional semantic occupation result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder comprises:
based on a multi-scale deformable attention mechanism, having query features interact with the multi-scale three-dimensional grid features and updating the query features, so that as the number of iterations increases the query features attend more closely to the semantic information in the current scene, in order to output a three-dimensional semantic occupation result of the looking-around image, wherein the query features are a series of learnable feature parameters.
Specifically, the acquired multi-scale three-dimensional grid features are input into a Transformer-based three-dimensional decoder for decoding. The Transformer-based three-dimensional decoder mainly comprises two parts: (1) a multi-scale deformable attention mechanism (Multi-scale 3D Deformable Attention) for fusing the input voxel features (Voxel Features), and (2) a Transformer structure for iteratively updating the query features (Query Features). In the first part, because different scales place slightly different emphasis on local details versus overall semantics (e.g., lower-level features focus more on details and higher-level features focus more on semantics), interaction and fusion between the multi-scale three-dimensional grid features is necessary. The application adopts a deformable attention mechanism to realize this interaction: each input multi-scale three-dimensional grid feature predicts a series of sampling points and sampling weights, samples the corresponding feature information from the corresponding positions of the other three-dimensional features, and uses it to update its own features. Through this process, the three-dimensional features at each scale can contain both local and global information. In the second part, the final three-dimensional semantic occupation prediction is realized based on the query features and attention mechanisms. Specifically, the query features are a series of learnable feature parameters; during decoding and prediction, the query features are updated over multiple iterations, and in each iteration the following operations are performed in sequence:
(1) Interaction with the previously obtained multi-scale three-dimensional grid features through the multi-scale deformable attention mechanism.
(2) Interaction between different query features through an attention mechanism; since different query features typically focus on different categories of information, this step adequately exchanges background information across different semantics.
(3) The query features update themselves through a multi-layer perceptron, which can be understood as a general feature computation step. After each iteration, the query features produce category scores and the corresponding three-dimensional semantic occupation result through a linear layer, which are used for the next iteration or as the final output.
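The three per-iteration steps above can be sketched as a toy loop. This numpy stand-in shows only the iterative structure (learned projections, normalization layers, and the actual deformable sampling are omitted, and the dot-product attention here is a substitute for the real mechanisms); all names and shapes are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode(queries, voxel_feats, n_classes, n_iters=3, rng=None):
    # queries: (Q, C) learnable query features; voxel_feats: (V, C) flattened
    # multi-scale grid features; returns updated queries and class scores.
    rng = np.random.default_rng(0) if rng is None else rng
    W_cls = rng.standard_normal((queries.shape[1], n_classes))  # "linear layer"
    d = np.sqrt(queries.shape[1])
    for _ in range(n_iters):
        # (1) cross-attention from queries to the multi-scale grid features
        a = softmax(queries @ voxel_feats.T / d, axis=1)
        queries = queries + a @ voxel_feats
        # (2) self-attention among queries to exchange semantic context
        s = softmax(queries @ queries.T / d, axis=1)
        queries = queries + s @ queries
        # (3) MLP-style per-query feature update
        queries = queries + np.tanh(queries)
    return queries, queries @ W_cls
```

In the real decoder the class scores produced after each iteration would also condition the next iteration's masked attention; the toy omits that feedback.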
As shown in fig. 3, fig. 3 illustrates the visualization of three-dimensional semantic occupation prediction based on six looking-around images from the nuScenes dataset according to the present application. The upper-left of the figure shows the input images, the upper-right shows the rendering of the semantic occupation under each camera's viewing angle, and the lower part shows the semantic occupation results under two global viewing angles. The prediction results restore the entire three-dimensional environment well and recognize vehicles, pedestrians, drivable areas, trees, buildings, road barriers, and the like effectively.
In some embodiments, the multi-scale deformable attention mechanism is an attention mechanism with a mask, such that each query feature interacts only with its corresponding multi-scale three-dimensional grid features.
Specifically, the interaction through the multi-scale deformable attention mechanism in step (1) adopts a masked attention mechanism, that is, each query feature interacts only with its corresponding foreground region. For example, if a query feature tends to recognize vehicles, that query interacts only with the three-dimensional feature regions predicted to be vehicles, where the prediction usually comes from the previous iteration.
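The masking idea can be illustrated for a single query as follows; this is an assumption about the mechanism, not the patent's implementation (standard masked attention with excluded positions set to negative infinity before the softmax):

```python
import numpy as np

def masked_attention(query, feats, fg_mask):
    # query: (C,); feats: (V, C); fg_mask: boolean (V,), True = positions the
    # previous iteration predicted as this query's foreground category.
    scores = feats @ query / np.sqrt(query.shape[0])
    scores = np.where(fg_mask, scores, -np.inf)   # exclude background positions
    w = np.exp(scores - scores.max())             # softmax; exp(-inf) -> 0 weight
    w = w / w.sum()
    return w @ feats                              # attend only to foreground
```

Masked-out positions receive exactly zero attention weight, so the query's update is a convex combination of foreground features only.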
FIG. 2 is a block diagram of a three-dimensional semantic occupancy prediction system, shown according to an example embodiment. Referring to fig. 2, the system includes an acquisition module 201, a first extraction module 202, a second extraction module 203, a fusion module 204, and a prediction module 205.
An acquisition module 201, configured to acquire a looking-around image, where the looking-around image includes color images under multiple viewing angles;
a first extraction module 202, configured to perform three-dimensional grid feature extraction on the looking-around image to obtain initial three-dimensional grid features;
a second extraction module 203, configured to divide a local range of the initial three-dimensional grid features, perform feature extraction in the local range to obtain three-dimensional features, perform pooling processing on the initial three-dimensional grid features to obtain planar features, and obtain two-dimensional features representing semantic layout information from the planar features;
a fusion module 204, configured to fuse the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;
and a prediction module 205, configured to predict a three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
The specific manner in which the various modules perform their operations in the systems of the above embodiments has been described in detail in the embodiments of the method and will not be repeated here.
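To make the data flow between the first extraction module and the rest of the pipeline concrete, the lift step recited in claim 2 (the outer product of a per-pixel context feature with a per-pixel depth distribution) can be sketched in numpy. All shapes and names here are illustrative assumptions; the subsequent voxel pooling into the grid is omitted.

```python
import numpy as np

def lift(context, depth_prob):
    """Per-pixel outer product: (H, W, C) x (H, W, D) -> (H, W, D, C).
    Each image feature is spread over depth bins by its predicted
    depth distribution, forming a feature frustum to be voxel-pooled."""
    return np.einsum('hwd,hwc->hwdc', depth_prob, context)

H, W, C, D = 2, 3, 4, 5
rng = np.random.default_rng(2)
context = rng.standard_normal((H, W, C))
depth_prob = rng.random((H, W, D))
depth_prob /= depth_prob.sum(axis=-1, keepdims=True)  # valid distribution
frustum = lift(context, depth_prob)                   # (2, 3, 5, 4)
# Because each depth distribution sums to 1, summing over the depth
# axis recovers the original context feature for every pixel.
assert np.allclose(frustum.sum(axis=2), context)
```

This sanity check reflects the design intent of the lift: depth prediction only redistributes each pixel's feature along the ray, it does not change its total content.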
In one embodiment, an electronic device is provided, which may be a terminal; its internal structure may be as shown in fig. 4. The electronic device includes a processor, a memory, a communication interface, a display screen, and an input apparatus connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the electronic device is used for wired or wireless communication with an external terminal; the wireless mode may be realized through Wi-Fi, an operator network, Near Field Communication (NFC), or other technologies. The computer program, when executed by the processor, implements a three-dimensional semantic occupancy prediction method. The display screen of the electronic device may be a liquid crystal display or an electronic ink display, and the input apparatus may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the electronic device, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of a portion of the architecture relevant to the present application and does not limit the electronic device to which the present application is applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the three-dimensional semantic occupancy prediction system provided by the present application may be implemented in the form of a computer program that is executable on an electronic device as shown in fig. 4. The memory of the electronic device may store the various program modules that make up the three-dimensional semantic occupancy prediction system.
At least one instruction, at least one program, a code set, or an instruction set is stored in the memory of the electronic device, and is loaded and executed by the processor to implement the three-dimensional semantic occupancy prediction method according to any one of the embodiments, for example: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing a local range of the initial three-dimensional grid features, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid features to obtain planar features, and obtaining two-dimensional features representing semantic layout information from the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the steps of: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing a local range of the initial three-dimensional grid features, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid features to obtain planar features, and obtaining two-dimensional features representing semantic layout information from the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
In one embodiment, a computer program product is provided which, when executed by a processor of a mobile terminal, causes the mobile terminal to perform the steps of: obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles; extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features; dividing a local range of the initial three-dimensional grid features, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid features to obtain planar features, and obtaining two-dimensional features representing semantic layout information from the planar features; fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features; and predicting a three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
Those skilled in the art will appreciate that all or part of the processes of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer-readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. The volatile memory may include Random Access Memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, and all of these fall within the protection scope of the application. Accordingly, the scope of protection of the present application is determined by the appended claims.
Claims (10)
1. A three-dimensional semantic occupancy prediction method, comprising:
obtaining a looking-around image, wherein the looking-around image comprises color images at a plurality of viewing angles;
extracting three-dimensional grid features from the looking-around image to obtain initial three-dimensional grid features;
dividing a local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features;
fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;
and predicting a three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
2. The three-dimensional semantic occupancy prediction method of claim 1, wherein performing three-dimensional grid feature extraction on the looking-around image to obtain the initial three-dimensional grid features comprises:
inputting the looking-around image into an image encoder to obtain a plurality of visual feature maps;
inputting the visual feature maps into a view transformer to obtain a context feature and a depth distribution feature;
and obtaining the initial three-dimensional grid feature by calculating an outer product of the context feature and the depth distribution feature and further carrying out voxel pooling.
3. The method for predicting three-dimensional semantic occupation according to claim 1, wherein the steps of dividing the local range of the initial three-dimensional grid feature, extracting features in the local range to obtain three-dimensional features, pooling the initial three-dimensional grid feature to obtain planar features, and obtaining two-dimensional features representing semantic layout information according to the planar features comprise:
dividing the initial three-dimensional grid features into grids of a preset size in the horizontal direction, processing the grids based on a shared window attention mechanism, and performing feature extraction within the local range to obtain the three-dimensional features;
and performing average pooling on the initial three-dimensional grid features in the height direction to obtain the planar features, processing the planar features through the shared window attention mechanism and an atrous spatial pyramid pooling (ASPP) module, and obtaining the two-dimensional features representing semantic layout information from the planar features.
4. The method of claim 1, wherein the fusing the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features comprises:
and fusing the three-dimensional features and the two-dimensional features by weighted summation to obtain the multi-scale three-dimensional grid features, wherein the weighted summation process is guided by the height-direction weights of the local features.
5. The three-dimensional semantic occupancy prediction method of claim 1, wherein predicting the three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder comprises:
interacting query features with the multi-scale three-dimensional grid features based on a multi-scale deformable attention mechanism and updating the query features, so that as the number of iterations increases, the query features attend more closely to the semantic information in the current scene, thereby outputting the three-dimensional semantic occupancy result of the looking-around image, wherein the query features are a set of learnable feature parameters.
6. The three-dimensional semantic occupancy prediction method of claim 5, wherein the multi-scale deformable attention mechanism is a masked attention mechanism such that each of the query features interacts only with its corresponding multi-scale three-dimensional grid features.
7. A three-dimensional semantic occupancy prediction system, comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring a looking-around image, and the looking-around image comprises color images under a plurality of visual angles;
the first extraction module is used for extracting three-dimensional grid features of the looking-around image to obtain initial three-dimensional grid features;
the second extraction module is used for dividing the local range of the initial three-dimensional grid characteristics, extracting the characteristics in the local range to obtain three-dimensional characteristics, carrying out pooling treatment on the initial three-dimensional grid characteristics to obtain planar characteristics, and obtaining two-dimensional characteristics representing semantic layout information according to the planar characteristics;
the fusion module, configured to fuse the three-dimensional features and the two-dimensional features to obtain multi-scale three-dimensional grid features;
and the prediction module, configured to predict a three-dimensional semantic occupancy result of the looking-around image based on the multi-scale three-dimensional grid features and a Transformer-based three-dimensional decoder.
8. An electronic device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, code set, or instruction set, the instruction, program, code set, or instruction set being loaded and executed by the processor to implement the three-dimensional semantic occupancy prediction method of any one of claims 1-6.
9. A non-transitory computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the three-dimensional semantic occupancy prediction method according to any one of claims 1-6.
10. A computer program product, characterized in that instructions in the computer program product, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the three-dimensional semantic occupancy prediction method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310316950.4A CN116630912A (en) | 2023-03-24 | 2023-03-24 | Three-dimensional semantic occupation prediction method, system, equipment, medium and product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116630912A true CN116630912A (en) | 2023-08-22 |
Family
ID=87625466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310316950.4A Pending CN116630912A (en) | 2023-03-24 | 2023-03-24 | Three-dimensional semantic occupation prediction method, system, equipment, medium and product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630912A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422629A (en) * | 2023-12-19 | 2024-01-19 | 华南理工大学 | Instance-aware monocular semantic scene completion method, medium and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||