CN115131787A - Multi-view input aerial view semantic segmentation method and device - Google Patents
Multi-view input aerial view semantic segmentation method and device
- Publication number: CN115131787A
- Application number: CN202210813898.9A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06V 20/70 — Scenes; scene-specific elements; labelling scene content, e.g. deriving syntactic or semantic representations
- G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
- G06V 10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level; of extracted features
- G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
Abstract
The invention provides a multi-view-input bird's-eye view semantic segmentation method and device. The method comprises: acquiring multiple images of multiple viewing angles captured by one or more cameras at the same vehicle position, and unifying the image formats of the images so as to extract their image features; reconstructing the images based on view information and serially inputting the reconstructed multi-view image information into a Transformer encoder, which applies content and position encoding to the image features and a bird's-eye view grid and outputs the encoded image information to a Transformer decoder, where the Transformer encoder and the Transformer decoder each comprise a Norm module, an FFN module and a cross attention module; fusing the image features in the Transformer decoder and projecting the fusion onto the bird's-eye view grid so that each grid cell contains a fused feature value combining the image features of the multiple viewing angles; and performing category discrimination on the fused feature values in the bird's-eye view grid based on a cross attention mechanism so as to perform semantic segmentation.
Description
Technical Field
The invention relates to the field of autonomous-driving perception, and in particular to a multi-view-input bird's-eye view semantic segmentation method and device.
Background
The main objective of current autonomous-driving perception is to obtain semantic representations from multiple sensors, fuse this semantic information, and project it into a bird's-eye view coordinate system so that it can be supplied to planning and control.
In the prior art, however, approaches based on visual image processing first detect objects or semantic information in the image coordinate system, then fuse the detection results or semantic information obtained from multiple viewing angles, and finally convert them into the bird's-eye view coordinate system using a coordinate-system transformation or a ground-plane assumption. This process has several shortcomings. First, planning and control ultimately need target objects in the bird's-eye view, but a neural network that only produces targets in the image view is not end-to-end at the perception level, so the conversion between the two views cannot be fully learned from data. Second, some schemes perform multi-view fusion at the level of detection results, which cannot exploit the shared characteristics of overlapping spatial features across views and therefore handles fusion of the same object poorly. In addition, the transformation from image to bird's-eye view is usually based on depth estimation or geometric reasoning, which cannot cope with occlusion or with scenes where the geometric assumptions do not hold.
In order to overcome the above shortcomings of the prior art, there is a need in the art for a multi-view-input bird's-eye view semantic segmentation method and device that realize end-to-end multi-view fusion at the feature level and output a bird's-eye view semantic segmentation result.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In order to overcome the defects in the prior art, the invention provides a multi-view-input bird's-eye view semantic segmentation method, comprising the following steps: acquiring multiple images of multiple viewing angles captured by one or more cameras at the same vehicle position, and unifying the image formats of the images so as to extract their image features; reconstructing the multiple images based on view information and serially inputting the reconstructed multi-view image information into a Transformer encoder, which applies content and position encoding to the image features and the bird's-eye view grid and outputs the encoded image information to a Transformer decoder, wherein the Transformer encoder and the Transformer decoder each comprise a Norm module, an FFN module and a cross attention module; fusing the image features in the Transformer decoder and projecting the fusion onto the bird's-eye view grid so that each grid cell contains a fused feature value combining the image features of the multiple viewing angles; and performing category discrimination on the fused feature values in the bird's-eye view grid based on a cross attention mechanism so as to perform semantic segmentation.
In an embodiment, performing category discrimination on the fused feature values in the bird's-eye view grid based on the cross attention mechanism to perform semantic segmentation includes performing category discrimination with a softmax function:

softmax(QK^T / √d_k) · V,

where Q is the feature category to be learned, K and V are the feature values of the multi-view images after feature fusion, and d_k denotes the dimension size.
In an embodiment, unifying the image formats of the multiple images to extract their image features includes: converting each image of each view into the standard format [B, C, H, W], where B denotes the batch size, C the image channels, H the image height and W the image width.
In an embodiment, the image channels are the three RGB channels.
In an embodiment, reconstructing the multiple images based on the view information includes: reconstructing the format-unified images of the multiple viewing angles into the format [B, N, C, H, W], where N indexes the viewing angle.
In an embodiment, the Transformer encoder applies content and position encoding to the image features and the bird's-eye view grid and outputs the result to the Transformer decoder; specifically, the encoder encodes the content and positions of the multi-view images and the bird's-eye view grid and still outputs image information in the format [B, N, C, H, W] to the Transformer decoder.
In one embodiment, the Transformer decoder fuses the image features and projects them onto the bird's-eye view grid so that each grid cell contains a fused feature value combining the image features of the multiple viewing angles; specifically, by querying the content and position codes of the multi-view images and of the bird's-eye view grid, the decoder fuses the image features that fall at the same position in the bird's-eye view grid and outputs image information in the format [T, M, C], where T denotes the height of the bird's-eye view grid, M its width and C the fused feature value of a grid cell.
In an embodiment, the Transformer encoder applies content and position encoding to the multiple images of the multiple viewing angles and to the bird's-eye view grid, including position-encoding the multi-view pictures with the sinusoidal formula

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),

where pos represents a coordinate in the image, 2i and 2i+1 index the dimensions of the position code, and d is the encoding dimension; the position codes of the multiple views are obtained by translation from a single view, with pos(N) = N × pos(1) for the N-th view.
In one embodiment, if the multiple images are two-dimensional images, the multi-view images are position-encoded according to a corresponding two-dimensional form of the position code.
the invention also provides a multi-view-angle input aerial view semantic segmentation device, which comprises: a memory; and a processor coupled to the memory, the processor configured to perform the steps of the method for bird's eye view semantic segmentation of multi-view input described in any of the above.
The present invention also provides a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the multi-view-input bird's-eye view semantic segmentation methods described above.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
Fig. 1 is a schematic diagram of a model structure for performing a bird's-eye view semantic segmentation method of multi-view input according to an embodiment of the invention;
FIG. 2 is a flow chart of the multi-view-input bird's-eye view semantic segmentation method according to an aspect of the present invention;
Fig. 3 is a schematic diagram illustrating the operation principle of the Transformer encoder and the Transformer decoder in the multi-view-input bird's-eye view semantic segmentation method according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of the Transformer encoder and Transformer decoder models in the multi-view-input bird's-eye view semantic segmentation method according to an embodiment of the invention; and
fig. 5 is a schematic device structure diagram of the multi-view input bird's-eye view semantic segmentation device according to another aspect of the present invention.
For clarity, a brief description of the reference numerals is given below:
101 backbone module
102 Transformer module
103 FFN convolution module
104 bird's-eye view grid
301 Transformer encoder
302 Transformer decoder
303 backbone module
401 Transformer encoder
4011 Norm module
4012 FFN module
4013 cross attention module
402 Transformer decoder
4021 Norm module
4022 FFN module
4023 cross attention module
Detailed Description
The following description of the embodiments of the present invention is provided for illustrative purposes, and other advantages and effects of the present invention will become apparent to those skilled in the art from the present disclosure. While the invention will be described in connection with the preferred embodiments, there is no intent to limit its features to those embodiments. On the contrary, the invention is described in connection with the embodiments for the purpose of covering alternatives or modifications that may be extended based on the claims of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be practiced without these specific details. Moreover, some specific details are omitted from the description in order to avoid confusing or obscuring the focus of the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly and may mean, for example, fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected, indirectly connected through an intermediate medium, or in internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Additionally, the terms "upper," "lower," "left," "right," "top," "bottom," "horizontal," "vertical" and the like used in the following description are to be understood as referring to the orientation shown in the associated drawings. These relative terms are used for convenience of description and do not imply that the described apparatus must be constructed or operated in that specific orientation, and therefore should not be construed as limiting the invention.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, regions, layers and/or sections, these elements, regions, layers and/or sections should not be limited by these terms, but rather are used to distinguish one element, region, layer and/or section from another element, region, layer and/or section. Thus, a first component, region, layer or section discussed below could be termed a second component, region, layer or section without departing from some embodiments of the present invention.
In order to overcome the defects in the prior art, the invention provides a multi-view-input bird's-eye view semantic segmentation method and device that realize end-to-end multi-view fusion at the feature level and output a bird's-eye view semantic segmentation result. Fusing similar features of the same object at the feature level is thereby handled well, the transformation to the bird's-eye view no longer depends on camera extrinsics or a ground-plane assumption, the geometric extrinsics between the multi-view cameras are learned through feature-level fusion, and the method is robust to occlusion and widely applicable.
Fig. 1 is a schematic diagram of the model structure for performing the multi-view-input bird's-eye view semantic segmentation method according to an embodiment of the invention.
As shown in fig. 1, the overall model for performing the multi-view-input bird's-eye view semantic segmentation method of the present invention can be mainly divided into a backbone module 101, a Transformer module 102 and an FFN convolution module 103.
The backbone module 101 extracts the features of each view from the stacked multi-view images and obtains a stacked feature map per view. These image-view feature maps are fed to the Transformer module 102 for content encoding, while the rasterized grid map, i.e. the bird's-eye view grid 104, is supplied for position encoding. The Transformer module 102 queries each grid position of the bird's-eye view to find the corresponding image positions and features, and fuses the image features that map to the same bird's-eye view grid position to obtain the bird's-eye view (BEV) feature. Finally, the FFN convolution module 103 further maps the BEV features to obtain the fused bird's-eye view features, and category discrimination through a softmax layer completes the semantic segmentation.
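For illustration only, the overall forward pass described above can be sketched as follows. The sketch assumes a PyTorch implementation with illustrative module choices, layer counts and grid size; none of these are prescribed by the patent.

```python
import torch
import torch.nn as nn

class BEVSegmentationModel(nn.Module):
    """Illustrative structure: backbone 101 -> Transformer 102 -> FFN head 103 -> softmax."""
    def __init__(self, d_model=256, num_classes=4, bev_h=32, bev_w=32):
        super().__init__()
        # Backbone module 101: a shared per-view feature extractor (network choice assumed).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
        )
        # Transformer module 102: encodes image tokens, decodes onto BEV grid queries.
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=1, num_decoder_layers=1,
                                          batch_first=True)
        # Bird's-eye view grid 104: one learnable query per grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, d_model))
        # FFN head 103 + softmax for per-cell category discrimination.
        self.head = nn.Linear(d_model, num_classes)
        self.bev_h, self.bev_w = bev_h, bev_w

    def forward(self, images):                        # images: [B, N, 3, H, W]
        B, N = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1))   # [B*N, d, h, w]
        d, h, w = feats.shape[1:]
        tokens = feats.view(B, N, d, h, w).permute(0, 1, 3, 4, 2).reshape(B, N * h * w, d)
        queries = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev = self.transformer(tokens, queries)       # fused BEV features, [B, T*M, d]
        logits = self.head(bev).view(B, self.bev_h, self.bev_w, -1)
        return logits.softmax(dim=-1)                 # per-cell class probabilities
```

For example, under these assumed sizes `BEVSegmentationModel()(torch.rand(1, 6, 3, 64, 176))` returns a [1, 32, 32, 4] tensor of per-cell class probabilities.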
Through the cooperation of the modules, the images obtained from multiple viewing angles can be projected onto the bird's-eye view grid 104, so that the bird's-eye view semantic segmentation of multi-viewing-angle input is realized. The function of each module is explained below with reference to the specific steps of the method.
Fig. 2 is a flow chart of the multi-view-input bird's-eye view semantic segmentation method according to an aspect of the invention.
Referring to fig. 2, the multi-view-input bird's-eye view semantic segmentation method 200 according to the present invention may include:
step 201: the method comprises the steps of acquiring a plurality of images of a plurality of visual angles shot by one or a plurality of camera devices at the same vehicle position, and unifying the image formats of the plurality of images to extract the image characteristics of the plurality of images.
Step 201 is mainly executed by the backbone module 101 in fig. 1, and extracts an image feature of each viewing angle from a stack of multiple viewing angle images, thereby obtaining a feature map of each viewing angle after stacking.
In an embodiment, the unifying the image formats of the plurality of images to extract the image features of the plurality of images may include: each image of each view is converted into a [ B, C, H, W ] standard format, where B denotes the batch size, C denotes the image channel, H denotes the image height, and W denotes the image width.
More specifically, the image channels C may be, for example, the three RGB channels of the picture.
It is to be understood that the standard format of the multi-view image and the choice of image channels are only exemplary, intended to better illustrate how the present invention processes multi-view images, and are not intended to limit its scope. In practical applications the required picture format and choice of image channels can be adjusted as needed, and similar methods all fall within the protection scope of the invention.
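As a concrete illustration of this step (the target resolution and the use of PyTorch are assumptions, not requirements of the patent), unifying per-view RGB images into the [B, C, H, W] standard format could look like the following sketch.

```python
import torch

def unify_image_format(view_images):
    """Bring per-view images taken at the same vehicle position into a common [B, C, H, W] format.

    view_images: list of N tensors, each [B, 3, H, W] (RGB), one per camera view.
    Returns a list of standardized tensors, one per view.
    """
    target_h, target_w = 256, 704          # assumed common resolution
    unified = []
    for img in view_images:
        # Resize so every view shares the same H and W before feature extraction.
        img = torch.nn.functional.interpolate(
            img.float(), size=(target_h, target_w), mode="bilinear", align_corners=False
        )
        unified.append(img)                 # [B, 3, target_h, target_w]
    return unified
```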
Referring to fig. 2, the multi-view-input bird's-eye view semantic segmentation method 200 according to the present invention may further include:
step 202: reconstructing the plurality of images based on the view information, serially inputting the reconstructed image information of the plurality of views into a transform encoder, wherein the transform encoder performs content and position coding on the image features and the bird's-eye view grid and then outputs the image information to a transform decoder, and the transform encoder and the transform decoder respectively comprise a Norm module, an FFN module and a cross attention module.
The Transformer module can be divided into a Transformer encoder and a Transformer decoder, as illustrated in fig. 3.
Fig. 3 is a schematic diagram illustrating the operation principle of the Transformer encoder and the Transformer decoder in the multi-view-input bird's-eye view semantic segmentation method according to an embodiment of the present invention.
As shown in fig. 3, the backbone module 303 combines the picture features of the different viewing angles, concatenates them serially into a feature sequence, reconstructs the image information based on the view information, and serially inputs the reconstructed image information to the Transformer encoder 301.
In an embodiment, reconstructing the plurality of images based on the view information may include: reconstructing the format-unified images of the multiple viewing angles into the [B, N, C, H, W] format, where N represents the N-th viewing angle.
It should be understood that the reconstructed image format is likewise only exemplary and not intended to limit the scope of the present invention; the format of the reconstructed image information may be adjusted as needed in practical applications.
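A minimal sketch of this view-based reconstruction and serial input (tensor shapes are illustrative assumptions) is given below; the operation is essentially a stack followed by a reshape into a token sequence.

```python
import torch

def reconstruct_by_view(unified_views):
    """Reconstruct per-view feature maps into a single [B, N, C, H, W] tensor,
    then serialize it into the token sequence fed to the Transformer encoder.

    unified_views: list of N tensors, each [B, C, H, W] in the unified format.
    """
    stacked = torch.stack(unified_views, dim=1)          # [B, N, C, H, W]
    B, N, C, H, W = stacked.shape
    # Serial input: every pixel of every view becomes one token of dimension C.
    tokens = stacked.permute(0, 1, 3, 4, 2).reshape(B, N * H * W, C)
    return stacked, tokens
```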
The Transformer encoder 301 applies content and position encoding to the image features and the bird's-eye view grid based on the reconstructed image information, as illustrated in fig. 4.
Fig. 4 is a schematic structural diagram of the Transformer encoder and Transformer decoder models in the multi-view-input bird's-eye view semantic segmentation method according to an embodiment of the invention.
As shown in fig. 4, the Transformer encoder 401 may mainly include a Norm module 4011, an FFN module 4012, and a cross attention module 4013. The Norm module 4011 is responsible for normalization or standardization and computes the mean and standard deviation of an input vector, for example over the different dimensions of the same feature of the same sample. The FFN module 4012 computes feature values with a feed-forward network. The cross attention module 4013, which may also be called a multi-head attention module, processes its input with a multi-head attention mechanism. Through the cooperation of these modules, the image features and position information of the multi-view images are given positional encoding, and the processed result is output to the Transformer decoder 402.
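As a hedged illustration of how these three sub-modules could be combined, the following sketch shows one encoder layer in PyTorch; the pre-norm ordering, dimensions and head count are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer built from the three modules named in the patent:
    a Norm module (LayerNorm), a cross/multi-head attention module, and an FFN module."""
    def __init__(self, d_model=256, n_heads=8, d_ffn=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                                      # Norm module 4011
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # attention module 4013
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                                               # FFN module 4012
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )

    def forward(self, x, pos):
        # Content encoding plus position encoding, as described for the encoder.
        q = k = self.norm1(x + pos)
        x = x + self.attn(q, k, value=x)[0]
        x = x + self.ffn(self.norm2(x))
        return x
```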
In one embodiment, the Transformer encoder 301 position-encodes the multiple pictures of the multiple viewing angles using the sinusoidal formula

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),

where pos represents a coordinate in the image, 2i and 2i+1 index the dimensions of the position code, and d is the encoding dimension; the position codes of the multiple views are obtained by translation from a single view, with pos(N) = N × pos(1) for the N-th view.
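A minimal sketch of this sinusoidal position code is shown below; the encoding dimension is an assumed value, and the per-view relation pos(N) = N × pos(1) is applied by scaling the coordinates of the first view.

```python
import torch

def positional_encoding(positions, d_model=256, view_index=1):
    """Sinusoidal position code: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).

    positions:  1-D tensor of coordinates in the image.
    view_index: the N-th view; following the patent, pos(N) = N * pos(1), so the
                coordinates of the first view are scaled for later views.
    """
    pos = positions.float().unsqueeze(1) * view_index          # [L, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)       # even dimensions 2i
    div = torch.pow(10000.0, i / d_model)                      # 10000^(2i / d_model)
    pe = torch.zeros(positions.shape[0], d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe
```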
Meanwhile, if the images are two-dimensional, the multi-view images are position-encoded with a corresponding two-dimensional form of the position code.
it should be noted that, in conjunction with fig. 3, the Transformer encoder 301 may still output the image information in the format of [ B, nc × H × W ] to the Transformer decoder 302 after processing the image information.
Referring to fig. 2, the bird's-eye view semantic segmentation method 200 with multi-view input provided by the present invention may further include:
step 203: the transform decoder performs fusion of the image features, and projects the fusion to the bird's-eye view meshes so that each of the bird's-eye view meshes includes a fusion feature value of a plurality of image features fused with the plurality of viewing angles.
Also in connection with fig. 4, in this embodiment, similar to the Transformer encoder 401, the Transformer decoder 402 may include a plurality of Norm modules 4021, FFN modules 4022, and cross attention modules 4023, although the configuration of cooperation between the various sub-modules in the Transformer decoder 402 is more complex.
In one embodiment, through cooperation of these sub-modules, the Transformer decoder queries the content and position codes of the multi-view images and of the bird's-eye view grid, fuses the image features that fall at the same position in the bird's-eye view grid, and outputs image information in the format [T, M, C], where T denotes the height of the bird's-eye view grid, M its width and C the fused feature value of a grid cell.
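A hedged sketch of this decoder-side fusion follows: each bird's-eye view grid cell carries a learnable query that cross-attends to the encoded multi-view tokens, and the attention output becomes that cell's fused feature value. Grid size, dimensions and the single attention layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BEVGridDecoder(nn.Module):
    """Fuse encoded multi-view image tokens onto a T x M bird's-eye-view grid."""
    def __init__(self, bev_h=32, bev_w=32, d_model=256, n_heads=8):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # One content + position query per BEV grid cell (bird's-eye view grid 104).
        self.grid_queries = nn.Parameter(torch.randn(bev_h * bev_w, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, encoded_tokens):
        """encoded_tokens: [B, N*H*W, d_model] output of the Transformer encoder.
        Returns fused BEV features of shape [B, T, M, d_model]."""
        B = encoded_tokens.shape[0]
        q = self.grid_queries.unsqueeze(0).expand(B, -1, -1)          # [B, T*M, d]
        fused, _ = self.cross_attn(q, encoded_tokens, encoded_tokens) # fuse views per cell
        fused = self.norm(q + fused)
        return fused.view(B, self.bev_h, self.bev_w, -1)              # [B, T, M, C]
```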
Referring to fig. 2, the bird's-eye view semantic segmentation method 200 with multi-view input provided by the present invention may further include:
step 204: and performing category discrimination on the fused characteristic value in the aerial view grid based on a cross attention mechanism to perform semantic segmentation.
In an embodiment, performing category discrimination on the fused feature values in the bird's-eye view grid based on a cross attention mechanism to perform semantic segmentation may include performing category discrimination with the softmax function

softmax(QK^T / √d_k) · V,

where K and V are the feature values of the multi-view images after the encoder has fused the features, d_k denotes the dimension size, and Q is the feature category to be learned.
It can be understood that after the multi-view image features are fused, the image-view features near the corresponding position in the bird's-eye view carry relatively large weights, and QK^T computes this similarity. By fusing the multi-view image features with high similarity, the features at the same position of the bird's-eye view are merged, finally yielding the projection of the multi-view fused features onto the BEV. Obtaining the fused features is precisely the conversion from the image view to the BEV view, which completes the multi-view-input bird's-eye view semantic segmentation.
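In code, this final category discrimination can be sketched as follows; Q plays the role of the learnable class features and K, V come from the fused bird's-eye view features. The helper names and the number of classes are illustrative assumptions.

```python
import torch

def attention_classification(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q the learnable class features and
    K, V the fused multi-view feature values of the BEV grid cells.

    Q: [num_classes, d_k]    K, V: [T*M, d_k]
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 1) / (d_k ** 0.5)     # [num_classes, T*M] similarities
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                # [num_classes, d_k]

def per_cell_classes(Q, fused_bev):
    """Per-cell class probabilities from the same Q/feature similarity.
    fused_bev: [T, M, d_k] -> returns [T, M, num_classes]."""
    T, M, d = fused_bev.shape
    scores = fused_bev.view(T * M, d) @ Q.transpose(0, 1) / (d ** 0.5)
    return torch.softmax(scores, dim=-1).view(T, M, -1)
```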
Compared with semantic segmentation schemes that rely on laser point clouds, the multi-view-input bird's-eye view semantic segmentation method provided by the invention takes image input from multi-view cameras, requires no complicated lidar equipment and therefore reduces cost. Meanwhile, the semantic features of the BEV grid are learned by the Transformer module itself: end-to-end multi-view fusion is realized at the feature level, the conversion between the multiple views can be learned, the transformation to the bird's-eye view no longer depends on camera extrinsics or a ground-plane assumption, and the method is robust to occlusion and similar situations.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Fig. 5 is a schematic device structure diagram of the multi-view input bird's-eye view semantic segmentation device according to another aspect of the present invention.
As shown in fig. 5, another aspect of the present invention further provides a bird's-eye view semantic segmentation apparatus 500 with multi-view input, including: a memory 501; and a processor 502 coupled to the memory 501, the processor 502 configured to perform the steps of the method for segmenting bird's eye view semantics of multi-view input described in any one of the above.
According to another aspect of the present invention, there is also provided herein an embodiment of a computer storage medium.
The computer storage medium has a computer program stored thereon. When executed by a processor, the computer program can implement the steps of any of the multi-view-input bird's-eye view semantic segmentation methods described above.
Those of skill in the art would understand that information, signals, and data may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits (bits), symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The processors described herein may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented with a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a state machine, gated logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described throughout this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented in software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A bird's-eye view semantic segmentation method based on multi-view input comprises the following steps:
acquiring a plurality of images of a plurality of visual angles shot by one or a plurality of camera devices at the same vehicle position, and unifying the image formats of the plurality of images to extract the image characteristics of the plurality of images;
reconstructing the multiple images based on view information, and serially inputting the reconstructed multi-view image information into a Transformer encoder, wherein the Transformer encoder applies content and position encoding to the image features and the bird's-eye view grid and then outputs them to a Transformer decoder, and wherein the Transformer encoder and the Transformer decoder each comprise a Norm module, an FFN module and a cross attention module;
the Transformer decoder fuses the image features and projects the fusion onto the bird's-eye view grid so that each cell of the bird's-eye view grid contains a fused feature value combining the image features of the plurality of viewing angles; and
performing category discrimination on the fused feature values in the bird's-eye view grid based on a cross attention mechanism to perform semantic segmentation.
2. The bird's-eye view semantic segmentation method of claim 1, wherein performing category discrimination on the fused feature values in the bird's-eye view grid based on a cross attention mechanism to perform semantic segmentation comprises:
performing category discrimination with a softmax function to execute semantic segmentation:
softmax(QK^T / √d_k) · V,
wherein Q is the feature category to be learned, K and V are the feature values of the images of the multiple viewing angles after feature fusion, and d_k represents the dimension size.
3. The bird's-eye view semantic segmentation method of claim 1, wherein unifying the image formats of the plurality of images to extract the image features of the plurality of images comprises:
each image of each view is converted into a [ B, C, H, W ] standard format, where B denotes the batch size, C denotes the image channel, H denotes the image height, and W denotes the image width.
4. The bird's-eye view semantic segmentation method of claim 3, wherein the image channels are the three RGB channels.
5. The bird's eye view semantic segmentation method of claim 3, wherein the reconstructing the plurality of images based on the perspective information comprises:
reconstructing the format-unified plurality of images of the plurality of viewing angles into a [B, N, C, H, W] format, wherein N represents the N-th viewing angle.
6. The bird's-eye view semantic segmentation method of claim 5, wherein the Transformer encoder applies content and position encoding to the image features and the bird's-eye view grid and outputs the result to the Transformer decoder, comprising:
the Transformer encoder encoding the content and positions of the plurality of images of the plurality of viewing angles and the bird's-eye view grid, and still outputting image information in the format [B, N, C, H, W] to the Transformer decoder.
7. The bird's-eye view semantic segmentation method of claim 6, wherein the Transformer decoder fuses the image features and projects the fusion onto the bird's-eye view grid such that each cell of the grid includes a fused feature value combining the image features of the plurality of viewing angles, comprising:
the Transformer decoder performing feature fusion on the image features at the same position in the bird's-eye view grid by querying the content and position codes of the plurality of images of the plurality of viewing angles and of the bird's-eye view grid; and
outputting image information in a [T, M, C] format, wherein T represents the height of the bird's-eye view grid, M represents the width of the bird's-eye view grid, and C represents the fused feature value of a grid cell.
8. The bird's-eye view semantic segmentation method of claim 6, wherein the Transformer encoder applies content and position encoding to the plurality of images of the plurality of viewing angles and to the bird's-eye view grid, comprising:
position-encoding the plurality of pictures of the plurality of viewing angles using the formula
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),
wherein pos represents a coordinate in the image, and 2i and 2i+1 represent the dimensions of the position code; and
the position codes of the plurality of views being obtained by translation from a single view, with pos(N) = N × pos(1) for the N-th view.
10. A multi-view-input bird's-eye view semantic segmentation device, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the steps of the multi-view input bird's eye view semantic segmentation method of any of claims 1-9.
11. A computer readable medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the multi-view input bird's eye view semantic segmentation method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210813898.9A CN115131787A (en) | 2022-07-12 | 2022-07-12 | Multi-view input aerial view semantic segmentation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210813898.9A CN115131787A (en) | 2022-07-12 | 2022-07-12 | Multi-view input aerial view semantic segmentation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115131787A true CN115131787A (en) | 2022-09-30 |
Family
ID=83383284
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210813898.9A Pending CN115131787A (en) | 2022-07-12 | 2022-07-12 | Multi-view input aerial view semantic segmentation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115131787A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115760886A (en) * | 2022-11-15 | 2023-03-07 | 中国平安财产保险股份有限公司 | Plot partitioning method and device based on aerial view of unmanned aerial vehicle and related equipment |
CN115760886B (en) * | 2022-11-15 | 2024-04-05 | 中国平安财产保险股份有限公司 | Land parcel dividing method and device based on unmanned aerial vehicle aerial view and related equipment |
WO2024174728A1 (en) * | 2023-02-21 | 2024-08-29 | 合众新能源汽车股份有限公司 | Method and system for detecting obstacle in multi-angle aerial view on basis of virtual camera |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: 314500 988 Tong Tong Road, Wu Tong Street, Tongxiang, Jiaxing, Zhejiang; Applicant after: United New Energy Automobile Co.,Ltd.; Address before: 314500 988 Tong Tong Road, Wu Tong Street, Tongxiang, Jiaxing, Zhejiang; Applicant before: Hozon New Energy Automobile Co., Ltd.