CN115909255A - Image generation method, image segmentation method, image generation device, image segmentation device, vehicle-mounted terminal and medium
- Publication number: CN115909255A
- Application number: CN202310010749.3A
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention provides an image generation method, an image segmentation method, an image generation device, an image segmentation device, a vehicle-mounted terminal and a medium, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and can be applied to scenes such as automatic driving and smart cities. The specific implementation scheme is as follows: generating, from surround-view images collected at multiple view angles, a multi-scale feature map corresponding to each view angle; performing cross-view conversion on the first feature map at each view angle to generate a first top view; sampling in the second feature maps at the at least one second resolution at each view angle to obtain sampling-point features of each grid cell in the first top view at the at least one second resolution; and fusing the first top view with the sampling-point features to obtain a second top view. According to this technical scheme, a high-precision top view can be generated at low calculation cost.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of computer vision, image processing, deep learning and the like, can be applied to scenes such as automatic driving and smart cities, and specifically relates to an image generation method, an image segmentation method, an image generation device, an image segmentation device, an electronic device, a vehicle-mounted terminal, and a non-transitory computer-readable storage medium.
Background
The perception and recognition task in autonomous driving is essentially a three-dimensional geometric reconstruction of the physical world. As the sensors mounted on autonomous vehicles grow in diversity and number, it becomes critical to characterize the observations from different view angles in a single unified view.
Bird's Eye View (BEV), also known as top View, is becoming more and more widely used in the field of perception and prediction for autonomous driving as a natural and direct unified representation.
In the related art, obtaining a high-resolution bird's eye view requires an expensive calculation cost and cannot meet the dual requirements of low cost and real-time perception.
Disclosure of Invention
The present disclosure provides an image generation method, an image segmentation method, an image generation apparatus, an image segmentation apparatus, an electronic device, a vehicle-mounted terminal, and a non-transitory computer-readable storage medium.
According to an aspect of the present disclosure, there is provided an image generation method including:
generating, from surround-view images collected at multiple view angles, a multi-scale feature map corresponding to each view angle, wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution;
performing cross-view conversion on the first feature map at each view angle to generate a first top view;
sampling in the at least one second feature map at the at least one second resolution at each view angle to obtain sampling-point features of each grid cell in the first top view at the at least one second resolution;
and fusing the first top view with the sampling-point features to obtain a second top view.
According to another aspect of the present disclosure, there is provided an image segmentation method including:
collecting a plurality of surround-view images at multiple view angles through a plurality of surround-view cameras;
fusing the plurality of surround-view images to obtain a second top view by the image generation method according to any embodiment of the present disclosure;
and performing semantic segmentation on the second top view to obtain a category identification result for each grid cell in the second top view.
According to another aspect of the present disclosure, there is provided an image generating apparatus including:
a multi-scale feature map acquisition module, configured to generate a multi-scale feature map corresponding to each view angle from a plurality of surround-view images collected at multiple view angles, wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution;
a first top view generation module, configured to perform cross-view conversion on the first feature map at each view angle to generate a first top view;
a sampling module, configured to sample in the at least one second feature map at the at least one second resolution at each view angle to obtain sampling-point features of each grid cell in the first top view at the at least one second resolution;
and a second top view fusion module, configured to obtain a second top view by fusion according to the first top view and the sampling-point features.
According to another aspect of the present disclosure, there is provided an image segmentation apparatus including:
a surround-view image acquisition module, configured to collect a plurality of surround-view images at multiple view angles through a plurality of surround-view cameras;
a fusion module, configured to fuse the plurality of surround-view images into a second top view by the image generation method according to any embodiment of the present disclosure;
and an identification module, configured to perform semantic segmentation on the second top view to obtain a category identification result for each grid cell in the second top view.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform an image generation method according to any one of the present disclosure.
According to another aspect of the present disclosure, there is provided a vehicle-mounted terminal including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform an image segmentation method according to any one of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image generation method or the image segmentation method of any one of the present disclosure.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic diagram of an image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of another image generation method provided in accordance with an embodiment of the present disclosure;
- FIG. 2b is a flow chart for iteratively generating a first top view using a multi-head attention mechanism, to which embodiments of the present disclosure are applicable;
FIG. 3 is a schematic diagram of yet another image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 4a is a schematic diagram of yet another image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 4b is a schematic diagram of a series connection of a set of decoding modules suitable for use in the embodiment of the present disclosure;
FIG. 4c is a schematic diagram of the logic operation in a decoding module according to an embodiment of the disclosure;
fig. 5a is a schematic diagram of an image segmentation method provided by an embodiment of the present disclosure;
FIG. 5b is a functional block diagram of an image segmentation method that can be implemented according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an image generation apparatus provided in accordance with an embodiment of the present disclosure;
fig. 7 is a block diagram of an image segmentation apparatus provided in accordance with an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing an image generation method of an embodiment of the present disclosure, or a block diagram of a vehicle-mounted terminal for implementing an image segmentation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of an image generation method provided according to an embodiment of the present disclosure. The embodiment of the present disclosure is applicable to the case where a top view under a bird's eye perspective is generated using a plurality of surround-view images acquired at multiple view angles. The method may be performed by an image generation device, which may be implemented in hardware and/or software, and may generally be integrated in a terminal or server having a data processing function.
As shown in fig. 1, an image generation method provided by the embodiment of the present disclosure includes the following specific steps:
and S110, generating a multi-scale characteristic diagram corresponding to each visual angle according to the all-around images collected under the multi-visual angles.
The multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, wherein the second resolution is higher than the first resolution.
In this embodiment, the looking-around image is a plurality of images respectively collected under a plurality of looking-around viewing angles, wherein the plurality of looking-around viewing angles can be understood as taking a set fixed point as a center, and a plurality of viewing angle points are selected for shooting in continuous directions of front, back, left, right and the like of the fixed point, so as to obtain a plurality of looking-around images equivalent to looking around 360 degrees from the fixed point. Correspondingly, under a visual angle, a panoramic image with a set panoramic range can be acquired.
The multi-scale feature map can be understood as follows: multi-scale image features are extracted from one surround-view image to obtain a plurality of feature maps at different resolutions.
The multi-scale feature map can be further divided into a first feature map at a first resolution and at least one second feature map at a second resolution. The first resolution can be understood as the lowest resolution among the resolutions included in the multi-scale feature map, and a second resolution as any resolution above that lowest resolution.
In one example, if the resolution of a surround-view image A is 768 × 768, the multi-scale feature maps corresponding to image A may include feature map 1 with a resolution of 512 × 512, feature map 2 with a resolution of 256 × 256, and feature map 3 with a resolution of 32 × 32.
In this example, the first resolution is 32 × 32 and the first feature map is feature map 3; the second resolutions are 512 × 512 and 256 × 256, and the second feature maps are feature maps 1 and 2.
It will be appreciated that if the multi-scale feature map includes only two feature maps at two resolutions, then the second resolution and the second feature map at that resolution are both unique. If the multi-scale feature map includes feature maps at three or more resolutions, there are a plurality of second resolutions and a plurality of second feature maps, namely the total number of resolutions included in the multi-scale feature map minus 1.
S120, performing cross-view conversion on the first feature map at each view angle to generate a first top view.
The first top view may be understood as an image acquired from a bird's eye perspective; continuing the fixed-point example, the first top view is the image that would be formed if the scene were captured from directly above the fixed point.
In this embodiment, the top view may be understood as a grid map divided into a plurality of grid cells, i.e., the grids of the grid map, with each grid cell corresponding to a set geographic position range. The grid map may be constructed with the position of the fixed point as the center point, or according to actual longitude and latitude information. Each grid cell contains the image features of one or more objects that appear within the geographic position range matching that grid cell.
In an actual application scene, it is difficult to acquire the first top view directly, so the first top view can be obtained by performing cross-view conversion on the image features of the surround-view images at multiple view angles. Specifically, the first top view may be generated by inverse perspective transformation or by cross-view learning.
Whichever method is used to generate the first top view, a considerable amount of calculation is required, and this amount increases significantly as the resolution of the feature maps of the surround-view images increases. In this embodiment, in order to minimize the amount of computation, the first top view is generated using the first feature maps, which have the lowest resolution, at the multiple view angles. Because the resolution of the first feature maps used to generate the first top view is low, the resolution of the resulting first top view is also low, and it is generally difficult to meet practical application requirements.
By way of example and not limitation, assume that 4 surround-view images are acquired at four view angles respectively and that a first feature map at the first resolution is extracted from each surround-view image; a first top view matching the first resolution can then be obtained by performing cross-view conversion on the 4 first feature maps at the different view angles.
In other words, the operation of S120 sacrifices the accuracy of the first top view in exchange for a large saving in calculation cost, and the accuracy of the first top view therefore needs to be compensated in some way.
S130, sampling in the at least one second feature map at the at least one second resolution at each view angle to obtain sampling-point features of each grid cell in the first top view at the at least one second resolution.
In this embodiment, the higher-precision feature maps obtained by multi-scale feature extraction, i.e., the one or more second feature maps at the second resolution at each view angle, are used to compensate the precision of the first top view.
After the first top view is obtained, the geographic position range corresponding to each grid cell can be determined. Then, according to the known mapping relationship between the different view angles and the geographic position ranges, the features in the second feature maps that fall into each grid cell of the first top view, i.e., the sampling-point features, can be determined.
A sampling-point feature may be the feature of an existing image point in a second feature map, or the feature of an interpolated image point obtained by interpolating the second feature map. Accordingly, sampling-point features can be understood as high-resolution features whose resolution is higher than that of the first top view.
By way of example and not limitation, assume that 4 surround-view images are acquired at four view angles respectively and that two second feature maps at second resolutions are extracted from each surround-view image, for example second feature map 1 at the second resolution 512 × 512 and second feature map 2 at the second resolution 256 × 256. By sampling in these 4 × 2 second feature maps, one or more sampling-point features at the second resolution 512 × 512 and one or more sampling-point features at the second resolution 256 × 256 can be obtained for each grid cell in the first top view.
The sampling of the sampling-point features is computationally simple: it can be accomplished by simple projection mapping or interpolation, so the amount of calculation in this process is small.
S140, fusing the first top view with the sampling-point features to obtain a second top view.
In this embodiment, after the low-resolution first top view is acquired, one or more high-resolution, high-precision feature points are acquired for each grid cell of the first top view. The first top view at the first resolution and the sampling-point features at the high resolutions can then be fused to obtain a high-resolution second top view.
It is understood that the first top view and the second top view contain the same number of grid cells, and each grid cell represents the same geographic position range; the difference between the two is that the grid-cell features in the second top view have a higher resolution and are closer to the image features that would be acquired from an actual bird's eye perspective.
The second top view may be obtained by fusion in a fixed-weight fusion mode or a dynamic-weight fusion mode, which is not limited in this embodiment.
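As a simple illustration of the fixed-weight option, the following sketch fuses the low-resolution first top view with averaged sampling-point features. The tensor shapes, the number of sampling points per grid cell and the 0.5/0.5 weights are assumptions made only for this example and are not values fixed by the present disclosure.

```python
import torch

# Assumed sizes: a 64 x 64 grid of cells, 256-dim features, 8 sampling points per cell.
bev_low = torch.randn(64 * 64, 256)       # first top view: one feature per grid cell
samples = torch.randn(64 * 64, 8, 256)    # high-resolution sampling-point features per cell

w_low, w_high = 0.5, 0.5                  # fixed fusion weights (example values)
bev_high = w_low * bev_low + w_high * samples.mean(dim=1)   # second top view features
```

A dynamic-weight variant would predict w_low and w_high per grid cell (for example with a small fully connected layer) instead of fixing them in advance.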
According to the technical scheme of the embodiment of the present disclosure, a multi-scale feature map corresponding to each view angle is generated from the surround-view images collected at multiple view angles, and the first feature map at each view angle is converted across view angles to generate a first top view; sampling is performed in the at least one second feature map at the at least one second resolution at each view angle to obtain sampling-point features of each grid cell in the first top view at the at least one second resolution; and a second top view is obtained by fusing the first top view with the sampling-point features. In this way, the first top view, whose generation is the most expensive step, is produced from the low-resolution feature maps, and the high-resolution features, which are cheap to obtain, are then fused with the first top view to obtain a high-resolution second top view. The resolution of the top view can thus be effectively improved at low calculation cost.
On the basis of the foregoing embodiments, generating a multi-scale feature map corresponding to each view angle from the surround-view images acquired at multiple view angles may include:
acquiring a plurality of surround-view images respectively acquired at multiple view angles;
and performing multi-scale feature extraction on the surround-view image at each view angle, and taking the feature maps of each surround-view image at a plurality of resolutions as the multi-scale feature map corresponding to that view angle.
In an alternative embodiment, a Multi-Scale Feature Extractor (MSFE) may be trained in advance, and the MSFE is responsible for extracting the multi-scale feature maps from the surround-view image at each view angle. Typically, these feature maps may be denoted {fi}, where the index i is an integer greater than 1 and the resolution of {fi} is 1/2^i of that of the surround-view image; for example, when i = 5, the resolution of feature map {f5} is 1/32 of that of the surround-view image.
In an optional implementation of this embodiment, i takes the three values 3, 4 and 5, where {f5} corresponds to the first feature map, and {f3} and {f4} correspond to the second feature maps at the two second resolutions.
The MSFE may have different designs, such as a backbone network from a classification model (e.g., ResNet), or an attention-based Vision Transformer (ViT) or Pyramid Vision Transformer (PVT), which is not limited in this embodiment.
Through this arrangement, the feature maps at multiple resolutions can be extracted simply and conveniently in one pass for a single surround-view image, meeting the needs of subsequently generating the first top view and acquiring the sampling-point features of each grid cell in the first top view at the at least one second resolution.
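A minimal sketch of such an MSFE is given below. It assumes a ResNet-18 backbone from torchvision purely for illustration; the disclosure allows any backbone (ResNet, ViT, PVT, etc.), so the concrete network, channel sizes and input resolution here are not prescribed by it.

```python
import torch
import torch.nn as nn
import torchvision

class MSFE(nn.Module):
    """Multi-scale feature extractor sketch: returns {f3, f4, f5} for one surround-view image."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)  # 1/4 resolution
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4

    def forward(self, img):                  # img: (B, 3, H, W)
        x = self.stem(img)
        f3 = self.layer2(x)                  # second feature map, 1/8 of the input resolution
        f4 = self.layer3(f3)                 # second feature map, 1/16 of the input resolution
        f5 = self.layer4(f4)                 # first feature map, 1/32 of the input resolution
        return f3, f4, f5

msfe = MSFE()
views = [torch.randn(1, 3, 768, 768) for _ in range(4)]   # 4 surround-view images (assumed)
feats_per_view = [msfe(img) for img in views]             # one multi-scale feature map per view angle
```

A single extractor is shared across the view angles in this sketch, which is a common choice when the surround-view cameras are of the same type.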
Fig. 2a is a flowchart of another image generation method provided according to an embodiment of the present disclosure. In this embodiment, the operation of "converting the first feature map at each view angle across the view angles to generate the first top view" is refined.
As shown in fig. 2a, an image generation method provided in an embodiment of the present disclosure includes the following specific steps:
S210, generating a multi-scale feature map corresponding to each view angle from the surround-view images collected at multiple view angles.
The multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, wherein the second resolution is higher than the first resolution.
And S220, generating a first global feature and a second global feature which respectively correspond to the first feature graph under each view angle.
The global feature generally refers to an overall attribute of the image, and the common global feature may include a color feature, a texture feature, and a shape feature, such as an intensity histogram.
In this embodiment, to better form the first top view, a first global feature and a second global feature are first extracted from the first feature map at each view angle. That is, if three first feature maps at three view angles are obtained, one first global feature and one second global feature are generated for each first feature map, i.e., three first global features and three second global features in total.
The first global feature and the second global feature are generated separately because the features at different positions in a first feature map should not all carry the same weight when the first top view is finally generated. The image feature at a specific position is expected to have a larger (or smaller) weight when the feature of a specific grid cell in the first top view is generated; in order to mine this weight information, the two types of global features, namely a first global feature and a second global feature, are generated for the same first feature map.
Accordingly, the first top view may be synthesized using the second global features, while the first global features describe the weights of the features at different positions in the second global features.
Optionally, two pre-trained global feature extractors may be constructed to extract the first global feature and the second global feature respectively from the first feature map at each view angle. Considering that a fully connected network is the simplest global feature extractor, the first global feature and the second global feature may be extracted using fully connected networks.
Accordingly, generating the first and second global features respectively corresponding to the first feature maps at each view angle may include:
generating first global features respectively corresponding to the first feature maps under each view angle through a first full-connection network;
and generating second global features respectively corresponding to the first feature maps under each view angle through a second fully-connected network.
As described above, the first global feature and the second global feature play different roles in the process of generating the first top view, and therefore, the fully-connected networks with the same structure can be trained respectively based on different training targets, so as to obtain the first fully-connected network and the second fully-connected network which meet actual requirements.
Through the arrangement, two types of global features, namely the first global feature and the second global feature, corresponding to the first feature graph under each visual angle can be simply and conveniently obtained, and then the first top view meeting the requirement can be accurately generated by using the first global feature and the second global feature.
And S230, iteratively generating a first top view by adopting a Multi-Head Attention (MHA) mechanism according to the first global features, the second global features and the position coding values for describing the position relation between the image space and the top view space.
In this embodiment, in order to realize the cross-view conversion, the mapping relationship between the perspective view and the top view needs to be defined. The mapping relationship may be described by one or more position-coding values that describe the positional relationship between the image space and the top-view space.
Wherein the position-coding value may include one or more of a camera-coding value, a preset pixel-position-coding value, and a grid-cell-position-coding value in a top-view space, which correspond to each view, respectively.
After obtaining the position code values, the second global feature in each view may be mapped to a different grid cell in the first top view according to the weight determined by the first global feature.
In order to generate a more accurate first top view, the embodiments of the present disclosure may employ a multi-head attention network based on a multi-head attention mechanism, and generate the first top view through multiple iterations.
In this embodiment, since the first top view is generated by a multi-head attention network, the three important parameters required by such a network, namely the key name parameter (Key), the key value parameter (Value) and the query parameter (Query), need to be determined.
For the application scenario of generating the first top view, these three parameters need to be given actual physical meanings, yielding a target key name parameter, a target key value parameter and a target query parameter.
The target key value parameters are used to generate the first top view, the target key name parameters describe the weight of each feature in the target key value parameters when the first top view is generated, and the target query parameter is the first top view obtained in each iteration round. By performing several iterations (for example 3, 4 or 5) based on the multi-head attention network, a first top view meeting a predetermined accuracy requirement can finally be output.
A flowchart for iteratively generating a first top view using a multi-head attention mechanism to which the embodiments of the present disclosure are applicable is shown in fig. 2 b.
As shown in fig. 2b, iteratively generating the first top view by adopting a multi-head attention mechanism according to the first global features, the second global features and the position encoding values for describing the positional relationship between the image space and the top-view space may include:
s2301, determining target key name parameters applied to the multi-head attention network according to the first global features corresponding to the view angles respectively, the camera coding values corresponding to the view angles respectively and preset pixel position coding values.
For each view angle, the sum of the first global feature corresponding to the view angle, the camera encoding value corresponding to the view angle and the preset pixel position encoding value may be used as the target key name parameter corresponding to that view angle.
In a specific example, assume that the first feature map of the surround-view image at view angle X is {f5}; it is processed by the first fully-connected layer FCk to obtain the first global feature FCk(f5). Assume that a surround-view camera X1 is used to acquire the surround-view image at view angle X; the camera encoding value PE1 corresponding to view angle X is then the encoding value corresponding to camera X1, and PE1 may be a randomly initialized learnable vector obtained by pre-training.
The pixel position encoding value PE2 may be a predetermined trigonometric-function encoding value. Optionally, PE2 has the same spatial size as the first feature map {f5}, for example 32 × 32.
Accordingly, the target key name parameter corresponding to view angle X is K = FCk(f5) + PE1 + PE2. It can be understood that N (N > 1) view angles correspond to N first global features, N camera encoding values and the same pixel position encoding value, so that N target key name parameters are obtained in total, one for each of the N view angles.
S2302, determining each target key value parameter applied to the multi-head attention network according to the second global feature corresponding to each view angle.
In an optional implementation of this embodiment, the second global feature corresponding to each view angle may be directly used as the target key value parameter for that view angle.
Continuing the previous example, assume that the first feature map of the surround-view image at view angle X is {f5}; it is processed by the second fully-connected layer FCv to obtain the second global feature FCv(f5), and the target key value parameter corresponding to view angle X can then be directly set as V = FCv(f5).
Correspondingly, the N (N > 1) view angles correspond to N second global features, so N target key value parameters are obtained in total, one for each of the N view angles.
S2303, under the current iteration round, a first top view obtained by the previous iteration round is obtained and used as a historical top view.
In the present embodiment, the operations of S2303 to S2305 are performed respectively at each iteration round. After each iteration round is executed, a first top view under the iteration round can be obtained.
Correspondingly, in the first iteration round, since there is no first top view from a previous round, an initialization vector (e.g., an all-zero vector) may be constructed as the initial historical top view. From the second iteration round onwards, the first top view obtained in the previous round is taken as the historical top view.
In a specific example, in the iter-th iteration round, the first top view Q_(iter-1) obtained in the (iter-1)-th round can be acquired.
S2304, calculating the target query parameter applied to the multi-head attention network according to the historical top view and the grid cell position encoding values in the top-view space.
The grid cell position encoding values in the top-view space may be predetermined trigonometric-function encoding values corresponding to the positions of the grid cells in the first top view.
Optionally, the sum of the historical top view and the grid cell position encoding value PE3 in the top-view space may be used as the target query parameter Q applied to the multi-head attention network.
In a specific example, in the iter-th iteration round: Q = PE3 + Q_(iter-1).
In the previous example, N first feature maps at N (N > 1) views all correspond to the same target query parameter.
S2305, calculating the first top view of the current iteration round by using the multi-head attention network according to the target key name parameters, the target key value parameters and the target query parameter.
In this embodiment, in the iter-th iteration round, the first top view Q_iter of the current round can be obtained by feeding the N target key name parameters, the N target key value parameters and the single target query parameter of the N view angles into the multi-head attention network.
Optionally, the first top view of the iter-th iteration may be calculated as:
Q_iter = MHA(K, V, Q)
where MHA(.) is the conversion function executed by the multi-head attention network, K denotes the target key name parameters, V denotes the target key value parameters, and Q is the target query parameter.
S2306, judging whether the current iteration count reaches a preset iteration count threshold: if yes, executing S2307; otherwise, returning to S2303 to start a new iteration round.
In this embodiment, in each iteration round, the second global features at each view angle, weighted by the first global features, are converted into the grid cell features of the grid cells in the first top view. Although the first top view is initialized as an all-zero vector, after multiple iterations the grid cell features of each grid cell in the first top view become closer to the top-view features that would be acquired in the real world.
The iteration count threshold can be understood as the total number of iterations, i.e., the number of times the operations of S2303 to S2305 are repeated. It can be understood that the larger the iteration count threshold, the higher the accuracy of the first top view, but also the higher the calculation cost. Those skilled in the art can make a trade-off according to the actual situation and select an iteration count threshold that satisfies both the accuracy requirement and the calculation cost requirement.
And S2307, ending the iteration process and outputting a first top view.
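The sketch below strings S2301 to S2307 together using PyTorch's nn.MultiheadAttention. The number of view angles, feature dimension, top-view grid size and iteration count are assumed example values, fc_k and fc_v stand in for the first and second fully-connected networks described above, and the position encodings are random placeholders rather than trained or trigonometric values.

```python
import torch
import torch.nn as nn

num_views, C, H5, W5 = 4, 256, 32, 32        # assumed: 4 view angles, 32 x 32 first feature maps
bev_h, bev_w, iters = 64, 64, 4              # assumed top-view grid size and iteration count threshold

fc_k = nn.Linear(C, C)                       # first fully-connected network  (key branch)
fc_v = nn.Linear(C, C)                       # second fully-connected network (value branch)
mha = nn.MultiheadAttention(C, num_heads=8, batch_first=True)

f5 = torch.randn(num_views, H5 * W5, C)      # flattened first feature maps, one per view angle
pe1 = torch.randn(num_views, 1, C)           # camera encoding value per view angle
pe2 = torch.randn(1, H5 * W5, C)             # pixel position encoding, shared by all view angles
pe3 = torch.randn(1, bev_h * bev_w, C)       # grid cell position encoding of the top-view space

K = (fc_k(f5) + pe1 + pe2).reshape(1, -1, C)   # N target key name parameters, concatenated
V = fc_v(f5).reshape(1, -1, C)                 # N target key value parameters, concatenated
q = torch.zeros(1, bev_h * bev_w, C)           # initial historical top view: all-zero vector

for _ in range(iters):                         # S2303 to S2306
    Q = pe3 + q                                # target query parameter of this round
    q, _ = mha(Q, K, V)                        # Q_iter = MHA(K, V, Q)

first_top_view = q                             # S2307: low-resolution first top view features
```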
S240, sampling in the second feature map at the at least one second resolution at each view angle to obtain the sampling-point features of each grid cell in the first top view at the at least one second resolution.
S250, fusing the first top view with the sampling-point features to obtain a second top view.
According to the technical scheme of this embodiment of the present disclosure, a first global feature and a second global feature are generated for the first feature map at each view angle, and the first top view is generated iteratively with a multi-head attention mechanism according to the first global features, the second global features and the position encoding values describing the positional relationship between the image space and the top-view space. By applying the mature data processing mechanism of multi-head attention, a first top view meeting the requirements can be generated at a small calculation cost; at the same time, because the generation is iterative, first top views meeting different precision requirements can be produced for different application scenes simply by adjusting the iteration count threshold. The implementation is simple and highly flexible.
On the basis of the foregoing embodiments, after obtaining the second top view according to the first top view and the features of each sampling point by fusion, the method may further include:
and performing semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
In this embodiment, after the high-resolution second top view is acquired, semantic segmentation may be performed on the second top view to obtain a category identification result in each grid cell in the second top view.
Optionally, the category identification result for each grid cell in the second top view may be generated by a pre-trained semantic segmentation network. Specifically, the semantic segmentation network may output, for each grid cell of the second top view, a class probability map with C (C > 1) channels.
The channel number C means that the semantic segmentation network can identify C different classes of objects. By way of example and not limitation, when C = 3 the network may identify three types of objects: people, vehicles and buildings. For each grid cell in the second top view, the network may then output a result such as: person: 0.1, vehicle: 0.88, building: 0.02.
In this embodiment, the semantic segmentation network may be formed by connecting in series a first fully connected layer (FC), a normalization layer (typically BatchNorm), an activation layer (typically ReLU), a second fully connected layer (FC) and a logistic regression layer (typically Sigmoid). Of course, the semantic segmentation model may also be configured according to actual requirements, which is not limited in this embodiment.
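A minimal sketch of this segmentation head, applied independently to each grid cell of the second top view, is shown below; the feature dimension, hidden width and the choice of C = 3 classes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

C_in, C_hidden, num_classes = 256, 128, 3    # assumed channel sizes and class count
seg_head = nn.Sequential(
    nn.Linear(C_in, C_hidden),               # first fully connected layer
    nn.BatchNorm1d(C_hidden),                # normalization layer
    nn.ReLU(),                               # activation layer
    nn.Linear(C_hidden, num_classes),        # second fully connected layer
    nn.Sigmoid(),                            # logistic regression layer
)

bev_cells = torch.randn(64 * 64, C_in)       # grid cell features of the second top view
class_probs = seg_head(bev_cells)            # per-cell class probability map with C channels
```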
Through this arrangement, various types of objects in the surrounding scene can be recognized in time in the high-resolution second top view; furthermore, in fields such as automatic driving or intelligent monitoring, decision information matching the recognition result can be generated better, meeting the actual needs of users.
Fig. 3 is a schematic diagram of still another image generation method provided in an embodiment of the present disclosure, in this embodiment, the operation of "sampling in the second feature map at the at least one second resolution in each view angle to obtain the feature of the sampling point at the at least one second resolution of each grid cell in the first top view" is further refined.
As shown in fig. 3, an image generation method provided by the embodiment of the present disclosure includes the following specific steps:
and S310, generating a multi-scale characteristic diagram corresponding to each visual angle according to the all-around images collected under the multi-visual angles.
The multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, wherein the second resolution is higher than the first resolution.
And S320, performing cross-view conversion on the first characteristic diagram under each view angle to generate a first top view.
S330, obtaining the geographic area range corresponding to each grid cell in the first top view, and selecting a plurality of key points in each geographic area range.
In this embodiment, it is considered that the surround-view image at each view angle can only capture the image features in one or several specific grid cells of the top view. Based on this, a plurality of key points may be selected in each grid cell of the first top view, taking the grid cell as the unit; these key points are then mapped back into the second feature maps, and the sampling-point features of each grid cell in the first top view at the at least one second resolution can be obtained according to the mapping positions of the key points in the second feature maps.
As described above, each grid cell in the first top view corresponds to a geographic area range set in geographic space, and this range may be a solid region of a set shape or a planar region of a set shape. In this embodiment, in order to ensure the dispersion of the sampling points, a solid (for example, a cylinder, a cone, a cube or the like) corresponding to each grid cell may be constructed, using a preset height value, as the geographic area range of that grid cell.
Accordingly, in an optional implementation of this embodiment, obtaining the geographic area range corresponding to each grid cell in the first top view may include:
acquiring the planar rectangular position range of each grid cell in the first top view; and forming the solid area range corresponding to each grid cell according to each planar rectangular position range and a preset height value.
In this embodiment, since the first top view is a grid map, each grid cell is a planar rectangle; therefore, the planar rectangle used to divide each grid cell, or an inscribed rectangle slightly smaller than it, can be directly used as the planar rectangular position range of that grid cell in the first top view. Then, in order to ensure the dispersion of different sampling points, a solid area range corresponding to each grid cell is constructed based on a preset height value (e.g., 3 meters, 4 meters or 5 meters).
Through this arrangement, the dispersion of the sampling-point feature collection can be ensured and the sampling effect improved; the fusion effect and the resolution of the second top view can thereby also be improved.
After the geographic area ranges corresponding to each grid cell are obtained, a plurality of key points for projection need to be selected in each geographic area range. Optionally, the key points may be selected by taking the corner points of the solid area range, by randomly selecting points within the whole solid area range, and so on, which is not limited in this embodiment.
In an optional implementation manner of this embodiment, the respectively selecting a plurality of key points in each geographic area range may include:
and selecting a preset number of spherical neighborhood points as key points in each geographic area range.
The benefit of this arrangement is as follows: by placing a sphere inside the solid corresponding to each geographic area range and selecting a plurality of neighborhood points on that sphere relative to its center (i.e., the grid cell center point), a plurality of key points at the same distance from the grid cell center point can be determined as fairly as possible; the high-resolution sampling-point features of these key points can then be used to enhance the resolution of the first top view and ensure the fusion effect of the second top view.
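The following sketch picks a fixed number of spherical neighborhood points around one grid cell center; the radius, the number of points and the way they are spread over the sphere are assumptions chosen only to illustrate the idea.

```python
import math
import torch

def spherical_keypoints(cx, cy, cz, radius=1.0, n=8):
    """Return n key points spread roughly evenly over a sphere centered at (cx, cy, cz)."""
    pts = []
    for k in range(n):
        theta = 2 * math.pi * k / n                 # azimuth angle
        phi = math.acos(1 - 2 * (k + 0.5) / n)      # polar angle, roughly uniform in height
        pts.append([cx + radius * math.sin(phi) * math.cos(theta),
                    cy + radius * math.sin(phi) * math.sin(theta),
                    cz + radius * math.cos(phi)])
    return torch.tensor(pts)                        # (n, 3) key point coordinates

keypoints = spherical_keypoints(cx=12.0, cy=-3.5, cz=1.5)   # one grid cell center (assumed values)
```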
S340, projecting each key point in each geographic area range to the second feature map under at least one second resolution under each view angle to obtain the characteristic of the sample point of each key point under at least one second resolution.
In this embodiment, it is assumed that the multi-scale feature map generated for each view angle includes a plurality of second feature maps at second resolutions, for example second feature map 1 at the second resolution 512 × 512 and second feature map 2 at the second resolution 256 × 256. After the plurality of key points corresponding to each grid cell are obtained, each key point may be projected into feature map 1 and feature map 2 at each view angle, so as to obtain the sampling-point features of each key point, over all view angles, at the at least one second resolution.
Projecting the key points in each geographic area range into the second feature map at the at least one second resolution at each view angle to obtain the sampling-point features of each key point at the at least one second resolution may include:
acquiring the geographic position coordinates of the currently processed key point in the currently processed geographic area range; identifying, according to the geographic position coordinates, at least one target view angle at which the current key point can be captured; and obtaining the sampling-point features of the current key point at the at least one second resolution according to the projection positions of the current key point in the second feature maps at the at least one second resolution at each target view angle.
Through this setting, the target view angles at which each key point can be captured are quickly located, and the sampling-point features of each key point at each second resolution are then accurately acquired from the second feature maps at each second resolution at those target view angles.
Identifying at least one target view angle capable of shooting the current key point according to the geographic position coordinates may include:
acquiring the projection position of the current key point under each view angle according to the geographical position information and the camera projection matrix of each view angle; and if the projection position of the current key point under the current visual angle is positioned in the image range of the current visual angle, determining the current visual angle as the target visual angle.
The geographic position coordinate of the current key point may be a relative geographic position coordinate of the current key point with respect to the first top view center point, or may also be an absolute geographic position coordinate of the current key point in a geographic coordinate system, which is not limited in this embodiment.
The surround-view image at each view angle is captured by a surround-view camera at that view angle. After the shooting angle of a surround-view camera is fixed, its intrinsic and extrinsic parameters can be calibrated, so that the camera projection matrix of each surround-view camera is obtained and used as the camera projection matrix corresponding to the view angle captured by that camera. Multiplying the geographic position information by the camera projection matrix of each view angle gives the projection position of the current key point at that view angle.
If the projection position of the current key point at a certain view angle is located within the image range of that view angle, the current key point is captured in the surround-view image at that view angle, and the view angle can be taken as a target view angle. If the projection position of the current key point at a certain view angle exceeds the image range of that view angle, the current key point is not captured in the surround-view image at that view angle, and the sampling-point feature of the current key point cannot be obtained from the second feature maps at that view angle.
It will be understood by those skilled in the art that the same current key point may be captured in the surround-view images at multiple view angles (typically 2); in this case, the current key point corresponds to multiple target view angles. Accordingly, one current key point may correspond to one or more target view angles.
The benefit of this arrangement is as follows: through a simple matrix multiplication operation, the target view angles can be quickly screened out for subsequent calculation; the amount of calculation is small and the implementation is easy.
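The sketch below illustrates this screening step for a single key point. It assumes a 3 × 4 camera projection matrix mapping homogeneous geographic coordinates to pixel coordinates; the matrix form, the depth check and the image size are assumptions for illustration rather than details fixed by the disclosure.

```python
import torch

def project_to_view(point_xyz, proj_mat, img_w, img_h):
    """point_xyz: (3,) geographic coordinates; proj_mat: (3, 4) camera projection matrix."""
    p = torch.cat([point_xyz, torch.ones(1)])   # homogeneous coordinates, shape (4,)
    uvw = proj_mat @ p                          # projected homogeneous pixel coordinates, shape (3,)
    if uvw[2] <= 0:                             # point lies behind the camera: not visible
        return None
    u, v = (uvw[0] / uvw[2]).item(), (uvw[1] / uvw[2]).item()
    if 0 <= u < img_w and 0 <= v < img_h:       # inside the image range: this is a target view angle
        return u, v
    return None                                 # outside the image range: not a target view angle

keypoint = torch.tensor([12.0, -3.5, 1.5])           # assumed key point coordinates
proj_mats = [torch.randn(3, 4) for _ in range(4)]    # placeholder projection matrix per view angle
target_views = {}
for i, m in enumerate(proj_mats):
    uv = project_to_view(keypoint, m, 768, 768)
    if uv is not None:                               # record the pixel position in each target view angle
        target_views[i] = uv
```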
Obtaining the sampling-point features of the current key point at the at least one second resolution according to the projection positions of the current key point in the second feature maps at the at least one second resolution at each target view angle may include:
if the projection position of the current key point in the current second feature map at the current second resolution at the current target view angle hits an existing feature point in the current second feature map, taking the feature of that feature point as a candidate feature of the current key point at the current second resolution;
if the projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, obtaining the feature at the projection position by interpolation as a candidate feature of the current key point at the current second resolution;
and obtaining the sampling-point feature of the current key point at the current second resolution according to the candidate features of the current key point acquired at each target view angle.
In this embodiment, if the projection position of the current key point in the current second feature map misses every feature point in the current second feature map, the current second feature map may be interpolated with a set interpolation algorithm, for example a bilinear interpolation algorithm, to expand the feature points it contains, so as to finally obtain the feature of the current key point at its projection position in the current second feature map.
With this arrangement, regardless of whether the current key point coincides with an existing feature point in the current second feature map, its sampling-point feature can be accurately represented by the feature points of the second feature map, providing an accurate and reliable data source for the subsequent fusion stage of the second top view.
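As an illustration of this interpolation step, the sketch below samples one second feature map at a projected sub-pixel position with bilinear interpolation via torch.nn.functional.grid_sample; the feature map size, channel count and projected position are assumed values.

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 256, 96, 96)            # one second feature map, shape (B, C, H, W)
u, v = 40.3, 17.8                             # projected pixel position of the key point in this view
# grid_sample expects coordinates normalized to [-1, 1]; with align_corners=True,
# -1 maps to pixel 0 and +1 maps to pixel W-1 (or H-1).
grid = torch.tensor([[[[2 * u / (96 - 1) - 1,
                        2 * v / (96 - 1) - 1]]]])                        # shape (1, 1, 1, 2)
cand = F.grid_sample(feat, grid, mode='bilinear', align_corners=True)   # shape (1, 256, 1, 1)
candidate_feature = cand.view(-1)             # candidate feature of the key point at this resolution
```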
Obtaining the sampling point characteristics of the current key point at the current second resolution according to the alternative characteristics acquired by the current key point at each target view angle, may include:
and if the number of the acquired alternative features is multiple, performing pooling processing on each alternative feature to obtain the sampling point feature of the current key point under the current second resolution.
In this embodiment, if the current keypoint corresponds to a plurality of target views at the current second resolution, one candidate feature may be obtained at each target view, that is, a situation where the current keypoint corresponds to a plurality of candidate features at the same current second resolution occurs. At this time, the sampling point feature of the current key point at the current second resolution may be generated jointly according to the plurality of candidate features.
The sampling point feature of the current keypoint at the current second resolution can be obtained by performing pooling processing (typically average pooling) on the candidate features.
The benefit of this arrangement is that, when the same keypoint appears in the area captured by multiple views, its sampling point feature at each second resolution can be determined by combining the second feature maps of that resolution from those views, which avoids losing sampling point information to the greatest extent and improves the accuracy of the sampling point features.
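A minimal sketch of this pooling step, assuming `candidates` is the list of candidate features (one per target view) collected for the current keypoint at the current second resolution.

```python
import torch

def fuse_candidates(candidates):
    """candidates: list of (C,) tensors, one per target view of the current keypoint."""
    if len(candidates) == 1:
        return candidates[0]
    return torch.stack(candidates, dim=0).mean(dim=0)   # average pooling across target views
```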
And S350, summarizing the sampling point features of all keypoints belonging to the same grid cell at the same second resolution to obtain the sampling point features of each grid cell in the first top view at at least one second resolution.
In the present embodiment, it is assumed that there are two second resolutions in total, second resolution 1 and second resolution 2. The sampling point features of the keypoints of each grid cell in the first top view at second resolution 1, and those at second resolution 2, then need to be summarized separately.
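A minimal sketch of this summarizing step for one second resolution, assuming `cell_ids[i]` gives the grid cell that keypoint `i` belongs to and `feats[i]` is its sampling point feature; the features of each cell's keypoints are kept together for the later fusion stage.

```python
import torch
from collections import defaultdict

def summarize(cell_ids, feats):
    """cell_ids: (K,) long tensor; feats: (K, C) tensor of keypoint features at one resolution."""
    per_cell = defaultdict(list)
    for cid, f in zip(cell_ids.tolist(), feats):
        per_cell[cid].append(f)
    # each grid cell ends up with a (num_keypoints, C) stack of its keypoint features
    return {cid: torch.stack(fs) for cid, fs in per_cell.items()}
```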
And S360, fusing to obtain a second top view according to the first top view and the characteristics of each sampling point.
In the technical solution of this embodiment of the disclosure, the geographic area ranges corresponding to each grid cell in the first top view are obtained, and a plurality of keypoints are selected within each range; each keypoint within each geographic area range is projected to at least one second feature map under each view to obtain its sampling point feature at at least one second resolution; and the sampling point features of all keypoints belonging to the same grid cell at the same second resolution are summarized to obtain the sampling point features of each grid cell in the first top view at at least one second resolution. This ensures both the dispersion and the accuracy of sampling point feature collection, and thus effectively improves the fusion effect and the resolution of the second top view.
Fig. 4a is a schematic diagram of still another image generation method provided according to an embodiment of the present disclosure. In this embodiment, the operation of "obtaining the second top view by fusing according to the first top view and the characteristics of each sampling point" is further refined.
As shown in fig. 4a, an image generation method provided by the embodiment of the present disclosure includes the following specific steps:
and S410, generating a multi-scale characteristic map corresponding to each visual angle according to the all-around images collected under the multi-visual angles.
Wherein the multi-scale feature map comprises a first feature map at a first resolution, and at least one second feature map at a second resolution, the second resolution being higher than the first resolution;
and S420, performing cross-view conversion on the first characteristic diagram under each view angle to generate a first top view.
And S430, sampling in the second feature map under at least one second resolution under each view angle to obtain the characteristics of the sampling points under at least one second resolution of each grid unit in the first top view.
And S440, inputting the characteristics of the sampling points under the first top view and at least one second resolution into the decoding module set together.
The decoding module set contains a set number of decoding modules connected in series, and the number of decoding modules matches the number of second resolutions.
By way of example and not limitation, if the multi-scale feature map contains only one second resolution, the decoding module set contains only one decoding module; if it contains two second resolutions, the decoding module set contains two decoding modules connected in series; and so on, which is not repeated here.
Optionally, the jointly inputting the first top view and the characteristics of the sampling points at the at least one second resolution into the decoding module set may include:
inputting the first top view into a first decoding module in the set of decoding modules; and respectively inputting the characteristics of each sampling point under each second resolution into different decoding modules along the serial connection direction of the decoding modules according to the sequence of the second resolution from low to high.
The input of the first decoding module is the first top view obtained by performing cross-view conversion on the first feature map under each view, together with the sampling point features at the lowest second resolution; its output is the top view obtained by decoding this input.
The input of the second decoding module is the output top view of the first decoding module and the sampling point features at the second-lowest second resolution; its output is the top view obtained by decoding this input; and so on.
In a specific example, it is assumed that the generation of the multi-scale feature map corresponding to each view in S410 includes feature map 1 at the second resolution 512 × 512 and feature map 2 at the second resolution 256 × 256. After the processing of S430, the sampling point feature { S1} of each grid cell in the first plan view at the second resolution 512 × 512 and the sampling point feature { S2} of each grid cell in the first plan view at the second resolution 256 × 256 are obtained, and a suitable concatenation diagram of a decoding module set for the above types of sampling point features is shown in fig. 4 b.
In fig. 4b, the decoding module set includes two decoding modules, decoding module 1 and decoding module 2. The input of decoding module 1 is the first top view and the sampling point features {S2} at the second resolution 256 × 256, and its output is the top view obtained by decoding the first top view and {S2}; the input of decoding module 2 is the output top view of decoding module 1 and the sampling point features {S1} at the second resolution 512 × 512, and its output is the second top view.
Through the multi-stage decoding process, the resolution of the top view can be improved step by step, and a better fusion effect is obtained with lower calculation cost.
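A minimal sketch of this multi-stage decoding, assuming `decoders` holds the series-connected decoding modules and `samples_low_to_high` holds the sampling point features ordered from the lowest to the highest second resolution; both names are illustrative.

```python
def hierarchical_decode(first_top_view, decoders, samples_low_to_high):
    top_view = first_top_view
    for decoder, samples in zip(decoders, samples_low_to_high):
        # each stage fuses the current top view with one resolution's sampling features
        top_view = decoder(top_view, samples)
    return top_view   # output of the last decoding module is the second top view
```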
And S450, fusing to obtain a new top view and outputting the new top view through each decoding module according to the current input top view and the characteristics of each sampling point under the current input target second resolution.
If the decoding module is the first decoding module in the decoding module set, the current input top view of the decoding module is the first top view, and if the decoding module is not the first decoding module in the decoding module set, the current input top view of the decoding module is the output top view of the previous decoding module.
In this embodiment, the sampling point features at the currently input target second resolution may be fused with equal weights into a unified sampling point feature, which is then fused with the currently input top view; alternatively, weights may first be assigned to the individual sampling point features at the target second resolution, the features weighted and fused according to these weights to obtain a weighted sampling point feature, and the weighted feature then fused with the currently input top view.
In an optional implementation manner of this embodiment, the merging, by each decoding module, to obtain a new top view according to the currently input top view and features of each sampling point at the currently input target second resolution, and outputting the new top view, may include:
carrying out scale adjustment on the current input top view according to the target second resolution through each decoding module to obtain an adjusted top view; generating a first weight map according to the adjusted top view; weighting the characteristics of each sampling point under the currently input target second resolution according to the first weight graph to obtain the weighted characteristics of the sampling points; and fusing the adjusted top view and the weighting characteristics of the sampling points to obtain a new top view and output the new top view.
The benefit of this arrangement is that, by generating a first weight map corresponding to the sampling point features at the target second resolution and using it to weight those features, the resulting weighted sampling point feature is biased as far as possible towards the sampling point features of higher importance, so that the fused new top view achieves higher resolution on the more important objects.
In this optional embodiment, the currently input top view needs to be rescaled according to the target second resolution because its resolution is smaller than the target second resolution; to keep the two at a consistent scale in the subsequent fusion, the currently input top view must be rescaled first. For example, if the currently input top view is the first top view with a resolution of 32 × 32 and the target second resolution is 256 × 256, the resolution of the currently input top view needs to be adjusted from 32 × 32 to 256 × 256 first; the specific adjustment may be interpolation with zero padding or feature point copy filling, which is not limited in this embodiment.
Wherein, performing scale adjustment on the current input top view according to the target second resolution through each decoding module to obtain an adjusted top view, which may include:
and performing convolution and interpolation processing on the current input top view according to the target second resolution to obtain an adjusted top view.
With this arrangement, the size of the currently input top view can be adjusted simply and conveniently to match the target second resolution, satisfying the subsequent top-view fusion requirement.
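A minimal PyTorch sketch of this scale adjustment, under two assumptions: the FC/BN blocks of Fig. 4c act position-wise (1x1 convolutions), and BN is read here as batch normalisation even though the text names it a Bayesian network. The adjustment doubles the resolution bilinearly and halves the channel count.

```python
import torch.nn as nn
import torch.nn.functional as F

class ScaleAdjust(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Conv2d(c, c // 2, 1), nn.BatchNorm2d(c // 2))

    def forward(self, top_view):                       # top_view: (B, C, H, W)
        x = self.fc2(self.fc1(top_view))               # (B, C/2, H, W)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return F.relu(x)                               # adjusted top view: (B, C/2, 2H, 2W)
```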
Generating a first weight map according to the adjusted top view may include:
and inputting the adjusted top view into a first target full connection network (FC) and a first logistic regression network (softmax) in sequence to generate a first weight map.
In this embodiment, a first weight map for describing the feature importance degree of each sampling point can be obtained by sequentially inputting the adjusted top views with the scale matched with the target second resolution into the first target full-connection network and the first logistic regression network which are connected in series and trained in advance.
The first weight map is a weight vector whose elements describe the weight coefficients of the sampling point features at the target second resolution. The higher a weight coefficient, the more important a role the corresponding sampling point feature plays in the overall top view.
With this arrangement, the technically mature fully connected network and logistic regression network can accurately and efficiently identify the positions of the more important sampling point features in the adjusted top view, yielding a more accurate weighted sampling point feature.
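A minimal sketch of the first-weight-map branch, assuming `n_keypoints` sampling point features per grid cell and that the fully connected layer is applied position-wise as a 1x1 convolution; the names are illustrative.

```python
import torch
import torch.nn as nn

class FirstWeightMap(nn.Module):
    def __init__(self, c_half, n_keypoints):
        super().__init__()
        self.fc = nn.Conv2d(c_half, n_keypoints, 1)    # FC(C/2, N) applied per position

    def forward(self, adjusted_top_view):              # (B, C/2, 2H, 2W)
        # softmax over the N channels gives one weight per keypoint at every position
        return torch.softmax(self.fc(adjusted_top_view), dim=1)   # (B, N, 2H, 2W)
```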
In an optional implementation manner of this embodiment, after performing weighting processing on features of each sampling point at the currently input target second resolution according to the first weight map to obtain weighted features of the sampling points, the method may further include:
and adjusting the number of the characteristic channels of the weighted characteristics of the sampling points according to the number of the characteristic channels of the adjusted top view.
In this embodiment, the number of feature channels can be understood as the number of feature dimensions contained in one feature point. Although the adjusted top view is made consistent with the target second resolution of the sampling point features in the scaling stage, the feature dimension (number of feature channels) of the sampling point features may still differ from that of the adjusted top view; to guarantee that the two can be fused in the subsequent step, the number of feature channels of the sampling point features needs to be adjusted to match.
Optionally, the number of feature channels of the weighted sampling point feature may be adjusted to match that of the adjusted top view by inputting the weighted sampling point feature into a fully connected network matched to the number of feature channels of the adjusted top view.
With this arrangement, the channel counts of the top view and of the weighted sampling point feature can be quickly aligned, satisfying the subsequent fusion requirement.
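A minimal sketch of this channel alignment with illustrative channel counts: a position-wise fully connected layer maps the weighted sampling point feature from C1 channels onto the C/2 channels of the adjusted top view.

```python
import torch.nn as nn

c1, c_half = 64, 128                        # assumed channel counts, for illustration only
align_channels = nn.Sequential(
    nn.Conv2d(c1, c_half, kernel_size=1),   # fully connected layer applied per position
    nn.BatchNorm2d(c_half),                 # BN read here as batch normalisation (assumption)
    nn.ReLU(),
)
# usage: weighted = align_channels(weighted)   # (B, C1, 2H, 2W) -> (B, C/2, 2H, 2W)
```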
In an optional implementation manner of this embodiment, the fusing the adjusted top view and the weighting feature of the sampling point to obtain a new top view and output the new top view, which may include:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view; generating a second key name parameter and a second key value parameter corresponding to the sampling point weighting characteristics; generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter; and according to the second weight map and the third weight map, performing weighted summation on the first key value parameter and the second key value parameter, and processing a weighted summation result by adopting a preset activation function to obtain a new top view and output the new top view.
In this optional embodiment, the importance of the adjusted top view and of the weighted sampling point feature is determined based on a key-value mechanism, and the two are weighted and fused according to these importance degrees to obtain and output a new top view. With this arrangement, a fused top view of high resolution can be obtained accurately and efficiently.
Generating a first key name parameter and a first key value parameter corresponding to the adjusted top view may include:
inputting the adjusted top view into a second target full-connection network, a first Bayesian network and a first activation network in sequence to generate a first key name parameter;
inputting the adjusted top view into a third target full-connection network and a second Bayesian network in sequence to generate a first key value parameter;
correspondingly, generating a second key name parameter and a second key value parameter corresponding to the weighted feature of the sampling point may include:
the weighted characteristics of the sampling points are sequentially input into a fourth target full-connection network, a third Bayesian network and a second activation network to generate a second key name parameter;
and sequentially inputting the weighted characteristics of the sampling points to a fifth target full-connection network and a fourth Bayesian network to generate a second key value parameter.
Further, generating the second weight map and the third weight map according to the first key name parameter and the second key name parameter may include:
performing characteristic splicing on the first key name parameter and the second key name parameter to obtain a spliced key name parameter;
inputting the splicing key name parameters into a sixth target full-connection network and a second logistic regression network in sequence to generate a combined weight graph;
and respectively extracting a second weight map and a third weight map from the combined weight map.
For ease of understanding, a schematic diagram of the logical operations in a decoding module to which the embodiments of the present disclosure are applicable is shown in fig. 4 c.
As shown in fig. 4c, it is assumed that the dimension (resolution) of the currently input top view is H × W with C channels, entering as the right-hand input in fig. 4c, while the features of the N sampling points at the currently input target second resolution have dimension N × 2H × 2W with C1 channels, entering as the left-hand input in fig. 4c. That is, the horizontal and vertical resolutions of the sampling point features are both twice those of the currently input top view.
First, the H × W × C top view needs to be rescaled to the scale of the sampling point features. Fig. 4c shows a specific way of doing so: the H × W × C top view is sequentially input to a fully connected network (FC), a Bayesian network (BN), an activation network (ReLU), a fully connected network (FC), a Bayesian network (BN), a bilinear interpolation network (Bilinear) and an activation network (ReLU); the resolution of the top view is thereby doubled and the number of channels halved, giving the shape 2H × 2W × C/2.
After the 2H × 2W × C/2 top view is sequentially passed through a fully connected network (FC(C/2, N)), which adjusts the C/2 channels to N channels, and a logistic regression network (softmax), a weight map of N channels adapted to the sampling point features (i.e., the first weight map) is obtained.
After the 2H × 2W × C/2 top view passes through a fully connected network (FC) and a Bayesian network (BN) in sequence, the key value parameter Φ_value matched with the top view is obtained; meanwhile, the key name parameter Φ_key matched with the top view is obtained by sequentially passing the 2H × 2W × C/2 top view through a fully connected network (FC), a Bayesian network (BN) and an activation network (ReLU).
Further, for the N × 2H × 2W × C1 sampling point features, the features are first weighted and summed with the obtained N-channel weight map, so that the features of the N sampling points are weighted into a single sampling point weighted feature of shape 2H × 2W × C1.
The C1 channels of this sampling point feature are then adjusted to C/2 channels by sequentially passing the 2H × 2W × C1 feature through a fully connected network (FC), a Bayesian network (BN) and an activation network (ReLU). Next, the key value parameter Φ_value matched with the sampling point feature is obtained by sequentially inputting the 2H × 2W × C/2 sampling point feature into a fully connected network (FC) and a Bayesian network (BN); the key name parameter Φ_key matched with the sampling point feature is obtained by sequentially inputting it into a fully connected network (FC), a Bayesian network (BN) and an activation network (typically a ReLU activation function).
The key name parameter Φ_key matched with the top view and the key name parameter Φ_key matched with the sampling point feature are concatenated to obtain a C-channel key name parameter; after this concatenated parameter is input into a fully connected network (FC(C, 2)), which adjusts the C channels to 2 channels, and a logistic regression network (softmax), the two-channel weight maps w1 and w2 are obtained.
Finally, the key value parameter Φ_value matched with the top view and the key value parameter Φ_value matched with the sampling point feature are weighted and summed using w1 and w2 respectively, and the weighted sum is input to an activation network (ReLU), yielding the new top view output by this decoding module.
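Putting the above steps together, the sketch below is one possible PyTorch reading of a single decoding module following the fig. 4c walkthrough. It assumes the FC/BN blocks act position-wise (1x1 convolutions), reads BN as batch normalisation, and uses illustrative names (`c`, `c1`, `n_keypoints`); it is a sketch under these assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodingModule(nn.Module):
    def __init__(self, c, c1, n_keypoints):
        super().__init__()
        half = c // 2
        # scale adjustment: double the resolution, halve the channels
        self.adjust = nn.Sequential(
            nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, half, 1), nn.BatchNorm2d(half))
        # first weight map: FC(C/2, N) + softmax
        self.weight_fc = nn.Conv2d(half, n_keypoints, 1)
        # align the weighted sampling feature from C1 to C/2 channels
        self.sample_align = nn.Sequential(
            nn.Conv2d(c1, half, 1), nn.BatchNorm2d(half), nn.ReLU())
        # key / value heads for the top view and the sampling feature
        self.key_t = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU())
        self.val_t = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half))
        self.key_s = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU())
        self.val_s = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half))
        # FC(C, 2): turns the concatenated key names into two per-position weights
        self.combine = nn.Conv2d(2 * half, 2, 1)

    def forward(self, top_view, samples):
        # top_view: (B, C, H, W); samples: (B, N, C1, 2H, 2W)
        adj = F.relu(F.interpolate(self.adjust(top_view), scale_factor=2,
                                   mode="bilinear", align_corners=False))
        w = torch.softmax(self.weight_fc(adj), dim=1)            # first weight map (B, N, 2H, 2W)
        sample = (samples * w.unsqueeze(2)).sum(dim=1)           # weighted feature (B, C1, 2H, 2W)
        sample = self.sample_align(sample)                       # (B, C/2, 2H, 2W)
        keys = torch.cat([self.key_t(adj), self.key_s(sample)], dim=1)
        w12 = torch.softmax(self.combine(keys), dim=1)           # w1, w2: (B, 2, 2H, 2W)
        fused = w12[:, :1] * self.val_t(adj) + w12[:, 1:] * self.val_s(sample)
        return F.relu(fused)                                     # new top view (B, C/2, 2H, 2W)
```

Under these assumptions each stage doubles the working top-view resolution and halves its channel count, so chaining the modules as in fig. 4b progressively raises the resolution at modest cost.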
S460, obtaining the output top view of the last decoding module as the second top view.
In the technical solution of this embodiment of the disclosure, the first top view and the sampling point features at at least one second resolution are jointly input into the decoding module set; each decoding module fuses the currently input top view with the sampling point features at the currently input target second resolution and outputs a new top view; and the output top view of the last decoding module is taken as the second top view. This improves the fusion effect of the second top view and guarantees the resolution required of the second top view.
It should be noted that, in the embodiments of the present disclosure, the multi-scale feature map may be obtained by a multi-scale feature extractor, the first top view may be generated by a cross-view converter, and the second top view is obtained by fusion of a layered decoder.
The cross-view converter may include: a first fully connected network, a second fully connected network, and a multi-head attention network. The hierarchical decoder may be formed of a plurality of decoding modules connected in series. Each decoding module may include: a first target fully connected network, a second target fully connected network, a third target fully connected network, a fourth target fully connected network, a fifth target fully connected network, a sixth target fully connected network, a first logistic regression network, a second logistic regression network, a first Bayesian network, a second Bayesian network, a third Bayesian network, a fourth Bayesian network, a first activation network, and a second activation network. Specifically, the multi-scale feature extractor, each network in the cross-view converter, and each network in the hierarchical decoder may be obtained by pre-training a machine learning model of a set structure with set training samples.
Fig. 5a is a schematic diagram of an image segmentation method according to an embodiment of the present disclosure. The embodiments of the present disclosure may be applicable to a case where a top view is generated and semantically segmented. The method may be performed by an image segmentation apparatus, which may be implemented in hardware and/or software, and may be generally integrated in a vehicle-mounted terminal.
As shown in fig. 5a, an image segmentation method provided in the embodiment of the present disclosure includes the following specific steps:
and S510, collecting a plurality of all-around images under multiple viewing angles through a plurality of all-around cameras.
In this embodiment, a plurality of around-view cameras may be disposed on a given carrier device (for example, a vehicle, a ship, an aircraft, or the like) to acquire a plurality of around-view images at multiple viewing angles.
Optionally, if the carrier device is a vehicle, the around-view cameras may be disposed at different positions on the vehicle body and configured to collect multiple around-view images at multiple viewing angles centered on the vehicle, achieving a 360-degree look-around of the vehicle.
Accordingly, the vehicle may be a general vehicle, an automatic driving vehicle, or a vehicle with a driving assistance function, and the like, which is not limited in the embodiment. Accordingly, the solution of the embodiment of the present disclosure may be applied to a general driving scenario, an automatic driving scenario, and an auxiliary driving scenario, which is not limited in this embodiment.
And S520, generating a multi-scale characteristic diagram corresponding to each visual angle according to the all-around images collected under the multi-visual angles.
The multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, wherein the second resolution is higher than the first resolution.
S530, converting the first characteristic diagram under each visual angle in a cross-visual-angle mode to generate a first top view.
And S540, in the second feature map under at least one second resolution under each view angle, sampling to obtain the characteristics of the sampling points of each grid unit under at least one second resolution in the first top view.
And S550, fusing to obtain a second top view according to the first top view and the characteristics of each sampling point.
And S560, performing semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
It is understood that the foregoing S520-S550 may be specifically implemented by a method according to any of the foregoing embodiments, and details thereof are not described herein again in the embodiments of the present disclosure.
In the technical solution of this embodiment of the disclosure, a high-precision second top view is generated from the multiple around-view images acquired by the multiple around-view cameras at multiple views using the image generation methods described above, and the category identification result of each grid cell in the second top view is obtained. A high-precision top view is thus produced through low-cost calculation, and on the premise of controllable cost the objects in the surrounding environment can be identified accurately, efficiently and in a timely manner, meeting practical environment-recognition requirements.
On the basis of the foregoing embodiments, after obtaining the result of identifying the category of each grid cell in the second top view, the method may further include:
and generating driving decision information of the vehicle according to the category identification result of each grid unit in the second top view.
In the optional embodiment, after the high-precision second top view is obtained through low calculation cost, the driving decision information can be accurately generated directly based on the semantic recognition result of the second top view, and the safety of the vehicle driving process is ensured.
By way of example and not limitation, a functional block diagram to which the image segmentation method of an embodiment of the present disclosure can be applied is shown in fig. 5b. As shown in fig. 5b, around-view images at S (S > 1) viewing angles are acquired by a plurality of around-view cameras disposed on the vehicle, and the S around-view images are input to the multi-scale feature extractor to obtain the multi-scale feature map corresponding to each viewing angle. The multi-scale feature map serves two purposes: the lowest-resolution feature map is input into the cross-view converter to obtain a low-resolution top view, while one or more high-resolution feature maps are sampled over the cube regions corresponding to the top-view grid cells and then input, together with the low-resolution top view, into the hierarchical decoder, which fuses them into a high-resolution top view. Finally, the high-resolution top view is semantically segmented by the semantic segmenter to obtain the object recognition result of the environment around the vehicle.
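A minimal end-to-end sketch of this block diagram, assuming the five trained components are passed in as callables; their interfaces here are hypothetical, not the patent's actual module signatures.

```python
def segment_surroundings(surround_images, extractor, converter, sampler, decoder, segmenter):
    multi_scale = [extractor(img) for img in surround_images]            # per-view feature pyramids
    first_top_view = converter([m["lowest"] for m in multi_scale])       # low-resolution top view
    samples = sampler(first_top_view, [m["high"] for m in multi_scale])  # cube-region sampling
    second_top_view = decoder(first_top_view, samples)                   # hierarchical fusion
    return segmenter(second_top_view)                                    # per-grid-cell class labels
```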
As an implementation of each of the above image generation methods, the present disclosure also provides an optional embodiment of an execution device that implements each of the above image generation methods.
Fig. 6 is a structural diagram of an image generation apparatus provided according to an embodiment of the present disclosure, as shown in fig. 6, the image generation apparatus including: a multi-scale feature map acquisition module 610, a first top view generation module 620, a sampling module 630, and a second top view fusion module 640, wherein:
a multi-scale feature map obtaining module 610, configured to generate a multi-scale feature map corresponding to each view according to multiple ring-view images collected under multiple views, where the multi-scale feature map includes a first feature map at a first resolution and at least one second feature map at a second resolution, and the second resolution is higher than the first resolution;
a first top view generating module 620, configured to perform cross-view conversion on the first feature map at each view angle to generate a first top view;
the sampling module 630 is configured to sample, in the second feature map at least one second resolution at each view angle, a sample point feature of each grid unit at the at least one second resolution in the first top view;
and a second top view fusing module 640, configured to fuse the first top view and the features of the sampling points to obtain a second top view.
In the technical solution of this embodiment of the disclosure, multi-scale feature maps corresponding to each view are generated from the around-view images collected at multiple views, and the first feature map under each view is converted across views to generate a first top view; sampling is performed in the second feature map at at least one second resolution under each view to obtain the sampling point features of each grid cell in the first top view at at least one second resolution; and a second top view is obtained by fusion according to the first top view and the sampling point features. The first top view, whose cross-view conversion carries the higher calculation cost, is generated from the low-resolution feature map, and is then fused with the high-resolution features at a lower calculation cost to obtain the high-resolution second top view. The resolution of the top view can therefore be effectively improved with low-cost calculation.
On the basis of the foregoing embodiments, the multi-scale feature map acquisition module may be configured to:
acquiring a plurality of all-around images respectively acquired under multiple visual angles;
and respectively carrying out multi-scale feature extraction on the all-around images under each view angle, and acquiring feature maps of each all-around image under multiple resolutions as the multi-scale feature maps respectively corresponding to each view angle.
On the basis of the foregoing embodiments, the first top view generating module may include:
the global feature generation unit is used for generating a first global feature and a second global feature which respectively correspond to the first feature graph under each view angle;
and the iteration generating unit is used for generating the first top view iteratively by adopting a multi-head attention mechanism according to the first global features, the second global features and the position coding value for describing the position relation between the image space and the top view space.
On the basis of the foregoing embodiments, the global feature generation unit may be configured to:
generating first global features respectively corresponding to the first feature maps under each view angle through a first full-connection network;
and generating second global features respectively corresponding to the first feature maps under each view angle through a second fully-connected network.
On the basis of the foregoing embodiments, the iteration generating unit may include:
the target key name determining subunit is used for determining each target key name parameter applied to the multi-head attention network according to the first global feature corresponding to each view angle, the camera coding value corresponding to each view angle and a preset pixel position coding value;
the target key value determining subunit is used for determining each target key value parameter applied to the multi-head attention network according to the second global feature corresponding to each view angle;
and the multi-head attention iteration unit is used for iteratively generating a first top view by adopting a multi-head attention network according to each target key name parameter, each target key value parameter and a preset grid unit position coding value in the overlooking space.
On the basis of the foregoing embodiments, wherein the multi-head attention iteration unit may be configured to:
under each iteration turn, acquiring a first top view obtained by the previous iteration turn as a historical top view;
calculating target query parameters applied to the multi-head attention network according to the historical top view and the grid-cell position encoding values in the top-view space;
and calculating to obtain a first top view under the current iteration turn by adopting a multi-head attention network according to each target key name parameter, each target key value parameter and each target query parameter.
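A minimal sketch of this iterative scheme, using `torch.nn.MultiheadAttention` as a stand-in for the multi-head attention network and treating the camera, pixel-position and grid-cell position encodings as precomputed tensors; all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def iterate_first_top_view(first_glob, second_glob, cam_enc, pix_enc, grid_enc,
                           attn: nn.MultiheadAttention, num_iters: int = 4):
    # first_glob / second_glob: (L, B, D) flattened global features from all views
    # cam_enc, pix_enc: (L, B, D) encodings; grid_enc: (M, B, D) grid-cell position encoding
    keys = first_glob + cam_enc + pix_enc          # target key name parameters
    values = second_glob                           # target key value parameters
    top_view = torch.zeros_like(grid_enc)          # initial (empty) top view tokens
    for _ in range(num_iters):
        query = top_view + grid_enc                # query built from the historical top view
        top_view, _ = attn(query, keys, values)    # first top view for the current round
    return top_view
```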
On the basis of the foregoing embodiments, the sampling module may include:
the key point selecting unit is used for acquiring geographic area ranges respectively corresponding to each grid unit in the first top view and respectively selecting a plurality of key points in each geographic area range;
the sampling point feature acquisition unit is used for projecting each key point in each geographic area range to a second feature map under at least one second resolution under each view angle to obtain the sampling point feature of each key point under at least one second resolution;
and the sampling point characteristic summarizing unit is used for summarizing the sampling point characteristics of all key points belonging to the same grid unit under the same second resolution ratio to obtain the sampling point characteristics of each grid unit under at least one second resolution ratio in the first top view.
On the basis of the foregoing embodiments, the key point selecting unit may be configured to:
acquiring a planar rectangular position range of each grid unit in a first top view;
and forming a cubic area range corresponding to each grid unit according to the position range of each plane rectangle and a preset height value.
On the basis of the foregoing embodiments, the key point selecting unit may be configured to:
and selecting a preset number of spherical neighborhood points as key points in each geographic area range.
On the basis of the foregoing embodiments, the sampling point feature obtaining unit may include:
the geographic position coordinate acquisition subunit is used for acquiring the geographic position coordinate of the current key point in the current processing geographic area range;
the target view angle identification subunit is used for identifying at least one target view angle capable of shooting the current key point according to the geographic position coordinates;
and the sampling point characteristic acquisition subunit is used for acquiring the sampling point characteristics of the current key point under at least one second resolution according to the current projection position of the current key point in the second characteristic diagram under at least one second resolution under each target view angle.
On the basis of the foregoing embodiments, wherein the target view identification subunit is configured to:
acquiring the projection position of the current key point under each visual angle according to the geographical position information and the camera projection matrix of each visual angle;
and if the projection position of the current key point under the current visual angle is positioned in the image range of the current visual angle, determining the current visual angle as the target visual angle.
On the basis of the foregoing embodiments, the sampling point feature obtaining subunit may be configured to:
if the current projection position of the current keypoint in the current second feature map, at the current second resolution under the current target view angle, hits an existing feature point in the current second feature map, taking the feature of that feature point as the candidate feature of the current keypoint at the current second resolution;
if the current projection position of the current keypoint in the current second feature map does not hit any feature point in the current second feature map, interpolating to obtain the feature at the current projection position as the candidate feature of the current keypoint at the current second resolution;
and obtaining the sampling point feature of the current keypoint at the current second resolution according to the candidate features of the current keypoint acquired under each target view angle.
On the basis of the foregoing embodiments, the sampling point feature obtaining subunit may be further configured to:
if a plurality of candidate features are acquired, performing pooling processing on the candidate features to obtain the sampling point feature of the current keypoint at the current second resolution.
On the basis of the foregoing embodiments, the second top view fusion module may include:
the joint input unit is used for inputting the characteristics of each sampling point under the first top view and at least one second resolution into the decoding module set together;
the decoding module set is connected with a set number of decoding modules in series, and the number of the decoding modules connected in series is matched with the number of the second resolution;
the serial output unit is used for fusing to obtain a new top view and outputting the new top view through each decoding module according to the current input top view and the characteristics of each sampling point under the current input target second resolution;
and the second top view acquisition unit is used for acquiring the output top view of the decoding module of the last bit as a second top view.
On the basis of the foregoing embodiments, wherein the joint input unit may be configured to:
inputting the first top view into a first decoding module in the set of decoding modules;
and respectively inputting the characteristics of each sampling point under each second resolution into different decoding modules along the serial connection direction of the decoding modules according to the sequence of the second resolution from low to high.
On the basis of the foregoing embodiments, wherein the series output unit may include:
the adjustment output subunit is used for carrying out scale adjustment on the current input top view according to the target second resolution through each decoding module to obtain an adjusted top view;
a first weight map generating subunit, configured to generate a first weight map according to the adjusted top view;
the sampling point weighting characteristic obtaining subunit is used for carrying out weighting processing on the characteristics of each sampling point under the currently input target second resolution according to the first weight map to obtain the weighting characteristics of the sampling points;
and the top view output subunit is used for fusing the adjusted top view and the weighting characteristics of the sampling points through each decoding module to obtain and output a new top view.
On the basis of the foregoing embodiments, wherein the adjustment output subunit is configured to:
and carrying out convolution and interpolation processing on the current input top view according to the target second resolution to obtain an adjusted top view.
On the basis of the foregoing embodiments, the first weight map generating subunit may be configured to:
and inputting the adjusted top view into the first target full-connection network and the first logistic regression network in sequence to generate a first weight graph.
On the basis of the foregoing embodiments, the apparatus may further include a channel number adjusting unit, configured to:
and after weighting processing is carried out on the characteristics of each sampling point under the currently input target second resolution according to the first weight graph to obtain the weighting characteristics of the sampling points, the characteristic channel number of the weighting characteristics of the sampling points is adjusted according to the characteristic channel number of the adjusted top view.
On the basis of the foregoing embodiments, the top view output subunit may be configured to:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view;
generating a second key name parameter and a second key value parameter corresponding to the weighting characteristic of the sampling point;
generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter;
and according to the second weight map and the third weight map, performing weighted summation on the first key value parameter and the second key value parameter, and processing a weighted summation result by adopting a preset activation function to obtain a new top view and output the new top view.
On the basis of the foregoing embodiments, the top view output subunit may be further configured to:
inputting the adjusted top view into a second target full-connection network, a first Bayesian network and a first activation network in sequence to generate a first key name parameter;
inputting the adjusted top view into a third target full-connection network and a second Bayesian network in sequence to generate a first key value parameter;
the weighted characteristics of the sampling points are sequentially input into a fourth target full-connection network, a third Bayesian network and a second activation network to generate a second key name parameter;
and sequentially inputting the weighted characteristics of the sampling points to a fifth target full-connection network and a fourth Bayesian network to generate a second key value parameter.
On the basis of the above embodiments, the top view output subunit may further be configured to:
performing characteristic splicing on the first key name parameter and the second key name parameter to obtain a spliced key name parameter;
inputting the parameters of the splicing key names into a sixth target full-connection network and a second logistic regression network in sequence to generate a combined weight graph;
and respectively extracting a second weight map and a third weight map from the combined weight map.
On the basis of the above embodiments, the method may further include: and the semantic segmentation module is used for performing semantic segmentation on the second top view after the second top view is obtained by fusion according to the first top view and the characteristics of each sampling point, and acquiring the category identification result of each grid unit in the second top view.
The product can execute the method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
As an implementation of each of the above image segmentation methods, the present disclosure also provides an optional embodiment of an execution device that implements each of the above image segmentation methods.
Fig. 7 is a structural diagram of an image segmentation apparatus provided according to an embodiment of the present disclosure. As shown in fig. 7, the image segmentation apparatus includes: a look-around image acquisition module 710, a fusion module 720, and an identification module 730. Wherein:
the around-looking image acquiring module 710 is configured to acquire a plurality of around-looking images at multiple viewing angles through a plurality of around-looking cameras;
a fusion module 720, configured to fuse the multiple ring-view images to obtain a second top view through the image generation method according to any embodiment of the present disclosure;
the identifying module 730 is configured to perform semantic segmentation on the second top view to obtain a category identifying result of each grid cell in the second top view.
In the technical solution of this embodiment of the disclosure, a high-precision second top view is generated from the multiple around-view images acquired by the multiple around-view cameras at multiple views using the image generation methods described above, and the category identification result of each grid cell in the second top view is obtained; the high-precision top view is thus generated through low-cost calculation.
On the basis of the above embodiments, the method may further include: and the driving decision information generating module is used for generating the driving decision information of the vehicle according to the class identification result of each grid unit in the second top view after the class identification result of each grid unit in the second top view is obtained.
The product can execute the method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device (in-vehicle terminal), a readable storage medium, and a computer program product according to an embodiment of the present disclosure.
FIG. 8 shows a schematic block diagram of an example electronic device (in-vehicle terminal) 800 that may be used to implement embodiments of the present disclosure. Electronic devices (vehicle terminals) are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, an electronic device (in-vehicle terminal) 800 includes a computing unit 801 that can execute various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device (in-vehicle terminal) 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A plurality of components in the electronic apparatus (in-vehicle terminal) 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device (in-vehicle terminal) 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may perform the methods described above, for example the image generation method, which comprises the following steps:
generating a multi-scale feature map corresponding to each view angle respectively according to a panoramic image acquired under multiple view angles, wherein the multi-scale feature map comprises a first feature map under first resolution and at least one second feature map under second resolution, and the second resolution is higher than the first resolution;
performing cross-view conversion on the first feature map under each view angle to generate a first top view;
sampling in at least one second feature map under at least one second resolution under each view angle to obtain the characteristics of sampling points of each grid unit under at least one second resolution in the first top view;
and fusing to obtain a second top view according to the first top view and the characteristics of each sampling point.
And, the image segmentation method, comprising:
collecting a plurality of all-around images under multiple visual angles through a plurality of all-around cameras;
generating a multi-scale feature map corresponding to each visual angle respectively according to the panoramic images collected under the multiple visual angles, wherein the multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
performing cross-view conversion on the first characteristic diagram under each view angle to generate a first top view;
sampling in at least one second feature map under at least one second resolution under each view angle to obtain the characteristics of sampling points of each grid unit under at least one second resolution in the first top view;
according to the first top view and the characteristics of each sampling point, fusing to obtain a second top view;
and performing semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
Artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing refers to a technical system that accesses a flexibly scalable shared pool of physical or virtual resources through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices and the like, and can be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications and model training in fields such as artificial intelligence and blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, as long as the desired results of the technical solutions provided by the present disclosure can be achieved, which is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (50)
1. An image generation method, comprising:
generating a multi-scale feature map corresponding to each of multiple view angles according to surround-view images collected at the multiple view angles, wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, and the second resolution is higher than the first resolution;
performing cross-view conversion on the first feature map at each view angle to generate a first top view;
sampling in the second feature map at the at least one second resolution at each view angle to obtain sampling point features of each grid unit in the first top view at the at least one second resolution;
and fusing the first top view with the sampling point features to obtain a second top view.
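For illustration only (not part of the claims): a minimal sketch, in PyTorch-style Python, of how the four steps of claim 1 could be chained. The helper callables backbone, cross_view_transformer, sample_grid_features and fuse_top_views are hypothetical placeholders, not components defined by this disclosure.

```python
from typing import List
import torch


def generate_second_top_view(
    surround_images: List[torch.Tensor],   # one image tensor per view angle
    backbone,                              # hypothetical multi-scale feature extractor
    cross_view_transformer,                # hypothetical low-resolution cross-view conversion
    sample_grid_features,                  # hypothetical per-grid-unit sampler
    fuse_top_views,                        # hypothetical decoder-style fusion
) -> torch.Tensor:
    # Step 1: multi-scale feature maps per view angle; index 0 is the first
    # (lowest) resolution, the remaining entries are the higher second resolutions.
    multi_scale = [backbone(img) for img in surround_images]
    first_maps = [scales[0] for scales in multi_scale]
    second_maps = [scales[1:] for scales in multi_scale]

    # Step 2: cross-view conversion of the low-resolution maps into a first top view.
    first_top_view = cross_view_transformer(first_maps)

    # Step 3: sample high-resolution features for every grid unit of the first top view.
    sample_features = sample_grid_features(first_top_view, second_maps)

    # Step 4: fuse the first top view with the sampled features into the second top view.
    return fuse_top_views(first_top_view, sample_features)
```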
2. The method of claim 1, wherein generating a multi-scale feature map corresponding to each view angle according to the surround-view images collected at the multiple view angles comprises:
acquiring a plurality of surround-view images respectively collected at the multiple view angles;
and performing multi-scale feature extraction on the surround-view image at each view angle, and taking the feature maps of each surround-view image at multiple resolutions as the multi-scale feature map corresponding to that view angle.
3. The method of claim 1, wherein performing cross-view conversion on the first feature map at each view angle to generate the first top view comprises:
generating a first global feature and a second global feature respectively corresponding to the first feature map at each view angle;
and iteratively generating the first top view by using a multi-head attention mechanism according to the first global features, the second global features, and position coding values describing the positional relationship between the image space and the top view space.
4. The method of claim 3, wherein generating first and second global features corresponding to the first feature map at each perspective, respectively, comprises:
generating the first global features respectively corresponding to the first feature maps at each view angle through a first fully connected network;
and generating the second global features respectively corresponding to the first feature maps at each view angle through a second fully connected network.
5. The method of claim 3, wherein iteratively generating the first top view using a multi-head attention mechanism based on the first global features, the second global features, and the position-coding values describing the positional relationship between the image space and the top-view space comprises:
determining each target key name parameter applied to a multi-head attention network according to the first global feature corresponding to each view angle, a camera coding value corresponding to each view angle, and a preset pixel position coding value;
determining each target key value parameter applied to the multi-head attention network according to the second global feature corresponding to each view angle;
and iteratively generating the first top view by using the multi-head attention network according to each target key name parameter, each target key value parameter, and preset grid unit position coding values in the top view space.
6. The method of claim 5, wherein iteratively generating the first top view by using the multi-head attention network according to each target key name parameter, each target key value parameter, and the preset grid unit position coding values in the top view space comprises:
in each iteration round, acquiring the first top view obtained in the previous iteration round as a historical top view;
calculating target query parameters applied to the multi-head attention network according to the historical top view and the grid unit position coding values in the top view space;
and calculating the first top view for the current iteration round by using the multi-head attention network according to each target key name parameter, each target key value parameter, and each target query parameter.
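For illustration only (not part of the claims): a sketch of the iterative cross-view conversion of claims 5-6, built on PyTorch's nn.MultiheadAttention. The tensor shapes, the number of iterations, and forming the query as the historical top view plus the grid unit position coding values are assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class CrossViewConverter(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, iterations: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iterations = iterations

    def forward(self, keys, values, grid_pos_enc, init_top_view):
        # keys:          target key name parameters,  shape (B, N_pixels, dim)
        # values:        target key value parameters, shape (B, N_pixels, dim)
        # grid_pos_enc:  grid unit position coding values, shape (B, N_cells, dim)
        # init_top_view: initial top-view queries, shape (B, N_cells, dim)
        top_view = init_top_view
        for _ in range(self.iterations):
            # Target query parameters: historical top view + grid unit position coding.
            query = top_view + grid_pos_enc
            top_view, _ = self.attn(query, keys, values)
        return top_view
```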
7. The method of claim 1, wherein sampling in the second feature map at the at least one second resolution at each view angle to obtain the sampling point features of each grid unit in the first top view at the at least one second resolution comprises:
acquiring geographic area ranges respectively corresponding to each grid unit in the first top view, and respectively selecting a plurality of key points within each geographic area range;
projecting each key point within each geographic area range into the second feature map at the at least one second resolution at each view angle to obtain sampling point features of each key point at the at least one second resolution;
and aggregating the sampling point features of all key points belonging to the same grid unit at the same second resolution to obtain the sampling point features of each grid unit in the first top view at the at least one second resolution.
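For illustration only (not part of the claims): a sketch of the sampling loop of claim 7 — select key points per grid unit, project them into the higher-resolution feature maps, and aggregate per grid unit. The helpers select_key_points and project_key_point, the [view][resolution] data layout, and the use of a mean as the aggregation are assumptions.

```python
import torch


def sample_grid_unit_features(grid_units, second_maps, select_key_points, project_key_point):
    # grid_units:  geographic area ranges, one per grid unit of the first top view
    # second_maps: second feature maps, indexed as second_maps[view][resolution]
    per_unit_features = []
    for area in grid_units:
        key_points = select_key_points(area)            # a few 3D points inside the area
        per_resolution = []
        for res in range(len(second_maps[0])):
            feats = [project_key_point(p, second_maps, res) for p in key_points]
            # Aggregate all key points of the same grid unit at the same resolution.
            per_resolution.append(torch.stack(feats).mean(dim=0))
        per_unit_features.append(per_resolution)
    return per_unit_features
```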
8. The method of claim 7, wherein acquiring the geographic area ranges respectively corresponding to each grid unit in the first top view comprises:
acquiring a planar rectangular position range of each grid unit in the first top view;
and forming a cubic area range corresponding to each grid unit according to each planar rectangular position range and a preset height value.
9. The method of claim 8, wherein selecting a plurality of key points within each geographic area respectively comprises:
and selecting a preset number of spherical neighborhood points as key points in each geographic area range.
10. The method of claim 7, wherein projecting each key point within each geographic area range into the second feature map at the at least one second resolution at each view angle to obtain the sampling point features of each key point at the at least one second resolution comprises:
acquiring geographic position coordinates of a current key point within the geographic area range currently being processed;
identifying, according to the geographic position coordinates, at least one target view angle from which the current key point can be captured;
and obtaining the sampling point features of the current key point at the at least one second resolution according to the current projection position of the current key point in the second feature map at the at least one second resolution at each target view angle.
11. The method of claim 10, wherein identifying, according to the geographic position coordinates, the at least one target view angle from which the current key point can be captured comprises:
acquiring the projection position of the current key point at each view angle according to the geographic position coordinates and the camera projection matrix of each view angle;
and if the projection position of the current key point at a current view angle falls within the image range of the current view angle, determining the current view angle as a target view angle.
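For illustration only (not part of the claims): a sketch of the visibility test of claim 11, assuming a pinhole-style 3x4 projection matrix per view angle; the matrix layout and image-size convention are assumptions.

```python
import numpy as np


def visible_views(point_xyz, projection_matrices, image_sizes):
    """Return (view index, pixel position) pairs for the views that can see the key point."""
    targets = []
    p = np.append(np.asarray(point_xyz, dtype=float), 1.0)      # homogeneous coordinates
    for idx, (P, (width, height)) in enumerate(zip(projection_matrices, image_sizes)):
        uvw = P @ p                                              # 3x4 camera projection matrix
        if uvw[2] <= 0:                                          # point behind the camera
            continue
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
        if 0 <= u < width and 0 <= v < height:                   # inside the image range
            targets.append((idx, (u, v)))
    return targets
```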
12. The method of claim 10, wherein obtaining the sampling point features of the current key point at the at least one second resolution according to the current projection position of the current key point in the second feature map at the at least one second resolution at each target view angle comprises:
if the current projection position of the current key point in a current second feature map at a current second resolution at a current target view angle hits a feature point in the current second feature map, taking the feature of that feature point as a candidate feature of the current key point at the current second resolution;
if the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, obtaining the feature at the current projection position by interpolation as a candidate feature of the current key point at the current second resolution;
and obtaining the sampling point feature of the current key point at the current second resolution according to the candidate features of the current key point acquired at each target view angle.
13. The method of claim 12, wherein obtaining the sampling point feature of the current key point at the current second resolution according to the candidate features of the current key point acquired at each target view angle comprises:
if a plurality of candidate features are acquired, pooling the candidate features to obtain the sampling point feature of the current key point at the current second resolution.
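For illustration only (not part of the claims): a sketch of claims 12-13 — read a feature at a possibly fractional projection position by bilinear interpolation, then pool the per-view candidate features. The (C, H, W) layout and the choice of max pooling are assumptions.

```python
import torch
import torch.nn.functional as F


def sample_at(feature_map: torch.Tensor, u: float, v: float) -> torch.Tensor:
    # feature_map: (C, H, W); (u, v) is the projection position in pixels.
    _, h, w = feature_map.shape
    # Normalize the position to [-1, 1] as required by grid_sample (bilinear by default).
    grid = torch.tensor([[[[2 * u / (w - 1) - 1, 2 * v / (h - 1) - 1]]]], dtype=feature_map.dtype)
    out = F.grid_sample(feature_map.unsqueeze(0), grid, align_corners=True)
    return out.view(-1)                                  # (C,) candidate feature


def pool_candidates(candidates):
    # candidates: list of (C,) features collected from the different target view angles.
    return torch.stack(candidates).max(dim=0).values if len(candidates) > 1 else candidates[0]
```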
14. The method of claim 1, wherein fusing the first top view with the sampling point features to obtain the second top view comprises:
jointly inputting the first top view and the sampling point features at the at least one second resolution into a decoding module set;
wherein the decoding module set comprises a set number of decoding modules connected in series, and the number of decoding modules connected in series matches the number of second resolutions;
fusing, by each decoding module, the currently input top view with the sampling point features at the currently input target second resolution to obtain and output a new top view;
and acquiring the top view output by the last decoding module as the second top view.
15. The method of claim 14, wherein jointly inputting the first top view and the sampling point features at the at least one second resolution into the decoding module set comprises:
inputting the first top view into the first decoding module in the decoding module set;
and inputting the sampling point features at each second resolution into different decoding modules along the series connection direction of the decoding modules, in order of second resolution from low to high.
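For illustration only (not part of the claims): a sketch of the decoding module set of claims 14-15 — decoding modules connected in series, the first receiving the first top view, and each module additionally receiving the sampling point features of one second resolution in low-to-high order. DecodingModule here is a placeholder for the fusion of claims 16-22.

```python
import torch.nn as nn


class DecodingModuleSet(nn.Module):
    def __init__(self, decoding_modules):
        super().__init__()
        # One decoding module per second resolution, connected in series.
        self.stages = nn.ModuleList(decoding_modules)

    def forward(self, first_top_view, sample_features_per_resolution):
        # sample_features_per_resolution is ordered from low to high second resolution.
        top_view = first_top_view
        for stage, sample_features in zip(self.stages, sample_features_per_resolution):
            top_view = stage(top_view, sample_features)
        return top_view        # output of the last decoding module = second top view
```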
16. The method of claim 15, wherein fusing, by each decoding module, the currently input top view with the sampling point features at the currently input target second resolution to obtain and output the new top view comprises:
performing, by each decoding module, scale adjustment on the currently input top view according to the target second resolution to obtain an adjusted top view;
generating a first weight map according to the adjusted top view;
weighting the sampling point features at the currently input target second resolution according to the first weight map to obtain weighted sampling point features;
and fusing the adjusted top view with the weighted sampling point features to obtain and output the new top view.
17. The method of claim 16, wherein performing, by each decoding module, scale adjustment on the currently input top view according to the target second resolution to obtain the adjusted top view comprises:
performing convolution and interpolation processing on the currently input top view according to the target second resolution to obtain the adjusted top view.
18. The method of claim 16, wherein generating the first weight map according to the adjusted top view comprises:
inputting the adjusted top view into a first target fully connected network and a first logistic regression network in sequence to generate the first weight map.
19. The method of claim 16, further comprising, after weighting the sampling point features at the currently input target second resolution according to the first weight map to obtain the weighted sampling point features:
adjusting the number of feature channels of the weighted sampling point features according to the number of feature channels of the adjusted top view.
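For illustration only (not part of the claims): a sketch of the per-module preprocessing of claims 16-19 — resize the incoming top view by convolution plus interpolation, derive a first weight map from it, weight the sampling point features, and match their channel count. The layer sizes, the channels-last layout of the sampling point features, and the use of a sigmoid for the logistic regression network are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecodingModulePrep(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, sample_channels: int):
        super().__init__()
        self.adjust_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.weight_fc = nn.Linear(out_channels, 1)      # stand-in for the first target fully connected network
        self.channel_fc = nn.Linear(sample_channels, out_channels)

    def forward(self, top_view, sample_features, target_size):
        # Claim 17: convolution followed by interpolation to the target second resolution.
        adjusted = F.interpolate(self.adjust_conv(top_view), size=target_size,
                                 mode="bilinear", align_corners=False)        # (B, C_out, H, W)
        # Claim 18: first weight map from the adjusted top view (FC + sigmoid).
        w = torch.sigmoid(self.weight_fc(adjusted.permute(0, 2, 3, 1)))       # (B, H, W, 1)
        # Claims 16 and 19: weight the sampling point features, then match channel count.
        weighted = self.channel_fc(sample_features * w)                       # (B, H, W, C_out)
        return adjusted, weighted
```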
20. The method of claim 19, wherein fusing the adjusted top view with the weighted sampling point features to obtain and output the new top view comprises:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view;
generating a second key name parameter and a second key value parameter corresponding to the weighted sampling point features;
generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter;
and performing, according to the second weight map and the third weight map, weighted summation on the first key value parameter and the second key value parameter, and processing the weighted summation result with a preset activation function to obtain and output the new top view.
21. The method of claim 20, wherein generating the first key name parameter and the first key value parameter corresponding to the adjusted top view comprises:
inputting the adjusted top view into a second target fully connected network, a first Bayesian network and a first activation network in sequence to generate the first key name parameter;
and inputting the adjusted top view into a third target fully connected network and a second Bayesian network in sequence to generate the first key value parameter;
and wherein generating the second key name parameter and the second key value parameter corresponding to the weighted sampling point features comprises:
inputting the weighted sampling point features into a fourth target fully connected network, a third Bayesian network and a second activation network in sequence to generate the second key name parameter;
and inputting the weighted sampling point features into a fifth target fully connected network and a fourth Bayesian network in sequence to generate the second key value parameter.
22. The method of claim 20, wherein generating the second weight map and the third weight map according to the first key name parameter and the second key name parameter comprises:
performing feature concatenation on the first key name parameter and the second key name parameter to obtain a concatenated key name parameter;
inputting the concatenated key name parameter into a sixth target fully connected network and a second logistic regression network in sequence to generate a combined weight map;
and respectively extracting the second weight map and the third weight map from the combined weight map.
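For illustration only (not part of the claims): a sketch of the fusion of claims 20-22 — key name / key value parameters for both branches, a combined weight map from the concatenated key names via a softmax (standing in for the logistic regression network), and a weighted sum of the key values passed through an activation. LayerNorm is used here as a generic stand-in for the claimed Bayesian networks; all layer choices are assumptions.

```python
import torch
import torch.nn as nn


class TopViewFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Key name / key value branches for the adjusted top view (tv) and the
        # weighted sampling point features (sp); channels-last (B, H, W, C) tensors.
        self.key_tv = nn.Sequential(nn.Linear(channels, channels), nn.LayerNorm(channels), nn.ReLU())
        self.val_tv = nn.Sequential(nn.Linear(channels, channels), nn.LayerNorm(channels))
        self.key_sp = nn.Sequential(nn.Linear(channels, channels), nn.LayerNorm(channels), nn.ReLU())
        self.val_sp = nn.Sequential(nn.Linear(channels, channels), nn.LayerNorm(channels))
        self.joint_fc = nn.Linear(2 * channels, 2)
        self.act = nn.ReLU()

    def forward(self, adjusted_top_view, weighted_samples):
        k1, v1 = self.key_tv(adjusted_top_view), self.val_tv(adjusted_top_view)
        k2, v2 = self.key_sp(weighted_samples), self.val_sp(weighted_samples)
        # Combined weight map from the concatenated key names, split into two maps.
        joint = torch.softmax(self.joint_fc(torch.cat([k1, k2], dim=-1)), dim=-1)
        w2, w3 = joint[..., :1], joint[..., 1:]
        return self.act(w2 * v1 + w3 * v2)                 # new top view
```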
23. The method according to any one of claims 1-22, further comprising, after fusing the first top view with the sampling point features to obtain the second top view:
and performing semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
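For illustration only (not part of the claims): a sketch of the semantic segmentation of claim 23 as a per-grid-unit classification head over the second top view; the 1x1 convolution head and the number of classes are assumptions.

```python
import torch.nn as nn


class TopViewSegmentationHead(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, second_top_view):
        # second_top_view: (B, C, H, W); logits per grid unit, then a hard class id.
        logits = self.classifier(second_top_view)
        return logits.argmax(dim=1)          # category identification result per grid unit
```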
24. An image segmentation method comprising:
collecting a plurality of surround-view images at multiple view angles through a plurality of surround-view cameras;
fusing the plurality of surround-view images into a second top view by the image generation method of any one of claims 1-23;
and performing semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
25. The method of claim 24, further comprising, after obtaining the category identification result of each grid unit in the second top view:
and generating driving decision information of the vehicle according to the category identification result of each grid unit in the second top view.
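For illustration only (not part of the claims): one conceivable way driving decision information could be read off the per-grid-unit categories of claim 25; the class index, the ego-position convention, and the decision rule are invented for this sketch and are not specified by the disclosure.

```python
import numpy as np

DRIVABLE = 0   # hypothetical category id for drivable road surface


def simple_driving_decision(class_map: np.ndarray, ego_row: int, lookahead: int = 10) -> str:
    """class_map: (H, W) category ids of the second top view; ego vehicle near row ego_row, centered."""
    center = class_map.shape[1] // 2
    ahead = class_map[max(ego_row - lookahead, 0):ego_row, center - 1:center + 2]
    return "keep_lane" if ahead.size and np.all(ahead == DRIVABLE) else "slow_down"
```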
26. An image generation apparatus comprising:
a multi-scale feature map acquisition module, configured to generate a multi-scale feature map corresponding to each view angle according to a plurality of surround-view images collected at multiple view angles, wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, and the second resolution is higher than the first resolution;
a first top view generation module, configured to perform cross-view conversion on the first feature map at each view angle to generate a first top view;
a sampling module, configured to sample in the second feature map at the at least one second resolution at each view angle to obtain sampling point features of each grid unit in the first top view at the at least one second resolution;
and a second top view fusion module, configured to fuse the first top view with the sampling point features to obtain a second top view.
27. The apparatus of claim 26, wherein the multi-scale feature map acquisition module is configured to:
acquire a plurality of surround-view images respectively collected at the multiple view angles;
and perform multi-scale feature extraction on the surround-view image at each view angle, and take the feature maps of each surround-view image at multiple resolutions as the multi-scale feature map corresponding to that view angle.
28. The apparatus of claim 26, wherein the first top view generation module comprises:
a global feature generation unit, configured to generate a first global feature and a second global feature respectively corresponding to the first feature map at each view angle;
and an iteration generating unit, configured to iteratively generate the first top view by using a multi-head attention mechanism according to the first global features, the second global features, and position coding values describing the positional relationship between the image space and the top view space.
29. The apparatus of claim 28, wherein the global feature generation unit is configured to:
generate the first global features respectively corresponding to the first feature maps at each view angle through a first fully connected network;
and generate the second global features respectively corresponding to the first feature maps at each view angle through a second fully connected network.
30. The apparatus of claim 28, wherein the iteration generating unit comprises:
a target key name determining subunit, configured to determine each target key name parameter applied to the multi-head attention network according to the first global feature corresponding to each view angle, the camera coding value corresponding to each view angle, and a preset pixel position coding value;
a target key value determining subunit, configured to determine each target key value parameter applied to the multi-head attention network according to the second global feature corresponding to each view angle;
and a multi-head attention iteration unit, configured to iteratively generate the first top view by using the multi-head attention network according to each target key name parameter, each target key value parameter, and preset grid unit position coding values in the top view space.
31. The apparatus of claim 30, wherein the multi-head attention iteration unit is configured to:
in each iteration round, acquire the first top view obtained in the previous iteration round as a historical top view;
calculate target query parameters applied to the multi-head attention network according to the historical top view and the grid unit position coding values in the top view space;
and calculate the first top view for the current iteration round by using the multi-head attention network according to each target key name parameter, each target key value parameter, and each target query parameter.
32. The apparatus of claim 26, wherein the sampling module comprises:
a key point selecting unit, configured to acquire geographic area ranges respectively corresponding to each grid unit in the first top view and respectively select a plurality of key points within each geographic area range;
a sampling point feature acquisition unit, configured to project each key point within each geographic area range into the second feature map at the at least one second resolution at each view angle to obtain sampling point features of each key point at the at least one second resolution;
and a sampling point feature aggregation unit, configured to aggregate the sampling point features of all key points belonging to the same grid unit at the same second resolution to obtain the sampling point features of each grid unit in the first top view at the at least one second resolution.
33. The apparatus of claim 32, wherein the key point selecting unit is configured to:
acquire a planar rectangular position range of each grid unit in the first top view;
and form cubic area ranges respectively corresponding to each grid unit according to the planar rectangular position ranges and a preset height value.
34. The apparatus of claim 33, wherein the key point selecting unit is configured to:
select a preset number of spherical neighborhood points as key points within each geographic area range.
35. The apparatus of claim 32, wherein the sampling point feature acquisition unit comprises:
a geographic position coordinate acquisition subunit, configured to acquire geographic position coordinates of a current key point within the geographic area range currently being processed;
a target view angle identification subunit, configured to identify, according to the geographic position coordinates, at least one target view angle from which the current key point can be captured;
and a sampling point feature acquisition subunit, configured to obtain the sampling point features of the current key point at the at least one second resolution according to the current projection position of the current key point in the second feature map at the at least one second resolution at each target view angle.
36. The apparatus of claim 35, wherein the target view angle identification subunit is configured to:
acquire the projection position of the current key point at each view angle according to the geographic position coordinates and the camera projection matrix of each view angle;
and if the projection position of the current key point at a current view angle falls within the image range of the current view angle, determine the current view angle as a target view angle.
37. The apparatus of claim 35, wherein the sampling point feature acquisition subunit is configured to:
if the current projection position of the current key point in a current second feature map at a current second resolution at a current target view angle hits a feature point in the current second feature map, take the feature of that feature point as a candidate feature of the current key point at the current second resolution;
if the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, obtain the feature at the current projection position by interpolation as a candidate feature of the current key point at the current second resolution;
and obtain the sampling point feature of the current key point at the current second resolution according to the candidate features of the current key point acquired at each target view angle.
38. The apparatus of claim 37, wherein the sampling point feature acquisition subunit is further configured to:
if a plurality of candidate features are acquired, pool the candidate features to obtain the sampling point feature of the current key point at the current second resolution.
39. The apparatus of claim 26, wherein the second top view fusion module comprises:
a joint input unit, configured to jointly input the first top view and the sampling point features at the at least one second resolution into a decoding module set;
wherein the decoding module set comprises a set number of decoding modules connected in series, and the number of decoding modules connected in series matches the number of second resolutions;
a series output unit, configured to fuse, by each decoding module, the currently input top view with the sampling point features at the currently input target second resolution to obtain and output a new top view;
and a second top view acquisition unit, configured to acquire the top view output by the last decoding module as the second top view.
40. The apparatus of claim 39, wherein the joint input unit is configured to:
input the first top view into the first decoding module in the decoding module set;
and input the sampling point features at each second resolution into different decoding modules along the series connection direction of the decoding modules, in order of second resolution from low to high.
41. The apparatus of claim 40, wherein the series output unit comprises:
an adjustment output subunit, configured to perform, by each decoding module, scale adjustment on the currently input top view according to the target second resolution to obtain an adjusted top view;
a first weight map generation subunit, configured to generate a first weight map according to the adjusted top view;
a weighted sampling point feature acquisition subunit, configured to weight the sampling point features at the currently input target second resolution according to the first weight map to obtain weighted sampling point features;
and a top view output subunit, configured to fuse the adjusted top view with the weighted sampling point features to obtain and output the new top view.
42. The apparatus of claim 41, wherein the adjustment output subunit is configured to:
perform convolution and interpolation processing on the currently input top view according to the target second resolution to obtain the adjusted top view.
43. The apparatus of claim 41, wherein the first weight map generation subunit is configured to:
input the adjusted top view into a first target fully connected network and a first logistic regression network in sequence to generate the first weight map.
44. The apparatus of claim 41, further comprising a channel number adjustment unit configured to:
after the sampling point features at the currently input target second resolution are weighted according to the first weight map to obtain the weighted sampling point features, adjust the number of feature channels of the weighted sampling point features according to the number of feature channels of the adjusted top view.
45. The apparatus of claim 44, wherein the top view output subunit is configured to:
generate a first key name parameter and a first key value parameter corresponding to the adjusted top view;
generate a second key name parameter and a second key value parameter corresponding to the weighted sampling point features;
generate a second weight map and a third weight map according to the first key name parameter and the second key name parameter;
and perform, according to the second weight map and the third weight map, weighted summation on the first key value parameter and the second key value parameter, and process the weighted summation result with a preset activation function to obtain and output the new top view.
46. An image segmentation apparatus comprising:
a surround-view image acquisition module, configured to collect a plurality of surround-view images at multiple view angles through a plurality of surround-view cameras;
a fusion module, configured to fuse the plurality of surround-view images into a second top view by the image generation method according to any one of claims 1 to 23;
and an identification module, configured to perform semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
47. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-23.
48. A vehicle-mounted terminal, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 24-25.
49. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the image generation method according to any one of claims 1 to 23 or the image segmentation method according to any one of claims 24 to 25.
50. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the image generation method of any one of claims 1 to 23, or the image segmentation method of any one of claims 24 to 25.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310010749.3A CN115909255B (en) | 2023-01-05 | 2023-01-05 | Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115909255A (en) | 2023-04-04 |
CN115909255B CN115909255B (en) | 2023-06-06 |
Family
ID=86471232
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310010749.3A Active CN115909255B (en) | 2023-01-05 | 2023-01-05 | Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115909255B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118071969A (en) * | 2024-04-25 | 2024-05-24 | 山东金东数字创意股份有限公司 | Method, medium and system for generating XR environment background in real time based on AI |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200167941A1 (en) * | 2018-11-27 | 2020-05-28 | GM Global Technology Operations LLC | Systems and methods for enhanced distance estimation by a mono-camera using radar and motion data |
CN111986234A (en) * | 2020-08-20 | 2020-11-24 | 各珍珍 | Self-adaptive resolution ratio livestock video information processing method based on artificial intelligence |
CN112750076A (en) * | 2020-04-13 | 2021-05-04 | 奕目(上海)科技有限公司 | Light field multi-view image super-resolution reconstruction method based on deep learning |
CN113283525A (en) * | 2021-06-07 | 2021-08-20 | 郑健青 | Image matching method based on deep learning |
CN113570608A (en) * | 2021-06-30 | 2021-10-29 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
WO2022155899A1 (en) * | 2021-01-22 | 2022-07-28 | 深圳市大疆创新科技有限公司 | Target detection method and apparatus, movable platform, and storage medium |
CN114913498A (en) * | 2022-05-27 | 2022-08-16 | 南京信息工程大学 | Parallel multi-scale feature aggregation lane line detection method based on key point estimation |
CN115187454A (en) * | 2022-05-30 | 2022-10-14 | 元潼(北京)技术有限公司 | Multi-view image super-resolution reconstruction method and device based on meta-imaging |
CN115203460A (en) * | 2022-07-05 | 2022-10-18 | 中山大学·深圳 | Deep learning-based pixel-level cross-view-angle image positioning method and system |
Non-Patent Citations (1)
Title |
---|
张蕊 (Zhang Rui) et al.: "A Survey of Deep-Learning-Based Scene Segmentation Algorithms", Journal of Computer Research and Development, no. 4, pages 859-875 *
Also Published As
Publication number | Publication date |
---|---|
CN115909255B (en) | 2023-06-06 |
Similar Documents
Publication | Title |
---|---|
CN110458939B (en) | Indoor scene modeling method based on visual angle generation | |
JP6745328B2 (en) | Method and apparatus for recovering point cloud data | |
CN112862874B (en) | Point cloud data matching method and device, electronic equipment and computer storage medium | |
CN112990010B (en) | Point cloud data processing method and device, computer equipment and storage medium | |
US11875424B2 (en) | Point cloud data processing method and device, computer device, and storage medium | |
CN115457531A (en) | Method and device for recognizing text | |
CN115880555B (en) | Target detection method, model training method, device, equipment and medium | |
CN113378756B (en) | Three-dimensional human body semantic segmentation method, terminal device and storage medium | |
CN111310821A (en) | Multi-view feature fusion method, system, computer device and storage medium | |
CN113795867A (en) | Object posture detection method and device, computer equipment and storage medium | |
CN112270246B (en) | Video behavior recognition method and device, storage medium and electronic equipment | |
CN117079163A (en) | Aerial image small target detection method based on improved YOLOX-S | |
CN114332977A (en) | Key point detection method and device, electronic equipment and storage medium | |
CN115880435A (en) | Image reconstruction method, model training method, device, electronic device and medium | |
CN114565916A (en) | Target detection model training method, target detection method and electronic equipment | |
CN115909255B (en) | Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium | |
CN116309983A (en) | Training method and generating method and device of virtual character model and electronic equipment | |
CN113932796B (en) | High-precision map lane line generation method and device and electronic equipment | |
Qin et al. | Depth estimation by parameter transfer with a lightweight model for single still images | |
CN117237547B (en) | Image reconstruction method, reconstruction model processing method and device | |
Feng et al. | Lightweight detection network for arbitrary-oriented vehicles in UAV imagery via precise positional information encoding and bidirectional feature fusion | |
CN104463962A (en) | Three-dimensional scene reconstruction method based on GPS information video | |
CN114283152A (en) | Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium | |
CN113139110A (en) | Regional feature processing method, device, equipment, storage medium and program product | |
CN113505834A (en) | Method for training detection model, determining image updating information and updating high-precision map |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |