CN115909255B - Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium


Info

Publication number: CN115909255B
Authority: CN (China)
Legal status: Active
Application number: CN202310010749.3A
Other languages: Chinese (zh)
Other versions: CN115909255A
Inventors: 龚石, 叶晓青, 蒋旻悦, 谭啸, 王海峰
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310010749.3A
Publication of CN115909255A
Application granted
Publication of CN115909255B

Landscapes

  • Image Processing (AREA)

Abstract

The disclosure provides image generation and image segmentation methods, apparatuses, devices, a vehicle-mounted terminal and a medium, relating to the technical field of artificial intelligence, in particular to computer vision, image processing, deep learning and the like, and applicable to scenes such as automatic driving and smart cities. The specific implementation scheme is as follows: generating a multi-scale feature map corresponding to each view angle according to looking-around images acquired under multiple view angles; performing cross-view conversion on the first feature map under each view angle to generate a first top view; sampling in the second feature map under at least one second resolution at each view angle to obtain the sampling point feature of each grid cell in the first top view under the at least one second resolution; and fusing the first top view and the sampling point features to obtain a second top view. According to this technical scheme, a high-precision top view can be generated at low computational cost.

Description

Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and can be applied to scenes such as automatic driving, smart cities and the like, and particularly relates to an image generation method, an image segmentation method, an image generation device, an image segmentation device, electronic equipment, a vehicle-mounted terminal and a non-transient computer readable storage medium.
Background
The task of perceptual recognition in automatic driving is essentially a three-dimensional geometric reconstruction of the physical world. As the variety and number of sensors mounted on autonomous vehicles grow, it becomes critical to characterize the different viewing angles with a unified representation.
Bird's Eye View (BEV), also known as top view, is becoming increasingly popular as a natural and straightforward unified representation in the field of perception and prediction for automatic driving.
In the related art, obtaining a high-resolution bird's eye view requires expensive computation, which cannot satisfy the dual requirements of low cost and real-time perception.
Disclosure of Invention
The present disclosure provides an image generation method, an image segmentation method, an image generation apparatus, an image segmentation apparatus, an electronic device, a vehicle-mounted terminal, and a non-transitory computer-readable storage medium.
According to an aspect of the present disclosure, there is provided an image generating method including:
generating a multi-scale feature map corresponding to each view angle according to the all-around image acquired under the multi-view angle, wherein the multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
Performing cross-view conversion on the first feature map under each view to generate a first top view;
sampling in the second characteristic diagram under at least one second resolution of each view to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view;
and according to the first top view and the characteristics of each sampling point, obtaining a second top view through fusion.
According to another aspect of the present disclosure, there is provided an image segmentation method including:
collecting a plurality of looking-around images under a plurality of visual angles through a plurality of looking-around cameras;
fusing a plurality of looking-around images to obtain a second top view through the image generation method according to any embodiment of the disclosure;
and carrying out semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
According to another aspect of the present disclosure, there is provided an image generating apparatus including:
the multi-scale feature map acquisition module is used for generating multi-scale feature maps corresponding to each view angle respectively according to a plurality of view-around images acquired under the multi-view angles, wherein the multi-scale feature maps comprise a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
The first top view generation module is used for performing cross-view conversion on the first feature map under each view angle to generate a first top view;
the sampling module is used for sampling in the second characteristic diagram under at least one second resolution of each view angle to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view;
and the second top view fusion module is used for fusing the first top view and the characteristics of each sampling point to obtain a second top view.
According to another aspect of the present disclosure, there is provided an image segmentation apparatus including:
the looking-around image acquisition module is used for acquiring a plurality of looking-around images under multiple visual angles through a plurality of looking-around cameras;
the fusion module is used for fusing the plurality of looking-around images to obtain a second top view through the image generation method according to any embodiment of the disclosure;
the identification module is used for carrying out semantic segmentation on the second top view and obtaining a category identification result of each grid unit in the second top view.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image generation method as set forth in any one of the present disclosure.
According to another aspect of the present disclosure, there is provided a vehicle-mounted terminal including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image segmentation method as set forth in any one of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image generation method of any one of the present disclosure, or to perform the image segmentation method of any one of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of another image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 2b is a flow chart for iteratively generating a first top view using a multi-headed attention mechanism as applicable to embodiments of the present disclosure;
FIG. 3 is a schematic diagram of yet another image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 4a is a schematic diagram of yet another image generation method provided in accordance with an embodiment of the present disclosure;
FIG. 4b is a schematic concatenation diagram of a set of decoding modules for use with embodiments of the present disclosure;
FIG. 4c is a schematic diagram of the logic operations within a decoding module to which embodiments of the present disclosure are applicable;
FIG. 5a is a schematic diagram of an image segmentation method according to an embodiment of the present disclosure;
FIG. 5b is a functional block diagram of an image segmentation method that may be implemented in accordance with an embodiment of the present disclosure;
fig. 6 is a block diagram of an image generating apparatus provided according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an image segmentation apparatus provided according to an embodiment of the present disclosure;
Fig. 8 is a block diagram of an electronic device for implementing an image generation method of an embodiment of the present disclosure, or a block diagram of a vehicle-mounted terminal for implementing an image segmentation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of an image generation method provided according to an embodiment of the present disclosure. The embodiments of the present disclosure may be applied to a case where a top view in a top view angle is generated using a plurality of looking-around images acquired in a looking-around angle. The method may be performed by an image generating device, which may be implemented in hardware and/or software, and may be generally integrated in a terminal or server having a data processing function.
As shown in fig. 1, an image generating method provided in an embodiment of the present disclosure includes the following specific steps:
S110, generating a multi-scale feature map corresponding to each view angle according to the all-around view image acquired under the multi-view angle.
Wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution.
In this embodiment, the looking-around image is a plurality of images acquired under a plurality of looking-around angles, where the plurality of looking-around angles may be understood as selecting a plurality of view points in a continuous direction such as front, rear, left, right, etc. of a set fixed point to take a photograph, so as to acquire a plurality of looking-around images equivalent to 360 degrees looking-around from the fixed point. Correspondingly, under a visual angle, a looking-around image with a set looking-around range can be acquired.
The multi-scale feature map can be understood as extracting multi-scale image features of a view-around image to obtain feature maps with different resolutions.
Wherein the multi-scale feature map may be further subdivided into a first feature map at a first resolution and at least one second feature map at a second resolution. The first resolution may be understood as the lowest resolution among the resolutions included in the multi-scale feature map. The second resolutions may be understood as all the resolutions above the lowest resolution among the resolutions contained in the multi-scale feature map.
In one example, if the resolution of the looking-around image a is 768×768, the multi-scale feature map corresponding to the looking-around image a may include the feature map 1 with the resolution of 512×512, the feature map 2 with the resolution of 256×256 and the feature map 3 with the resolution of 32×32.
In this example, the first resolution is 32×32, the first feature map is feature map 3, the second resolution is 512×512 and 256×256, and the second feature map is feature map 1 and feature map 2.
It will be appreciated that if the multi-scale feature map includes only two feature maps at two resolutions, then the second resolution and the second feature map at the second resolution are each unique. If the multi-scale feature map includes three or more feature maps at three or more resolutions, the number of second resolutions and the number of second feature maps at the second resolutions are both plural, namely the total number of resolutions included in the multi-scale feature map minus 1.
S120, performing cross-view conversion on the first feature map under each view angle to generate a first top view.
The first plan view is understood to be an image acquired from a plan view, and the above-described fixed point is taken as an example, and the first plan view is understood to be an image formed when image acquisition is performed from the upper portion of the fixed point.
In this embodiment, the top view may be understood as a grid map divided into a plurality of grid cells, that is, a plurality of lattices in the grid map, each grid cell corresponding to a set geographical position range. The grid map may be constructed with the position of the fixed point as the center point, or may be constructed according to actual longitude and latitude information. Each grid cell contains the image features of one or more objects that appear within the geographical position range matching that grid cell.
In an actual application scene, it is difficult to directly obtain the first top view, so that the first top view can be obtained by performing cross-view conversion on image features of the panoramic image under multiple views. Specifically, the first top view may be generated by inverse perspective transformation or cross-view learning, or the like.
Any of the above methods for generating the first top view involves a relatively heavy amount of computation, and the amount of computation increases significantly as the resolution of the feature maps of the looking-around images to be used increases. In this embodiment, in order to reduce the amount of computation to the greatest extent, the first top view is generated using the first feature map with the lowest resolution at each of the plurality of viewing angles. Since the resolution of the first feature maps used for generating the first top view is not high, the resolution of the finally formed first top view is not high either, and it is generally difficult to satisfy practical application requirements.
By way of example and not limitation, assuming that 4 looking-around images are acquired under four viewing angles respectively, a first feature map of a first resolution is extracted from each looking-around image respectively, and a first top view matched with the first resolution can be obtained by performing cross-viewing angle conversion on the first feature maps of the 4 different viewing angles.
In other words, the operation of S120 is performed in a manner that sacrifices the accuracy of the first top view, thereby greatly saving the calculation cost. Therefore, the accuracy of the first top view needs to be compensated in a certain way later.
S130, in the second characteristic diagram under at least one second resolution of each view angle, sampling to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view.
In this embodiment, the accuracy compensation is performed on the first top view by selecting and using a higher-accuracy feature map obtained after multi-scale feature extraction, that is, one or more second feature maps at the second resolution at each view angle.
After the first top view is obtained, geographic position range information corresponding to each grid cell can be determined, and then, through the known mapping relationship between different view angles and the geographic position range, the characteristics of each grid cell in the first top view in the second feature map, namely, the characteristics of sampling points, are determined.
The feature of the sampling point may be understood as a feature of a certain image point in the second feature map, or a feature of an interpolated image point obtained by interpolating the second feature map. Accordingly, a sample point feature may be understood as a high resolution feature having a resolution higher than the resolution of the first top view.
By way of example and not limitation, assuming that 4 looking-around images are acquired at four viewing angles, and two second feature maps at second resolutions are extracted from each looking-around image, for example, a second feature map 1 at a second resolution of 512×512 and a second feature map 2 at a second resolution of 256×256, then sampling on the 4×2 = 8 second feature maps (second feature map 1 and second feature map 2) may respectively obtain one or more sampling point features of each grid cell in the first top view at the second resolution 512×512, and one or more sampling point features of each grid cell in the first top view at the second resolution 256×256.
It can be understood that, since the sampling process of the sampling point feature is simple to calculate and can be obtained only by simple projection mapping or interpolation, the calculation amount of the process is small and the process is easy to obtain.
And S140, according to the first top view and the characteristics of each sampling point, fusing to obtain a second top view.
In this embodiment, after the first top view with low resolution is acquired, one or more high-precision feature points with at least one high resolution are acquired for each grid cell of the first top view. Further, the first plan view with the first resolution and the high-resolution sampling point features can be fused to obtain a second plan view with a high resolution.
It will be appreciated that the number of grid cells included in the first top view and the second top view, and the geographical location range represented by each grid cell are the same, and the difference between them is that the resolution of the grid cell feature in each grid cell in the second top view is higher, and is closer to the image feature acquired from the actual top view perspective.
The second top view may be obtained by fusing a fixed weight or a dynamic weight, which is not limited in this embodiment.
According to the technical scheme of this embodiment, a multi-scale feature map corresponding to each view angle is generated from the looking-around images acquired under multiple view angles, and the first feature map under each view angle is converted across view angles to generate a first top view; sampling is performed in the second feature map under at least one second resolution at each view angle to obtain the sampling point feature of each grid cell in the first top view under the at least one second resolution; and the first top view and the sampling point features are fused to obtain a second top view. In this way, the first top view, whose cross-view conversion is computationally expensive, is generated from the low-resolution feature maps, and the high-resolution features, which can be obtained at low computational cost, are then fused with the first top view to obtain a high-resolution second top view. The resolution of the top view can thus be effectively improved with a low computational cost.
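By way of illustration only, the overall flow of S110 to S140 can be outlined in code. The following is a minimal sketch assuming hypothetical callables (extractor, cross_view, sampler, fusion) and feature-map names {f3, f4, f5}; none of these names are prescribed by the present disclosure.

```python
# Hypothetical outline of S110-S140; all module names are illustrative assumptions.
def generate_second_top_view(surround_images, extractor, cross_view, sampler, fusion):
    # S110: multi-scale feature maps per view angle;
    # feats[v] = {"f3": ..., "f4": ..., "f5": ...}, where f5 has the lowest (first) resolution
    feats = [extractor(img) for img in surround_images]

    # S120: cross-view conversion of the lowest-resolution maps into a first top view
    first_top_view = cross_view([f["f5"] for f in feats])

    # S130: per-grid-cell sampling point features from the higher-resolution maps
    samples = {
        level: sampler(first_top_view, [f[level] for f in feats])
        for level in ("f4", "f3")  # second resolutions, from low to high
    }

    # S140: fuse the first top view with the sampling point features
    return fusion(first_top_view, samples)
```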
On the basis of the above embodiments, generating a multi-scale feature map corresponding to each view angle according to the panoramic image acquired under the multi-view angle may include:
acquiring a plurality of panoramic images respectively acquired under multiple visual angles;
and respectively carrying out multi-scale feature extraction on the looking-around images under each view angle, and obtaining feature images of each looking-around image under a plurality of resolutions as multi-scale feature images respectively corresponding to each view angle.
In an alternative embodiment, a Multi-Scale Feature Extractor (MSFE) may be pre-trained, and the MSFE is responsible for extracting the multi-scale feature maps from the looking-around image at each view angle. Typically, these feature maps can be represented as {fi}, where the subscript i is an integer greater than 1, and the resolution of the feature map is obtained by downsampling the looking-around image by a factor of 2^i; for example, when i = 5, the resolution of the feature map {f5} is 1/32 of the image resolution of the looking-around image.
In an alternative implementation manner of this embodiment, the value of i includes three of 3,4 and 5. Wherein { f5} corresponds to a first feature map, and { f3} and { f4} correspond to two second feature maps at two different second resolutions.
The MSFE may have different designs, such as a backbone network ResNet from a classification model, or an attention-based Vision Transformer network (ViT) or Pyramid Vision Transformer (PVT), which is not limited in this embodiment.
Through the arrangement, the characteristic diagrams under a plurality of resolutions can be simply and conveniently extracted at one time aiming at a single looking-around image, and meanwhile, the requirements of subsequently generating a first top view and collecting the characteristics of sampling points of each grid unit in the first top view under at least one second resolution are met.
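By way of example and not limitation, a multi-scale feature extractor of this kind can be sketched as follows, assuming a torchvision ResNet-18 backbone and strides of 8, 16 and 32 for {f3}, {f4} and {f5}; the backbone choice and the dictionary-style output are illustrative assumptions only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MSFE(nn.Module):
    """Sketch of a multi-scale feature extractor returning {f3, f4, f5}."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2   # stride 4, 8
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4   # stride 16, 32

    def forward(self, image):                 # image: (B, 3, H, W)
        x = self.layer2(self.layer1(self.stem(image)))
        f3 = x                                # 1/8 resolution  -> a second feature map
        f4 = self.layer3(f3)                  # 1/16 resolution -> a second feature map
        f5 = self.layer4(f4)                  # 1/32 resolution -> the first feature map
        return {"f3": f3, "f4": f4, "f5": f5}

# Usage: feats = MSFE()(torch.randn(1, 3, 768, 768))
```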
Fig. 2a is a flowchart of another image generation method provided in accordance with an embodiment of the present disclosure. In this embodiment, the operation of "converting the first feature map at each view angle across the view angles to generate the first top view" is refined.
As shown in fig. 2a, an image generating method provided in an embodiment of the present disclosure includes the following specific steps:
s210, generating a multi-scale feature map corresponding to each view angle according to the all-around view image acquired under the multi-view angle.
Wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution.
S220, generating a first global feature and a second global feature which correspond to the first feature map under each view.
The global features refer to features obtained by extracting global information in the feature map, the global features generally refer to overall attributes of the image, and common global features can include color features, texture features, shape features, such as intensity histograms, and the like.
In this embodiment, in order to better form the first top view, first, in the first feature map at each view angle, the first global feature and the second global feature are extracted respectively. That is, if three first feature maps under three perspectives are acquired in total, one first global feature and one second global feature may be generated for each first feature map, respectively, that is, three first global features and three second global features are obtained for three first feature maps in total.
In this embodiment, the first global feature and the second global feature are generated differently because it is considered that not all the features at all positions in each first feature map have the same weight when the first top view is finally generated. It is contemplated that the image features at specific locations should have a greater (or lesser) weight when generating the features of the specific grid cells in the first top view, and in order to mine the weight information, two types of global features, i.e., the first global feature and the second global feature, may be generated separately for the same first feature map.
Accordingly, the second global feature may be used to synthesize the first top view, and the first global feature may be used to describe the weight values of the features at different locations in the second global feature.
Alternatively, two pre-trained global feature extractors may be constructed to extract the first global feature and the second global feature in the first feature map at each view angle, respectively. Wherein the extraction of the first global feature and the second global feature may be performed using the fully connected network, considering that the fully connected network is the simplest global feature extractor.
Accordingly, generating the first global feature and the second global feature corresponding to the first feature map at each view may include:
generating first global features corresponding to the first feature diagrams under each view through a first full-connection network;
and generating second global features corresponding to the first feature diagrams under each view through a second fully connected network.
As described above, the first global feature and the second global feature have different roles in the process of generating the first top view, so that the fully connected networks with the same structure can be trained respectively based on different training targets, so as to obtain the first fully connected network and the second fully connected network which meet the actual requirements.
Through the arrangement, two types of global features corresponding to the first feature map under each view angle respectively, namely the first global feature and the second global feature, can be obtained simply and conveniently, and further, the first top view meeting the requirements can be accurately generated by using the first global feature and the second global feature.
S230, iteratively generating a first top view by adopting a Multi-Head Attention (MHA) mechanism according to each first global feature, each second global feature and a position coding value for describing the position relation between an image space and a top view space.
In this embodiment, in order to realize the conversion across viewing angles, it is necessary to specify the mapping relationship between the looking-around viewing angle and the looking-down viewing angle. The above-described mapping relationship may be described by one or more position-coded values for describing a positional relationship between the image space and the top view space.
The position code value may include one or more of a camera code value, a preset pixel position code value, and a grid cell position code value in a top view space, which correspond to each view angle, respectively.
After the position coding values are obtained, the second global features under each view angle can be mapped into different grid cells in the first top view according to the weights determined by the first global features.
Wherein, to generate a more accurate first top view, embodiments of the present disclosure may employ a multi-head attention network based on a multi-head attention mechanism, the first top view being generated in multiple iterations.
In this embodiment, since the first plan view is generated using the multi-head attention network, three important parameters, namely, a Key name parameter (Key), a Key Value parameter (Value) and a Query parameter (Query), which are required to be used by the multi-head attention network need to be determined.
Aiming at the application scene for generating the first top view, the three parameters are required to be endowed with actual physical meaning, and the target key name parameter, the target key value parameter and the target query parameter are obtained.
The target key value parameter is used for generating a first top view, the target key name parameter is used for describing weight values of all features in the target key value parameter when the first top view is generated, and the target query parameter is the first top view obtained under each iteration round. By performing a number of iterations (e.g., 3, 4, 5, etc.) based on the multi-headed attention network, a first top view meeting a preset accuracy requirement may be ultimately output.
A flow chart for iteratively generating a first top view using a multi-headed attention mechanism, to which embodiments of the present disclosure are applicable, is shown in fig. 2 b.
As shown in fig. 2b, the manner of iteratively generating the first top view by using the multi-head attention mechanism according to each first global feature, each second global feature and the position coding value for describing the position relationship between the image space and the top view space may include:
s2301, determining each target key name parameter applied to the multi-head attention network according to the first global feature corresponding to each view, the camera coding value corresponding to each view and the preset pixel position coding value.
For each view angle, the sum of the first global feature corresponding to the view angle, the camera coding value corresponding to the view angle and the preset pixel position coding value can be used as a target key name parameter corresponding to the view angle.
In a specific example, assume that the first feature map of the looking-around image at a set view angle X is {f5}, and the first feature map is processed by the first fully connected network FCk to obtain the first global feature FCk(f5). When the looking-around image under the view angle X is acquired, assuming that the looking-around camera X1 is used, the camera code value PE1 corresponding to the view angle X is the code value corresponding to the looking-around camera X1; the camera code value PE1 may be a randomly initialized learnable vector obtained through pre-training.
The pixel position encoded value PE2 may be a predetermined trigonometric function encoded value. Optionally, the pixel position encoded value PE2 has the same spatial dimension as the first feature map {f5}, for example, 32 x 32.
Accordingly, the target key name parameter corresponding to the view angle X is K = FCk(f5) + PE1 + PE2. It can be understood that N (N > 1) view angles correspond to N first global features, N camera code values and the same pixel position code value, so that N target key name parameters can be obtained in total, corresponding to the N view angles.
S2302, determining each target key value parameter applied to the multi-head attention network according to the second global features corresponding to each view.
In an optional implementation manner of this embodiment, the second global features corresponding to each view may be directly used as a target key value parameter.
In the previous example, assuming that the first feature map of the looking-around image at the view angle X is {f5}, the first feature map is processed by the second fully connected network FCv to obtain the second global feature FCv(f5), and the target key value parameter corresponding to the view angle X can then be directly set as V = FCv(f5).
Correspondingly, N (N > 1) views correspond to N second global features, and further N target key value parameters can be obtained altogether, so as to correspond to N views.
S2303, under the current iteration round, acquiring a first top view obtained by previous iteration round as a historical top view.
In the present embodiment, the operations of S2303 to S2305 are performed separately at each iteration round. After each iteration round is executed, a first top view under the iteration round can be obtained.
Accordingly, under the first iteration round, since there is no first top view obtained by the previous iteration round, an initialization vector (for example, may be an all-zero vector) may be constructed as the first historical top view. Starting from the second iteration round, the first top view obtained in the previous iteration round can be obtained as a historical top view.
In a specific example, at the iter-th iteration round, the first top view Q_{iter-1} of the (iter-1)-th iteration round may be obtained as the historical top view.
S2304, calculating to obtain the target query parameters applied to the multi-head attention network according to the grid cell position coding values in the history top view and the top view space.
The grid cell position code value in the top view space may be a predetermined trigonometric function code value corresponding to each grid cell position in the first top view.
Alternatively, the sum of the grid cell position encoded values PE3 in the historic top view and top view space may be used as the target query parameter Q applied in the multi-head attention network.
In one specific example, at the iter-th iteration round: Q = PE3 + Q_{iter-1}.
In the previous example, N (N > 1) first feature graphs under N viewing angles all correspond to the same target query parameter.
S2305, calculating to obtain a first top view under the current iteration round by adopting a multi-head attention network according to each target key name parameter, each target key value parameter and the target query parameter.
In this embodiment, at the iter-th iteration round, the N target key name parameters, the N target key value parameters and the unique target query parameter under the N view angles may be jointly input into the multi-head attention network to obtain the first top view Q_iter under the current iteration round.
Alternatively, the calculation formula of the first top view at the iter-th iteration round may be:
Q_iter = MHA(K, V, Q)
where MHA(·) is the conversion function executed by the multi-head attention network, K is the target key name parameter, V is the target key value parameter, and Q is the target query parameter.
S2306, judging whether the current iterated times reach a preset iterated times threshold: if yes, executing S2307; otherwise, execution returns to S2303 to start a new iteration round.
In this embodiment, at each iteration round, the second global features, weighted by the first global features at each view angle, are converted into the grid cell features of each grid cell in the first top view. Although the first top view is initialized as an all-zero vector, after multiple iterations the grid cell features of each grid cell in the first top view become closer to the top view features that would be acquired in the real world.
The iteration number threshold may be understood as the total number of iteration rounds, that is, the number of times the operations of S2303 to S2305 are repeatedly executed. It will be appreciated that the greater the iteration number threshold, the higher the accuracy of the first top view, but the greater the computational cost required; therefore, one skilled in the art can choose an iteration number threshold satisfying both the accuracy and the computational cost requirements according to the actual situation.
S2307, ending the iterative process and outputting the first top view.
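By way of illustration only, the iterative procedure of S2301 to S2307 can be sketched with torch.nn.MultiheadAttention as follows. The feature dimension, number of heads, number of grid cells and the concatenation of the per-view key name and key value parameters into a single sequence are assumptions made for the sketch, since the disclosure does not fix these implementation details.

```python
import torch
import torch.nn as nn

class CrossViewConverter(nn.Module):
    """Sketch of iteratively generating the first top view with multi-head attention."""
    def __init__(self, dim=256, n_views=4, n_cells=50 * 50, hw=32 * 32, n_heads=8, n_iters=4):
        super().__init__()
        self.fc_k = nn.Linear(dim, dim)                        # first fully connected network (FCk)
        self.fc_v = nn.Linear(dim, dim)                        # second fully connected network (FCv)
        self.pe1 = nn.Parameter(torch.randn(n_views, 1, dim))  # learnable camera code value per view
        self.register_buffer("pe2", torch.randn(1, hw, dim))   # pixel position code value (e.g. sinusoidal)
        self.register_buffer("pe3", torch.randn(n_cells, dim)) # grid-cell position code in top-view space
        self.mha = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.n_iters = n_iters

    def forward(self, f5):                       # f5: (n_views, hw, dim) flattened first feature maps
        k = self.fc_k(f5) + self.pe1 + self.pe2  # target key name parameters, one per view angle
        v = self.fc_v(f5)                        # target key value parameters, one per view angle
        k = k.reshape(1, -1, k.shape[-1])        # concatenate all views into one key sequence
        v = v.reshape(1, -1, v.shape[-1])
        q_iter = f5.new_zeros(1, self.pe3.shape[0], f5.shape[-1])  # first historical top view: all zeros
        for _ in range(self.n_iters):                               # S2303-S2305 repeated
            q = self.pe3.unsqueeze(0) + q_iter                      # target query parameter
            q_iter, _ = self.mha(q, k, v)                           # Q_iter = MHA(K, V, Q)
        return q_iter                            # first top view features, one row per grid cell
```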
S240, in the second characteristic diagram under at least one second resolution of each view angle, sampling to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view.
S250, according to the first top view and the characteristics of each sampling point, fusing to obtain a second top view.
According to the technical scheme of this embodiment, the first global feature and the second global feature corresponding to the first feature map under each view angle are generated, and the first top view is generated iteratively according to the first global features, the second global features and the position coding values describing the positional relationship between the image space and the top view space. By applying the mature data processing mechanism of the multi-head attention mechanism, a first top view meeting the requirements can be generated at a small computational cost; meanwhile, through multiple iterations, first top views meeting different precision requirements in different application scenarios can be generated simply by adjusting the iteration number threshold. The implementation is simple and highly flexible.
On the basis of the foregoing embodiments, after the second top view is obtained by fusing the first top view and the features of each sampling point, the method may further include:
and carrying out semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
In this embodiment, after the second top view with high resolution is obtained, the second top view may be subjected to semantic segmentation to obtain a category identification result in each grid cell in the second top view.
Alternatively, the category recognition result within each grid cell in the second top view may be generated by a pre-trained semantic segmentation network. The category identification result of each grid cell in the second top view may specifically include: a class probability map with C channels (C > 1) output for each grid cell of the second top view.
Wherein the channel number C indicates that the semantic segmentation network can identify C different types of objects. By way of example and not limitation, when C = 3, the semantic segmentation network can identify three different types of objects: person, vehicle and building. Further, for each grid cell in the second top view, the semantic segmentation network may output a class probability map such as: person: 0.1, vehicle: 0.88, building: 0.02.
In this embodiment, the semantic segmentation network may be formed by sequentially connecting a first fully connected layer (FC), a normalization layer (typically BatchNorm), an activation layer (typically ReLU), a second fully connected layer (FC) and a logistic regression layer (typically Sigmoid) in series. Of course, the semantic segmentation network may also be set according to actual requirements, which is not limited in this embodiment.
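By way of illustration only, such a per-grid-cell segmentation head can be sketched as follows; the feature dimension of 256 and C = 3 classes are assumptions.

```python
import torch.nn as nn

def build_seg_head(dim=256, num_classes=3):
    """Sketch of the segmentation head described above:
    FC -> BatchNorm -> ReLU -> FC -> Sigmoid; dims are illustrative assumptions."""
    return nn.Sequential(
        nn.Linear(dim, dim),          # first fully connected layer
        nn.BatchNorm1d(dim),          # normalization layer
        nn.ReLU(),                    # activation layer
        nn.Linear(dim, num_classes),  # second fully connected layer
        nn.Sigmoid(),                 # logistic regression layer -> per-class probability
    )

# Usage: probs = build_seg_head()(cell_features)   # cell_features: (num_grid_cells, 256)
```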
Through the arrangement, various objects included in the surrounding scene can be timely identified in the second top view with high resolution, and further, decision information matched with the identification result can be better generated in the automatic driving or intelligent monitoring field, so that the actual demands of users are met.
Fig. 3 is a schematic diagram of still another image generating method according to an embodiment of the present disclosure. In this embodiment, the operation of "sampling, in the second feature map under at least one second resolution at each view angle, to obtain the sampling point feature of each grid cell in the first top view under at least one second resolution" is further refined.
As shown in fig. 3, an image generating method provided in an embodiment of the present disclosure includes the following specific steps:
s310, generating a multi-scale feature map corresponding to each view angle according to the all-around view image acquired under the multi-view angle.
Wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution.
S320, performing cross-view conversion on the first feature map under each view angle to generate a first top view.
S330, obtaining geographic area ranges corresponding to each grid unit in the first top view, and respectively selecting a plurality of key points in each geographic area range.
In this embodiment, considering that the looking-around image at each view angle can only capture the graphic features of a particular grid cell or cells in the top view, a plurality of key points may be selected within each grid cell of the first top view, taking the grid cell as a unit, and these key points are then back-projected into the second feature maps, so that the sampling point feature of each grid cell in the first top view under at least one second resolution can be obtained according to the mapping positions of the key points in the second feature maps.
As described above, each grid cell in the first top view corresponds to a geographical area range set in the geographic space, and the geographical area range may be a three-dimensional region with a set shape or a planar region with a set shape. In this embodiment, in order to ensure the dispersion of the sampling points, a preset height value may be used to construct a three-dimensional shape (for example, a cylinder, a cone or a cube) corresponding to each grid cell as the geographical area range of that grid cell.
Accordingly, in an optional implementation manner of this embodiment, obtaining the geographical area range corresponding to each grid cell in the first top view may include:
acquiring a plane rectangular position range of each grid unit in the first top view; and forming a cube region range corresponding to each grid unit according to the position range of each plane rectangle and the preset height value.
In the present embodiment, since the first plan view is a mesh map, the shape of each mesh unit is a plane rectangle, and therefore, the plane rectangle for dividing each mesh unit, or an inscribed rectangle slightly smaller than the plane rectangle, may be directly used as the plane rectangle position range of each mesh unit in the first plan view. Thereafter, in order to secure the dispersibility of different sampling points, a cube region range corresponding to each grid cell is constructed based on a preset height value (e.g., 3 meters, 4 meters, or 5 meters, etc.).
Through the arrangement, the dispersibility of the collection of the characteristics of the sampling points can be guaranteed, the sampling effect is improved, and then, the fusion effect and the resolution ratio of the second top view can be improved.
After the geographical area range corresponding to each grid unit is acquired, a plurality of key points for projection are selected in each geographical area range. Alternatively, the above-mentioned plurality of key points may be selected by selecting each corner point in the range of the cube region, or by randomly selecting points in the range of the whole cube region, which is not limited in this embodiment.
In an optional implementation manner of this embodiment, selecting a plurality of key points in each geographical area respectively may include:
and selecting a preset number of spherical neighborhood points as key points in each geographical area range.
The advantages of this arrangement are: by arranging a sphere in the cube corresponding to each geographical area range and selecting a plurality of neighborhood points (namely spherical neighborhood points) corresponding to the sphere center (namely grid cell center points) in the built-in sphere, a plurality of key points can be determined at the same distance from the grid cell center points as fairly as possible, and further the high-resolution sampling point characteristics of the key points can be used for enhancing the resolution of the first top view and guaranteeing the fusion effect of the second top view.
And S340, projecting each key point in each geographical area range into a second characteristic diagram under at least one second resolution under each view angle, and obtaining the sampling point characteristic of each key point under at least one second resolution.
In the present embodiment, it is assumed that the multi-scale feature map generated for each view angle includes a plurality of second feature maps at the second resolution. For example, the second feature map 1 with the second resolution of 512×512 and the second feature map 2 with the second resolution of 256×256. After obtaining the plurality of key points corresponding to each grid unit, each key point can be respectively projected into the characteristic diagram 1 corresponding to each view angle and the characteristic diagram 2 corresponding to each view angle, so as to obtain the sampling point characteristic of each key point for all view angles under at least one second resolution.
Projecting each key point in each geographical area range to a second feature map under at least one second resolution under each view angle to obtain a sampling point feature of each key point under at least one second resolution, which may include:
obtaining geographic position coordinates of a current key point in a current processing geographic area range; identifying at least one target view angle capable of shooting a current key point according to the geographic position coordinates; and obtaining the sampling point characteristic of the current key point under at least one second resolution according to the current projection position of the current key point in the second characteristic diagram under at least one second resolution under each target view angle.
Through the arrangement, the target view angle of each key point can be rapidly positioned, and then the sampling point characteristic of each key point under each second resolution is accurately acquired according to the second characteristic diagram under each second resolution under the target view angle, so that the realization mode is simple, the calculated amount is small, and the real-time sampling point characteristic acquisition requirement can be met.
Wherein, according to the geographic position coordinates, identifying at least one target view angle capable of shooting the current key point may include:
Acquiring the projection position of the current key point under each view angle according to the geographic position information and the camera projection matrix of each view angle; and if the projection position of the current key point under the current view angle is positioned in the image range of the current view angle, determining the current view angle as the target view angle.
The geographic position coordinate of the current key point may be a relative geographic position coordinate of the current key point relative to the first top view center point, or may be an absolute geographic position coordinate of the current key point under a geographic coordinate system, which is not limited in this embodiment.
The looking-around image of each view angle can be shot by the looking-around camera under the view angle, after the shooting view angle is fixed, each looking-around camera can firstly calibrate the internal and external parameters of the looking-around camera, and then the camera projection matrix of each looking-around camera can be obtained and used as the camera projection matrix corresponding to the view angle shot by the looking-around camera. And correspondingly multiplying the geographical position information by the camera projection matrix of each view angle respectively to obtain the projection position of the current key point under each view angle.
If the projection position of the current key point under a certain view angle is located within the image range under that view angle, it indicates that the current key point is captured in the looking-around image under that view angle, and the view angle may accordingly be regarded as a target view angle; if the projection position of the current key point under a certain view angle exceeds the image range under that view angle, it indicates that the current key point cannot be captured in the looking-around image under that view angle, and the sampling point feature of the current key point does not need to be acquired from the second feature map under at least one second resolution under that view angle.
It will be appreciated by those skilled in the art that the same current keypoint may be acquired simultaneously by multiple view angle (typically 2) panoramic images, and that the current keypoint may correspond to multiple target views. Accordingly, a current keypoint may correspond to one or more target perspectives.
The advantages of this arrangement are: the target visual angle can be rapidly screened out for subsequent calculation through simple matrix multiplication operation, and the calculation amount is small and easy to realize.
The obtaining the sampling point feature of the current key point under the at least one second resolution according to the current projection position of the current key point in the second feature map under the at least one second resolution under each target view angle may include:
if the current projection position of the current key point in the current second feature map under the current second resolution of the current target view angle hits the current feature point in the current second feature map, taking the feature of the current feature point as an alternative feature of the current key point under the current second resolution;
if the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, interpolating to obtain the feature at the current projection position, and taking the feature as an alternative feature of the current key point under the current second resolution;
And obtaining the sampling point characteristics of the current key point under the current second resolution according to the alternative characteristics of the current key point obtained under each target view angle.
In this embodiment, if the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, the current second feature map may be subjected to interpolation processing through a set interpolation algorithm, for example, a bilinear interpolation algorithm, so as to expand feature points included in the current second feature map, so as to finally obtain features of the current key point at the current projection position in the current second feature map.
Through this arrangement, whether or not the current key point hits an existing feature point in the current second feature map, its sampling point feature can be accurately represented by the feature points in the second feature map, providing an accurate and reliable data source for the subsequent fusion stage of the second top view.
According to the alternative features obtained by the current key point under each target view angle, obtaining the sampling point features of the current key point under the current second resolution may include:
if the number of the obtained alternative features is a plurality of, pooling processing is carried out on each alternative feature to obtain the sampling point feature of the current key point under the current second resolution.
In this embodiment, if the current key point corresponds to multiple target views at the current second resolution, an alternative feature is obtained at each target view, that is, a situation that the current key point corresponds to multiple alternative features at the same current second resolution occurs. At this time, the sampling point features of the current key point under the current second resolution may be generated jointly according to the multiple candidate features at the same time.
The sampling point characteristic of the current key point under the current second resolution can be obtained by carrying out pooling processing on each alternative characteristic, typically, an average pooling processing mode.
The advantages of this arrangement are that: when the same key point appears in the areas shot by the multiple view angles, the second characteristic diagrams of the second resolutions of the multiple view angles can be combined to determine the sampling point characteristics of the key point under the second resolutions, so that the defect of the sampling point characteristics is avoided to the greatest extent, and the accuracy of the sampling point characteristics is improved.
S350, summarizing the sampling point characteristics of each key point belonging to the same grid unit under the same second resolution, and obtaining the sampling point characteristics of each grid unit under at least one second resolution in the first top view.
In the present embodiment, it is assumed that there are two second resolutions in total, the second resolution 1 and the second resolution 2. The sample point characteristics at the second resolution 1 for the plurality of keypoints in each grid cell in the first top view, and the sample point characteristics at the second resolution 2 for the plurality of keypoints in each grid cell in the first top view, respectively, need to be summarized accordingly.
S360, according to the first top view and the characteristics of each sampling point, fusing to obtain a second top view.
According to the technical scheme, the geographical area range corresponding to each grid unit in the first top view is obtained, and a plurality of key points are selected from each geographical area range; projecting each key point in each geographical area range into a second characteristic diagram under at least one second resolution under each view angle to obtain a sampling point characteristic of each key point under at least one second resolution; the sampling point characteristics of each key point belonging to the same grid unit under the same second resolution are summarized, the sampling point characteristics of each grid unit under at least one second resolution in the first top view are obtained, the dispersibility and the accuracy of the collection of the sampling point characteristics can be ensured, and furthermore, the fusion effect and the resolution of the second top view can be effectively improved.
Fig. 4a is a schematic diagram of yet another image generation method provided according to an embodiment of the present disclosure. In this embodiment, the "fusing the features of the first top view and the sampling points to obtain the second top view" operation is further refined.
As shown in fig. 4a, an image generating method provided in an embodiment of the present disclosure includes the following specific steps:
s410, generating a multi-scale feature map corresponding to each view angle according to the all-around view image acquired under the multi-view angle.
Wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution;
S420, performing cross-view conversion on the first feature map under each view angle to generate a first top view.
S430, in the second characteristic diagram under at least one second resolution of each view angle, sampling to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view.
S440, the first top view and the features of each sampling point under at least one second resolution are input into the decoding module set together.
The decoding module set includes a set number of decoding modules connected in series, and the number of decoding modules connected in series matches the number of second resolutions.
By way of example and not limitation, if the multi-scale feature map includes only one second resolution, the decoding module set includes only one decoding module; if the multi-scale feature map includes two second resolutions, the decoding module set includes two decoding modules connected in series, and so on, which will not be described in detail.
Optionally, inputting the features of each sampling point in the first top view and at least one second resolution together into the decoding module set may include:
inputting the first top view to a first decoding module in the set of decoding modules; and respectively inputting the characteristics of each sampling point under each second resolution into different decoding modules along the serial connection direction of the decoding modules according to the sequence from low resolution to high resolution.
The input of the first decoding module is the first top view obtained by cross-view conversion of the first feature maps under each view angle, together with the sampling point features at the lowest second resolution; its output is the top view obtained by decoding this input.
The input of the second decoding module is the output top view of the first decoding module, together with the sampling point features at the second-lowest second resolution; its output is again the top view obtained by decoding its input, and so on, which will not be described in detail.
In a specific example, assume that the multi-scale feature map corresponding to each view angle in S410 includes feature map 1 at the second resolution 512×512 and feature map 2 at the second resolution 256×256. After the processing of S430, the sampling point features {S1} of each grid cell in the first top view at the second resolution 512×512 and the sampling point features {S2} of each grid cell in the first top view at the second resolution 256×256 are obtained. For sampling point features of this type, a serial schematic diagram of a suitable decoding module set is shown in Fig. 4b.
In Fig. 4b, the decoding module set includes two decoding modules, namely decoding module 1 and decoding module 2. The input of decoding module 1 is the first top view and the sampling point features {S2} at the second resolution 256×256, and its output is the top view obtained by decoding the first top view and {S2}; the input of decoding module 2 is the output top view of decoding module 1 and the sampling point features {S1} at the second resolution 512×512, and its output is the second top view.
Through the multi-stage decoding process, the resolution of the top view can be improved step by step, and a better fusion effect can be obtained with lower calculation cost.
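By way of example and not limitation, a minimal Python/PyTorch sketch of how such a chain of decoding modules can be wired is given below; the decoding modules assumed here are any modules that take the current top view and one set of sampling point features and return a higher-resolution top view, as detailed under S450:

```python
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Chains one decoding module per second resolution, ordered from the
    lowest to the highest second resolution, as in Fig. 4a/4b."""
    def __init__(self, decoding_modules):
        super().__init__()
        self.stages = nn.ModuleList(decoding_modules)

    def forward(self, first_top_view, sample_feats_low_to_high):
        # sample_feats_low_to_high: one feature set per second resolution,
        # ordered from low resolution to high resolution
        top_view = first_top_view
        for stage, feats in zip(self.stages, sample_feats_low_to_high):
            top_view = stage(top_view, feats)    # fuse and raise the resolution
        return top_view                          # second top view
```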
S450, through each decoding module, according to the current input top view and the characteristics of each sampling point under the current input target second resolution, obtaining a new top view through fusion and outputting.
If the decoding module is the first decoding module in the decoding module set, the current input top view of the decoding module is the first top view, and if the decoding module is not the first decoding module in the decoding module set, the current input top view of the decoding module is the output top view of the previous decoding module.
In this embodiment, the sampling point features at the currently input target second resolution may be fused with equal weights into a unified sampling point feature, which is then fused with the currently input top view; alternatively, weights may be assigned to the sampling point features at the currently input target second resolution, and the features may be weighted and fused according to this weight assignment to obtain a sampling point weighted feature, which is then fused with the currently input top view.
In an optional implementation manner of this embodiment, by each decoding module, according to the current input top view and each sampling point feature under the current input target second resolution, a new top view is obtained by fusion and output, which may include:
The current input top view is subjected to scale adjustment according to the target second resolution through each decoding module, and an adjusted top view is obtained; generating a first weight graph according to the adjusted top view; weighting each sampling point characteristic under the second resolution of the currently input target according to the first weight graph to obtain a sampling point weighting characteristic; and fusing the adjusted top view with the sampling point weighting characteristics to obtain a new top view and outputting the new top view.
The advantage of this arrangement is that, by generating a first weight map corresponding to the sampling point features at the target second resolution and using it to weight those features, the resulting sampling point weighted feature is biased as far as possible toward the sampling point features of high importance, so that the fused new top view achieves higher resolution on the objects of higher importance.
In this alternative embodiment, the current input top view needs to be scaled according to the target second resolution, because the resolution of the current input top view is smaller than the target second resolution, and in order to ensure the consistency of the scales of the two in the subsequent fusion, the current input top view needs to be scaled first. For example, if the current input top view is a first top view with a resolution of 32×32 and the target second resolution is 256×256, the resolution of the current input top view needs to be adjusted from 32×32 to 256×256, and a specific adjustment manner may be interpolation zero padding or feature point duplication padding, which is not limited in this embodiment.
The method for adjusting the scale of the current input top view according to the target second resolution through each decoding module to obtain an adjusted top view may include:
and convolving and interpolating the current input top view according to the target second resolution to obtain an adjusted top view.
With this arrangement, the size of the current input top view can be adjusted simply and efficiently to match the target second resolution, meeting the subsequent top-view fusion requirement.
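As an illustrative sketch only, the convolution-and-interpolation scale adjustment can be realized as follows; the use of a 3×3 convolution that halves the channels and of bilinear interpolation that doubles the resolution (matching the 2H×2W×C/2 shape of Fig. 4c) is an assumption of this sketch:

```python
import torch.nn as nn
import torch.nn.functional as F

class ScaleAdjust(nn.Module):
    """Convolves the current input top view and bilinearly interpolates it up
    to the target second resolution; here the resolution is doubled and the
    channels are halved."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1)

    def forward(self, top_view):                          # (B, C, H, W)
        x = self.conv(top_view)                           # (B, C/2, H, W)
        return F.interpolate(x, scale_factor=2,           # (B, C/2, 2H, 2W)
                             mode='bilinear', align_corners=False)
```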
Wherein, according to the adjusted top view, generating the first weight map may include:
the adjusted top views are sequentially input into a first target fully connected network (FC) and a first logistic regression network (softmax) to generate a first weight graph.
In this embodiment, the adjusted top views with dimensions matched with the second resolution of the target are sequentially input into the first target fully-connected network and the first logistic regression network which are connected in series and trained in advance, so that a first weight graph for describing the feature importance degree of each sampling point can be obtained.
The first weight graph is a set weight vector, and each vector element in the weight vector is used for describing the weight coefficient of each sampling point characteristic under the target second resolution. The higher the weight coefficient, the higher the importance that the sample point feature corresponding to the weight coefficient plays in the entire top view.
With this arrangement, using the technically mature fully-connected network and logistic regression network, the positions of the sampling point features of higher importance in the adjusted top view can be identified accurately and efficiently, and a more accurate sampling point weighted feature can thus be obtained.
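By way of example and not limitation, the first-weight-map branch can be sketched as follows; treating the first target fully connected network as a per-grid-cell 1×1 convolution and normalizing with softmax over the N sampling point features are assumptions of this sketch:

```python
import torch.nn as nn

class FirstWeightMap(nn.Module):
    """Maps the adjusted top view (C/2 channels) to an N-channel weight map,
    one weight per key point feature of each grid unit, normalised with
    softmax over the N key points."""
    def __init__(self, in_channels: int, num_points: int):
        super().__init__()
        self.fc = nn.Conv2d(in_channels, num_points, kernel_size=1)  # per-cell FC
        self.softmax = nn.Softmax(dim=1)

    def forward(self, adjusted_top_view):                 # (B, C/2, 2H, 2W)
        return self.softmax(self.fc(adjusted_top_view))   # (B, N, 2H, 2W)
```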
In an optional implementation manner of this embodiment, after each sampling point feature at the currently input target second resolution is weighted according to the first weight map, obtaining a sampling point weighted feature may further include:
and adjusting the number of the characteristic channels of the weighted characteristics of the sampling points according to the number of the characteristic channels of the adjusted top view.
In this embodiment, a feature channel can be understood as one of the feature dimensions contained in a feature point. After the scale adjustment stage, the scale of the adjusted top view is consistent with the target second resolution corresponding to the sampling point features, but the feature dimension (number of feature channels) of the sampling point features is generally not consistent with the number of feature channels of the adjusted top view; therefore, to ensure that the two can subsequently be fused, their numbers of feature channels need to be adjusted to be consistent.
Alternatively, the number of feature channels of the sampling point weighting feature may be adjusted to match the number of feature channels of the adjusted top view by inputting the sampling point weighting feature into a fully-connected network that matches the number of feature channels of the adjusted top view.
With this arrangement, the numbers of feature channels of the adjusted top view and of the sampling point weighted feature can be aligned quickly, meeting the subsequent fusion requirement.
In an optional implementation manner of this embodiment, the fusing the adjusted top view with the weighted features of the sampling points to obtain a new top view and outputting the new top view may include:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view; generating a second key name parameter and a second key value parameter corresponding to the sampling point weighting characteristic; generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter; and according to the second weight graph and the third weight graph, carrying out weighted summation on the first key value parameter and the second key value parameter, adopting a preset activation function to process the weighted summation result, obtaining a new top view and outputting the new top view.
In the optional embodiment, based on a Key-Value mechanism, the importance degree of the weighted characteristics of the adjusted top view and the sampling point is determined, and the weighted fusion of the weighted characteristics of the adjusted top view and the sampling point is carried out according to the importance degree of the two, so that a new top view is obtained and output. Through the arrangement, the high-resolution fused top view can be accurately and efficiently obtained.
The generating the first key name parameter and the first key value parameter corresponding to the adjusted top view may include:
sequentially inputting the adjusted top view to a second target fully-connected network, a first Bayesian network and a first activation network to generate a first key name parameter;
sequentially inputting the adjusted top views to a third target fully-connected network and a second Bayesian network to generate a first key value parameter;
correspondingly, generating the second key name parameter and the second key value parameter corresponding to the sampling point weighting feature may include:
sequentially inputting the weighted characteristics of the sampling points into a fourth target full-connection network, a third Bayesian network and a second activation network to generate a second key name parameter;
and sequentially inputting the sampling point weighting characteristics into a fifth target fully-connected network and a fourth Bayesian network to generate a second key value parameter.
In addition, generating the second weight map and the third weight map according to the first key name parameter and the second key name parameter may include:
performing characteristic splicing on the first key name parameter and the second key name parameter to obtain a spliced key name parameter;
sequentially inputting the splicing key name parameters into a sixth target fully-connected network and a second logistic regression network to generate a combined weight graph;
And respectively extracting a second weight graph and a third weight graph from the combined weight graph.
For ease of understanding, a schematic diagram of the logic operations within a decoding module to which embodiments of the present disclosure are applicable is shown in Fig. 4c.
As shown in Fig. 4c, let the scale (resolution) of the current input top view be H×W with C channels (the right input source in Fig. 4c), and let the N sampling point features at the currently input target second resolution have scale N×2H×2W with C1 channels (the left input source in Fig. 4c). That is, the horizontal and vertical resolutions of the sampling point features are both twice those of the current input top view.
First, for the H×W×C top view, the specific scale adjustment shown in Fig. 4c is as follows: the top view is passed sequentially through a fully connected network (FC), a Bayesian network (BN), an activation network (ReLU), a fully connected network (FC), a Bayesian network (BN), a bilinear interpolation network (Bilinear) and an activation network (ReLU); the resolution of the top view is doubled and the number of channels is halved, giving the shape 2H×2W×C/2.
After the 2H×2W×C/2 top view is passed sequentially through a fully connected network (FC(C/2, N)) that adjusts the C/2 channels to N channels and a logistic regression network (softmax), an N-channel weight map adapted to the sampling point features (i.e., the first weight map) is obtained.
After the 2H×2W×C/2 top view is passed sequentially through a fully connected network (FC) and a Bayesian network (BN), a key value parameter Φvalue matching the top view is obtained; meanwhile, by passing the 2H×2W×C/2 top view sequentially through a fully connected network (FC), a Bayesian network (BN) and an activation network (ReLU), a key name parameter Φkey matching the top view is obtained.
In addition, the N×2H×2W×C1 sampling point features are first weighted and summed with the N-channel weight map obtained above, so that the N sampling point features are merged into one sampling point weighted feature of shape 2H×2W×C1.
After the 2H×2W×C1 sampling point feature is passed sequentially through a fully connected network (FC), a Bayesian network (BN) and an activation network (ReLU), its C1 channels are adjusted to C/2 channels. Then, a key value parameter Φvalue matching the sampling point feature is obtained by passing the 2H×2W×C/2 sampling point feature sequentially through a fully connected network (FC) and a Bayesian network (BN); and a key name parameter Φkey matching the sampling point feature is obtained by passing the 2H×2W×C/2 sampling point feature sequentially through a fully connected network (FC), a Bayesian network (BN) and an activation network (typically using a ReLU activation function).
The key name parameter Φkey matching the top view and the key name parameter Φkey matching the sampling point feature are then combined to obtain a C-channel key name parameter, which is input to a fully connected network (FC(C, 2)) that adjusts the C channels to 2 channels, yielding the two-channel weight vectors w1 and w2.
Finally, the key value parameter Φvalue matching the top view and the key value parameter Φvalue matching the sampling point feature are weighted and summed with w1 and w2 respectively, and the weighted sum is input to an activation network (ReLU), so that the new top view output by this decoding module is obtained.
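By way of example and not limitation, a minimal PyTorch sketch of one decoding module following the flow of Fig. 4c is given below. Every concrete layer choice is an assumption of this sketch rather than a requirement of the disclosure: the per-pixel fully connected networks are written as 1×1 convolutions, the BN blocks are implemented here as batch-normalization layers, and the output keeps C/2 channels as in the figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodingModule(nn.Module):
    """One stage of the hierarchical decoder, sketched after Fig. 4c.
    Input top view: (B, C, H, W); sampling point features: (B, N, C1, 2H, 2W);
    output top view: (B, C/2, 2H, 2W)."""
    def __init__(self, c: int, c1: int, n: int):
        super().__init__()
        half = c // 2
        # scale adjustment: FC-BN-ReLU-FC-BN (bilinear upsampling and ReLU in forward)
        self.adjust = nn.Sequential(
            nn.Conv2d(c, half, 1), nn.BatchNorm2d(half), nn.ReLU(),
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half))
        # FC(C/2, N) producing the first weight map (softmax applied in forward)
        self.weight_fc = nn.Conv2d(half, n, 1)
        # channel adjustment of the sampling point weighted feature: C1 -> C/2
        self.point_proj = nn.Sequential(
            nn.Conv2d(c1, half, 1), nn.BatchNorm2d(half), nn.ReLU())
        # key name / key value heads for both branches
        self.key_tv = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU())
        self.val_tv = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half))
        self.key_pt = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU())
        self.val_pt = nn.Sequential(nn.Conv2d(half, half, 1), nn.BatchNorm2d(half))
        # FC(C, 2): turns the concatenated key names into the fusion weights w1, w2
        self.fuse_fc = nn.Conv2d(c, 2, 1)

    def forward(self, top_view, point_feats):
        h2, w2 = point_feats.shape[-2:]
        # 1. scale adjustment: H x W x C  ->  2H x 2W x C/2
        x = F.relu(F.interpolate(self.adjust(top_view), size=(h2, w2),
                                 mode='bilinear', align_corners=False))
        # 2. first weight map, then merge the N key point features into one
        w = torch.softmax(self.weight_fc(x), dim=1)            # (B, N, 2H, 2W)
        pts = (point_feats * w.unsqueeze(2)).sum(dim=1)        # (B, C1, 2H, 2W)
        pts = self.point_proj(pts)                             # (B, C/2, 2H, 2W)
        # 3. key-value fusion of the two branches
        keys = torch.cat([self.key_tv(x), self.key_pt(pts)], dim=1)   # (B, C, 2H, 2W)
        w12 = torch.softmax(self.fuse_fc(keys), dim=1)                # (B, 2, 2H, 2W)
        fused = w12[:, :1] * self.val_tv(x) + w12[:, 1:] * self.val_pt(pts)
        return F.relu(fused)                                   # new top view
```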
S460, obtaining an output top view of the last decoding module as a second top view.
According to the technical scheme, the first top view and the characteristics of each sampling point under at least one second resolution are input into a decoding module set together; through each decoding module, according to the current input top view and the characteristics of each sampling point under the current input target second resolution, obtaining a new top view through fusion and outputting; the mode of obtaining the output top view of the last decoding module as the second top view can improve the fusion effect of the second top view and ensure the resolution requirement on the second top view.
It should be noted that, in embodiments of the present disclosure, the multi-scale feature map may be obtained by a multi-scale feature extractor, the first top view may be generated by a cross-view converter, and the second top view may be obtained by fusing the first top view and the sampling point features through a hierarchical decoder.
Wherein, this cross visual angle converter can include: a first fully-connected network, a second fully-connected network, and a multi-headed attentive network. The hierarchical decoder may be made up of a plurality of concatenated decoding modules. Wherein, each decoding module can include: the first target fully-connected network, the second target fully-connected network, the third target fully-connected network, the fourth target fully-connected network, the fifth target fully-connected network, the sixth target fully-connected network, the first logistic regression network, the second logistic regression network, the first bayesian network, the second bayesian network, the third bayesian network, the fourth bayesian network, the first active network, and the second active network. Specifically, the multi-scale feature extractor, each network in the cross-view converter, and each network in the hierarchical decoder may be obtained by pre-training a machine learning model of a set structure using a set training sample.
Fig. 5a is a schematic diagram of an image segmentation method according to an embodiment of the disclosure. The embodiments of the present disclosure may be applied to the case of generating a top view and semantically segmenting the top view. The method may be performed by an image segmentation apparatus, which may be implemented in hardware and/or software, and may be generally integrated in a vehicle-mounted terminal.
As shown in fig. 5a, an image segmentation method provided in an embodiment of the present disclosure includes the following specific steps:
S510, acquiring a plurality of looking-around images under a plurality of viewing angles through a plurality of looking-around cameras.
In this embodiment, a plurality of looking-around cameras may be provided on a set carrying device (for example, a vehicle, a ship, or an aircraft, etc.), and a plurality of looking-around images under multiple angles of view may be acquired.
Optionally, if the carrying device is a vehicle, a plurality of looking-around cameras may be disposed at different positions of the vehicle body, and the looking-around cameras are configured to collect a plurality of looking-around images from a plurality of view angles with the vehicle as a center, so as to achieve an effect of looking-around 360 degrees around the vehicle.
Accordingly, the vehicle may be a normal vehicle, an automated driving vehicle, a vehicle with a driving assistance function, or the like, which is not limited in this embodiment. Accordingly, the solution of the embodiment of the present disclosure may be applied to a general driving scenario, an automatic driving scenario, and an auxiliary driving scenario, which is not limited in this embodiment.
S520, generating a multi-scale feature map corresponding to each view angle according to the all-around view image acquired under the multi-view angle.
Wherein the multi-scale feature map comprises a first feature map at a first resolution and at least one second feature map at a second resolution, the second resolution being higher than the first resolution.
S530, performing cross-view conversion on the first feature map under each view angle to generate a first top view.
S540, in the second characteristic diagram under at least one second resolution of each view angle, sampling to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view.
S550, according to the first top view and the characteristics of each sampling point, fusing to obtain a second top view.
S560, performing semantic segmentation on the second top view, and obtaining a category identification result of each grid unit in the second top view.
It can be appreciated that the above S520-S550 may be specifically implemented by the method of any of the foregoing embodiments, which is not repeated in the embodiments of the present disclosure.
According to the technical scheme of this embodiment, a high-precision second top view is generated, by means of the image generation method described above, from the plurality of looking-around images acquired at multiple view angles by the plurality of looking-around cameras, and the category identification result of each grid unit in the second top view is obtained. In this way, a high-precision top view is generated at low computational cost, so that the object recognition result of the surrounding environment can be determined accurately, efficiently and in a timely manner while keeping the cost controllable, meeting practical environment recognition requirements.
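For ease of understanding, a minimal sketch of such a semantic segmentation head is given below; the use of a single 1×1 convolutional classifier over the second top view is an assumption of this sketch, since the disclosure only requires that the second top view be semantically segmented:

```python
import torch.nn as nn

class TopViewSegmenter(nn.Module):
    """Assigns one category to every grid unit of the second top view via a
    per-cell classifier."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, second_top_view):           # (B, C, H, W)
        logits = self.head(second_top_view)       # (B, num_classes, H, W)
        return logits.argmax(dim=1)               # category id per grid unit
```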
On the basis of the foregoing embodiments, after obtaining the category identification result of each grid cell in the second top view, the method may further include:
and generating driving decision information of the vehicle according to the category identification result of each grid cell in the second plan view.
In this optional embodiment, after the second top view with high precision is obtained through lower calculation cost, the driving decision information can be accurately generated directly based on the semantic recognition result of the second top view, so as to ensure the safety of the vehicle driving process.
By way of example, and not limitation, a functional block diagram that can implement an image segmentation method provided by embodiments of the present disclosure is shown in Fig. 5b. As shown in Fig. 5b, looking-around images at S (S > 1) view angles are acquired by a plurality of looking-around cameras arranged on the vehicle, and the multi-scale feature map corresponding to each view angle is obtained by inputting the S looking-around images into a multi-scale feature extractor. The multi-scale feature map serves two purposes: on the one hand, the lowest-resolution feature map in the multi-scale feature map is input to the cross-view converter to obtain a low-resolution top view; on the other hand, one or more high-resolution feature maps in the multi-scale feature map are cube-sampled, and the sampled features are input, together with the low-resolution top view, into the hierarchical decoder, where fusion produces a high-resolution top view. Finally, the high-resolution top view is semantically segmented by a semantic segmenter to obtain the object recognition result of the environment around the vehicle.
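By way of example and not limitation, the overall flow of Fig. 5b can be sketched as the following Python function, in which every callable is a placeholder for one of the trained components described above (multi-scale feature extractor, cross-view converter, cube sampler, hierarchical decoder, semantic segmenter) rather than an API defined by the disclosure:

```python
def segment_surround_views(images, extractor, converter, sampler, decoder, segmenter):
    """End-to-end flow of Fig. 5b.  'images' holds the S looking-around images;
    the remaining arguments are the trained components."""
    pyramids = [extractor(img) for img in images]           # one multi-scale pyramid per view
    low_res_maps = [p[0] for p in pyramids]                 # lowest-resolution feature maps
    high_res_maps = [p[1:] for p in pyramids]               # higher-resolution feature maps
    first_top_view = converter(low_res_maps)                # low-resolution top view
    point_feats = sampler(first_top_view, high_res_maps)    # cube-sampled point features
    second_top_view = decoder(first_top_view, point_feats)  # high-resolution top view
    return segmenter(second_top_view)                       # per-grid-unit categories
```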
As an implementation of the above-mentioned image generation methods, the present disclosure also provides an optional embodiment of an execution apparatus that implements the above-mentioned image generation methods.
Fig. 6 is a block diagram of an image generating apparatus according to an embodiment of the present disclosure, as shown in fig. 6, including: a multi-scale feature map acquisition module 610, a first top view generation module 620, a sampling module 630, and a second top view fusion module 640, wherein:
a multi-scale feature map obtaining module 610, configured to generate a multi-scale feature map corresponding to each view angle according to a plurality of view-around images acquired under multiple view angles, where the multi-scale feature map includes a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
the first top view generating module 620 is configured to perform cross-view conversion on the first feature map under each view angle, so as to generate a first top view;
the sampling module 630 is configured to sample, in the second feature map at the at least one second resolution at each view angle, a feature of a sampling point of each grid cell at the at least one second resolution in the first top view;
And the second top view fusion module 640 is configured to fuse the first top view with each feature of the sampling points to obtain a second top view.
According to the technical scheme of this embodiment, a multi-scale feature map corresponding to each view angle is generated from the looking-around images acquired at multiple view angles, and the first feature map under each view angle is converted across view angles to generate a first top view; sampling is performed in the second feature map at the at least one second resolution under each view angle to obtain the sampling point feature of each grid unit in the first top view at the at least one second resolution; and a second top view is obtained by fusing the first top view with the sampling point features. In this way, the computationally expensive cross-view conversion is performed only on the low-resolution feature map to generate the first top view, and the high-resolution features are then fused with the first top view at a lower computational cost to obtain the high-resolution second top view, so the resolution of the top view can be improved effectively with low-cost computation.
Based on the above embodiments, the multi-scale feature map acquisition module may be used to:
acquiring a plurality of panoramic images respectively acquired under multiple visual angles;
and respectively carrying out multi-scale feature extraction on the looking-around images under each view angle, and obtaining feature images of each looking-around image under a plurality of resolutions as multi-scale feature images respectively corresponding to each view angle.
On the basis of the above embodiments, the first top view generating module may include:
the global feature generation unit is used for generating a first global feature and a second global feature which respectively correspond to the first feature map under each view angle;
and the iteration generating unit is used for iteratively generating a first top view by adopting a multi-head attention mechanism according to each first global feature, each second global feature and a position coding value for describing the position relationship between the image space and the top view space.
On the basis of the foregoing embodiments, the global feature generating unit may be configured to:
generating first global features corresponding to the first feature diagrams under each view through a first full-connection network;
and generating second global features corresponding to the first feature diagrams under each view through a second fully connected network.
On the basis of the foregoing embodiments, the iteration generating unit may include:
the target key name determining subunit is used for determining each target key name parameter applied to the multi-head attention network according to the first global feature corresponding to each view angle respectively, the camera coding value corresponding to each view angle respectively and the preset pixel position coding value;
A target key value determining subunit, configured to determine each target key value parameter applied to the multi-head attention network according to a second global feature corresponding to each view angle;
and the multi-head attention iteration unit is used for iteratively generating a first top view by adopting a multi-head attention network according to each target key name parameter, each target key value parameter and a grid unit position coding value in a preset top view space.
On the basis of the above embodiments, the multi-head attention iteration unit may be configured to:
under each iteration round, acquiring a first top view obtained by previous round iteration as a historical top view;
calculating to obtain target query parameters applied to the multi-head attention network according to the historical top view and grid cell position coding values in the top view space;
and calculating to obtain a first top view under the current iteration round according to each target key name parameter, each target key value parameter and the target query parameter by adopting a multi-head attention network.
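By way of example and not limitation, the iterative generation of the first top view with a multi-head attention network can be sketched as follows; the embedding dimension, the number of iterations, the zero initialization of the first historical top view and the additive position codes are all assumptions of this sketch:

```python
import torch
import torch.nn as nn

class CrossViewConverter(nn.Module):
    """Iteratively refines a top view query with multi-head attention over the
    per-view global features."""
    def __init__(self, dim: int, num_cells: int, num_heads: int = 8, iters: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cell_pos = nn.Parameter(torch.zeros(1, num_cells, dim))  # grid cell position codes
        self.iters = iters

    def forward(self, keys, values):
        # keys   : (B, S*HW, dim) -- target key name parameters (first global
        #          features plus camera and pixel position codes)
        # values : (B, S*HW, dim) -- target key value parameters (second global features)
        top_view = keys.new_zeros(keys.shape[0], self.cell_pos.shape[1], keys.shape[-1])
        for _ in range(self.iters):
            query = top_view + self.cell_pos          # historical top view + position codes
            top_view, _ = self.attn(query, keys, values)
        return top_view                               # first top view, one row per grid cell
```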
On the basis of the foregoing embodiments, the sampling module may include:
the key point selection unit is used for acquiring geographic area ranges corresponding to each grid unit in the first top view respectively, and selecting a plurality of key points in each geographic area range respectively;
The sampling point feature acquisition unit is used for projecting each key point in each geographical area range to a second feature map under at least one second resolution under each view angle to obtain the sampling point feature of each key point under at least one second resolution;
and the sampling point feature summarizing unit is used for summarizing the sampling point features of each key point belonging to the same grid unit under the same second resolution to obtain the sampling point features of each grid unit under at least one second resolution in the first top view.
On the basis of the above embodiments, the key point selecting unit may be configured to:
acquiring a plane rectangular position range of each grid unit in the first top view;
and forming a cube region range corresponding to each grid unit according to the position range of each plane rectangle and the preset height value.
On the basis of the above embodiments, the key point selecting unit may be configured to:
and selecting a preset number of spherical neighborhood points as key points in each geographical area range.
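As an illustrative sketch only, constructing a cube region range from a grid unit's planar rectangle and a preset height, and then selecting key points inside it, might look as follows; uniform random sampling is used here for simplicity, whereas the disclosure also allows selecting a preset number of spherical neighborhood points:

```python
import torch

def cube_keypoints(x0, y0, x1, y1, height, num_points=8):
    """Extends a grid unit's planar rectangle [x0, x1] x [y0, y1] by a preset
    height into a cube region range and draws num_points key points inside it."""
    xs = torch.empty(num_points).uniform_(x0, x1)
    ys = torch.empty(num_points).uniform_(y0, y1)
    zs = torch.empty(num_points).uniform_(0.0, height)
    return torch.stack([xs, ys, zs], dim=-1)      # (num_points, 3) geographic coordinates
```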
On the basis of the foregoing embodiments, the sampling point feature acquiring unit may include:
the geographic position coordinate acquisition subunit is used for acquiring geographic position coordinates of the current key points in the current processing geographic area range;
A target view angle identification subunit, configured to identify, according to the geographic location coordinates, at least one target view angle from which a current key point can be captured;
and the sampling point characteristic obtaining subunit is used for obtaining the sampling point characteristic of the current key point under at least one second resolution according to the current projection position in the second characteristic diagram of the current key point under at least one second resolution under each target view angle.
On the basis of the above embodiments, the target viewing angle identifying subunit may be configured to:
acquiring the projection position of the current key point under each view angle according to the geographic position information and the camera projection matrix of each view angle;
and if the projection position of the current key point under the current view angle is positioned in the image range of the current view angle, determining the current view angle as the target view angle.
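By way of example and not limitation, identifying the target view angles of a key point can be sketched as follows; a homogeneous pinhole projection with a 3×4 camera projection matrix and an explicit positive-depth check are assumptions of this sketch:

```python
import torch

def target_views_for_point(point_xyz, proj_matrices, image_sizes):
    """Projects one key point (geographic coordinates, shape (3,)) into every
    view angle with that view's camera projection matrix and keeps the views
    whose projection falls inside the image range."""
    hits = []
    homo = torch.cat([point_xyz, point_xyz.new_ones(1)])       # homogeneous (4,)
    for view, (proj, (width, height)) in enumerate(zip(proj_matrices, image_sizes)):
        u, v, depth = proj @ homo                              # proj: (3, 4)
        if depth <= 0:                                         # behind the camera
            continue
        u, v = u / depth, v / depth
        if 0 <= u < width and 0 <= v < height:
            hits.append((view, (float(u), float(v))))          # target view + projection position
    return hits
```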
On the basis of the above embodiments, the sampling point feature acquiring subunit may be configured to:
if the current projection position of the current key point in the current second feature map under the current second resolution of the current target view angle hits the current feature point in the current second feature map, taking the feature of the current feature point as an alternative feature of the current key point under the current second resolution;
If the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, interpolating to obtain the feature at the current projection position, and taking the feature as an alternative feature of the current key point under the current second resolution;
and obtaining the sampling point characteristics of the current key point under the current second resolution according to the alternative characteristics of the current key point obtained under each target view angle.
On the basis of the foregoing embodiments, the sampling point feature acquiring subunit may be further configured to:
if the number of the obtained alternative features is a plurality of, pooling processing is carried out on each alternative feature to obtain the sampling point feature of the current key point under the current second resolution.
On the basis of the foregoing embodiments, the second top view fusion module may include:
the joint input unit is used for inputting the characteristics of each sampling point under the first top view and at least one second resolution into the decoding module set together;
the decoding module set is connected with a set number of decoding modules in series, and the number of the decoding modules connected in series is matched with the number of the second resolution;
the serial output unit is used for obtaining a new top view through fusion according to the characteristics of each sampling point under the current input top view and the current input target second resolution through each decoding module;
And the second top view acquisition unit is used for acquiring the output top view of the last decoding module as a second top view.
On the basis of the above embodiments, the joint input unit may be configured to:
inputting the first top view to a first decoding module in the set of decoding modules;
and respectively inputting the characteristics of each sampling point under each second resolution into different decoding modules along the serial connection direction of the decoding modules according to the sequence from low resolution to high resolution.
On the basis of the above embodiments, the serial output unit may include:
the adjustment output subunit is used for performing scale adjustment on the current input top view according to the target second resolution through each decoding module to obtain an adjusted top view;
the first weight map generation subunit is used for generating a first weight map according to the adjusted top view;
the sampling point weighting characteristic obtaining subunit is used for carrying out weighting processing on each sampling point characteristic under the second resolution of the currently input target according to the first weight graph to obtain a sampling point weighting characteristic;
and the top view output subunit is used for fusing the adjusted top view and the sampling point weighting characteristics through each decoding module to obtain a new top view and output the new top view.
On the basis of the above embodiments, the adjustment output subunit may be configured to:
and convolving and interpolating the current input top view according to the target second resolution to obtain an adjusted top view.
On the basis of the foregoing embodiments, the first weight map generating subunit may be configured to:
and sequentially inputting the adjusted top views to a first target fully-connected network and a first logistic regression network to generate a first weight graph.
On the basis of the above embodiments, the present invention may further include a channel number adjustment unit configured to:
and after the sampling point characteristics under the current input target second resolution are weighted according to the first weight graph to obtain sampling point weighted characteristics, the number of characteristic channels of the sampling point weighted characteristics is adjusted according to the number of characteristic channels of the adjusted top view.
On the basis of the above embodiments, the top view output subunit may be configured to:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view;
generating a second key name parameter and a second key value parameter corresponding to the sampling point weighting characteristic;
generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter;
And according to the second weight graph and the third weight graph, carrying out weighted summation on the first key value parameter and the second key value parameter, adopting a preset activation function to process the weighted summation result, obtaining a new top view and outputting the new top view.
On the basis of the above embodiments, the top view output subunit may be further configured to:
sequentially inputting the adjusted top view to a second target fully-connected network, a first Bayesian network and a first activation network to generate a first key name parameter;
sequentially inputting the adjusted top views to a third target fully-connected network and a second Bayesian network to generate a first key value parameter;
sequentially inputting the weighted characteristics of the sampling points into a fourth target full-connection network, a third Bayesian network and a second activation network to generate a second key name parameter;
and sequentially inputting the sampling point weighting characteristics into a fifth target fully-connected network and a fourth Bayesian network to generate a second key value parameter.
On the basis of the above embodiments, the top view output subunit may be further configured to:
performing characteristic splicing on the first key name parameter and the second key name parameter to obtain a spliced key name parameter;
sequentially inputting the splicing key name parameters into a sixth target fully-connected network and a second logistic regression network to generate a combined weight graph;
And respectively extracting a second weight graph and a third weight graph from the combined weight graph.
On the basis of the above embodiments, the method may further include: the semantic segmentation module is used for carrying out semantic segmentation on the second top view after the second top view is obtained through fusion according to the first top view and the characteristics of each sampling point, and obtaining a category identification result of each grid unit in the second top view.
The product can execute the method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the method.
As an implementation of the above-mentioned image generation methods, the present disclosure also provides an optional embodiment of an execution apparatus that implements the above-mentioned image segmentation methods.
Fig. 7 is a block diagram of an image segmentation apparatus provided according to an embodiment of the present disclosure. As shown in fig. 7, the image dividing apparatus includes: a look-around image acquisition module 710, a fusion module 720, and an identification module 730. Wherein:
a looking-around image acquisition module 710, configured to acquire a plurality of looking-around images under a plurality of viewing angles through a plurality of looking-around cameras;
a fusion module 720, configured to fuse the plurality of ring-view images to obtain a second top view through the image generating method according to any embodiment of the disclosure;
And the identification module 730 is configured to perform semantic segmentation on the second top view, and obtain a category identification result of each grid unit in the second top view.
According to the technical scheme of this embodiment, a high-precision second top view is generated, by means of the image generation method described above, from the plurality of looking-around images acquired at multiple view angles by the plurality of looking-around cameras, and the category identification result of each grid unit in the second top view is obtained. In this way, a high-precision top view is generated at low computational cost, so that the object recognition result of the surrounding environment can be determined accurately, efficiently and in a timely manner on the premise that the implementation cost is controllable, meeting practical environment recognition requirements.
On the basis of the above embodiments, the method may further include: and the driving decision information generation module is used for generating driving decision information of the vehicle according to the category identification result of each grid cell in the second plan view after the category identification result of each grid cell in the second plan view is acquired.
The product can execute the method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the method.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the user's personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device (in-vehicle terminal), a readable storage medium, and a computer program product.
Fig. 8 shows a schematic block diagram of an example electronic device (in-vehicle terminal) 800 that may be used to implement embodiments of the present disclosure. Electronic devices (vehicle terminals) are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, an electronic device (in-vehicle terminal) 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device (in-vehicle terminal) 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A plurality of components in the electronic apparatus (in-vehicle terminal) 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic apparatus (in-vehicle terminal) 800 to exchange information/data with other apparatuses through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 801 performs the respective methods and processes described above, such as an image generation method or an image segmentation method. For example, in some embodiments, the image generation method or the image segmentation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device (in-vehicle terminal) 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image generation method or the image segmentation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image generation method or the image segmentation method in any other suitable way (e.g. by means of firmware).
The image generation method comprises the following steps:
generating a multi-scale feature map corresponding to each view angle according to the all-around image acquired under the multi-view angle, wherein the multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
performing cross-view conversion on the first feature map under each view to generate a first top view;
sampling in the second characteristic diagram under at least one second resolution of each view to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view;
and according to the first top view and the characteristics of each sampling point, obtaining a second top view through fusion.
And, the image segmentation method includes:
collecting a plurality of looking-around images under a plurality of visual angles through a plurality of looking-around cameras;
generating a multi-scale feature map corresponding to each view angle according to the all-around image acquired under the multi-view angle, wherein the multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
performing cross-view conversion on the first feature map under each view to generate a first top view;
Sampling in the second characteristic diagram under at least one second resolution of each view to obtain the sampling point characteristic of each grid unit under at least one second resolution in the first top view;
according to the first top view and the characteristics of each sampling point, a second top view is obtained through fusion;
and carrying out semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
Various implementations of the systems and techniques described here above can be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
Artificial intelligence is the discipline of studying the process of making a computer mimic certain mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning, etc.) of a person, both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; the artificial intelligent software technology mainly comprises a computer vision technology, a voice recognition technology, a natural language processing technology, a machine learning/deep learning technology, a big data processing technology, a knowledge graph technology and the like.
Cloud computing refers to a technical system in which an elastically scalable pool of shared physical or virtual resources is accessed over a network; the resources may include servers, operating systems, networks, software, applications, storage devices and the like, and can be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications such as artificial intelligence and blockchain, as well as for model training.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (49)

1. An image generation method, comprising:
generating a multi-scale feature map corresponding to each view angle according to the all-around image acquired under the multi-view angle, wherein the multi-scale feature map comprises a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
performing cross-view conversion on the first feature map under each view to generate a first top view;
sampling in a second feature map under at least one second resolution under each view angle, according to the geographical position range of each grid unit in the first top view and the mapping relation between different view angles and the geographical position range, to obtain sampling point features of each grid unit under at least one second resolution in the first top view;
obtaining a second top view through fusion according to the first top view and each sampling point feature;
wherein obtaining the second top view through fusion according to the first top view and each sampling point feature comprises:
inputting the first top view and the sampling point features under at least one second resolution together into a decoding module set;
obtaining, by each decoding module in the decoding module set, a new top view through fusion according to the currently input top view and the sampling point features under the currently input target second resolution, and outputting the new top view;
and taking the output top view of the last decoding module as the second top view.
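By way of illustration only, the overall flow of claim 1 can be sketched as follows in Python. The helper callables (extract_multiscale_features, cross_view_transform, sample_grid_features, decoder_set) are hypothetical placeholders for the steps recited above, and camera-geometry arguments are omitted for brevity; this is a sketch, not the claimed implementation.

```python
# Minimal orchestration sketch of claim 1. All helper names are hypothetical
# placeholders; geometry/camera arguments are omitted for brevity.
def generate_second_top_view(surround_images, extract_multiscale_features,
                             cross_view_transform, sample_grid_features, decoder_set):
    # 1. Per-view multi-scale feature maps: index 0 is the first (lowest)
    #    resolution, the remaining indices are the second (higher) resolutions.
    feats_per_view = [extract_multiscale_features(img) for img in surround_images]

    # 2. Cross-view conversion of the first feature maps into a first top view.
    first_top_view = cross_view_transform([f[0] for f in feats_per_view])

    # 3. Sampling point features of each grid cell at every second resolution.
    n_scales = len(feats_per_view[0])
    sampled = [sample_grid_features([f[i] for f in feats_per_view], first_top_view)
               for i in range(1, n_scales)]

    # 4. Fuse the first top view with the sampled features through the
    #    decoding module set to obtain the second top view.
    return decoder_set(first_top_view, sampled)
```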
2. The method of claim 1, wherein generating a multi-scale feature map corresponding to each view angle separately according to the all-around view images acquired under multiple view angles comprises:
acquiring a plurality of looking-around images respectively acquired under multiple view angles;
and respectively carrying out multi-scale feature extraction on the looking-around images under each view angle, and obtaining feature images of each looking-around image under a plurality of resolutions as multi-scale feature images respectively corresponding to each view angle.
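For illustration, a minimal PyTorch sketch of per-view multi-scale feature extraction as in claim 2 is given below; the backbone structure, channel counts, strides, and the six-camera setup are assumptions and do not appear in the claim.

```python
# Minimal sketch of per-view multi-scale feature extraction (claim 2).
# Channel counts, strides, and the six-camera setup are illustrative only.
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, image):
        f_high = self.stage1(image)    # 1/2 resolution: a "second" (higher) resolution map
        f_mid = self.stage2(f_high)    # 1/4 resolution: a "second" (higher) resolution map
        f_low = self.stage3(f_mid)     # 1/8 resolution: the "first" (lowest) resolution map
        return [f_low, f_mid, f_high]  # lowest resolution first

# The same backbone is applied to every surround-view image to obtain the
# multi-scale feature maps corresponding to each view angle.
backbone = MultiScaleBackbone()
views = [torch.randn(1, 3, 256, 448) for _ in range(6)]
multi_scale = [backbone(v) for v in views]
```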
3. The method of claim 1, wherein performing cross-view conversion on the first feature map under each view angle to generate a first top view comprises:
generating a first global feature and a second global feature which respectively correspond to the first feature map under each view angle;
and iteratively generating a first top view by adopting a multi-head attention mechanism according to each first global feature, each second global feature and a position coding value for describing the position relation between the image space and the top view space.
4. A method according to claim 3, wherein generating a first global feature and a second global feature corresponding to the first feature map at each view, respectively, comprises:
generating first global features respectively corresponding to the first feature maps under each view angle through a first fully connected network;
and generating second global features respectively corresponding to the first feature maps under each view angle through a second fully connected network.
5. A method according to claim 3, wherein iteratively generating a first top view using a multi-headed attention mechanism based on each first global feature, each second global feature, and a position-coded value describing a positional relationship between image space and top view space, comprises:
determining each target key name parameter applied to the multi-head attention network according to the first global feature corresponding to each view, the camera coding value corresponding to each view and the preset pixel position coding value;
determining target key value parameters applied to the multi-head attention network according to second global features corresponding to each view angle respectively;
and iteratively generating a first top view by adopting a multi-head attention network according to each target key name parameter, each target key value parameter and a grid cell position coding value in a preset top view space.
6. The method of claim 5, wherein iteratively generating a first plan view using a multi-headed attention network based on each target key name parameter, each target key value parameter, and a grid cell position code value in a preset plan view space, comprises:
Under each iteration round, acquiring a first top view obtained by previous round iteration as a historical top view;
calculating to obtain target query parameters applied to the multi-head attention network according to the historical top view and grid cell position coding values in the top view space;
and calculating to obtain a first top view under the current iteration round according to each target key name parameter, each target key value parameter and the target query parameter by adopting a multi-head attention network.
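A minimal sketch of the iterative cross-view conversion of claims 3 to 6 follows; the embedding dimensions, the number of iterations, and the additive combination of camera, pixel-position, and grid-cell encodings are illustrative assumptions rather than the claimed formulation.

```python
# Minimal sketch of the iterative cross-view conversion in claims 3-6.
# Dimensions, iteration count, and the additive combination of encodings
# are assumptions for illustration.
import torch
import torch.nn as nn

class CrossViewTopViewGenerator(nn.Module):
    def __init__(self, dim=128, heads=8, n_views=6, n_cells=1024, iters=3):
        super().__init__()
        self.key_fc = nn.Linear(dim, dim)   # "first" fully connected network -> first global features
        self.val_fc = nn.Linear(dim, dim)   # "second" fully connected network -> second global features
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cam_embed = nn.Parameter(torch.randn(n_views, 1, dim))  # per-camera coding values
        self.grid_embed = nn.Parameter(torch.randn(n_cells, dim))    # grid-cell position coding values
        self.iters = iters

    def forward(self, view_feats, pixel_pos):
        # view_feats: (n_views, n_pixels, dim) flattened first feature maps
        # pixel_pos:  (n_pixels, dim) pixel position coding values
        keys = self.key_fc(view_feats) + self.cam_embed + pixel_pos   # target key name parameters
        values = self.val_fc(view_feats)                              # target key value parameters
        keys = keys.reshape(1, -1, keys.shape[-1])
        values = values.reshape(1, -1, values.shape[-1])

        top_view = torch.zeros(1, self.grid_embed.shape[0], keys.shape[-1])
        for _ in range(self.iters):
            # Target query parameter: historical top view plus grid-cell position codes.
            query = top_view + self.grid_embed
            top_view, _ = self.attn(query, keys, values)
        return top_view   # first top view: one feature vector per grid cell
```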
7. The method of claim 1, wherein sampling the second feature map at the at least one second resolution for each grid cell in the first top view according to the geographical location range of each grid cell in the first top view and the mapping relationship between the different perspectives and the geographical location ranges to obtain the sampled point feature at the at least one second resolution for each grid cell in the first top view comprises:
obtaining geographic area ranges corresponding to each grid unit in the first top view respectively, and selecting a plurality of key points in each geographic area range respectively;
projecting each key point in each geographical area range into the second feature map under at least one second resolution under each view angle to obtain a sampling point feature of each key point under at least one second resolution;
Summarizing the sampling point characteristics of each key point belonging to the same grid unit under the same second resolution, and obtaining the sampling point characteristics of each grid unit under at least one second resolution in the first top view.
8. The method of claim 7, wherein obtaining a geographic area range corresponding to each grid cell in the first top view, respectively, comprises:
acquiring a plane rectangular position range of each grid unit in the first top view;
and forming a cube region range corresponding to each grid unit according to the position range of each plane rectangle and the preset height value.
9. The method of claim 8, wherein selecting a plurality of keypoints within each geographic region comprises:
and selecting a preset number of spherical neighborhood points as key points in each geographical area range.
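For claims 8 and 9, a minimal sketch of how a grid cell's planar rectangle plus a preset height value could yield a cube-like region with spherical-neighbourhood key points is given below; the height value, point count, and metric units are assumptions for illustration.

```python
# Minimal sketch of claims 8-9: a grid cell's planar rectangle plus a preset
# height gives a cube-like region, inside which spherical-neighbourhood key
# points are selected. The height value and point count are assumptions.
import torch

def keypoints_for_cell(x_range, y_range, height=2.0, n_points=8):
    """x_range, y_range: (min, max) of the cell's planar rectangle, in metres."""
    centre = torch.tensor([0.5 * (x_range[0] + x_range[1]),
                           0.5 * (y_range[0] + y_range[1]),
                           0.5 * height])
    radius = 0.5 * min(x_range[1] - x_range[0], y_range[1] - y_range[0], height)
    directions = torch.randn(n_points, 3)
    directions = directions / directions.norm(dim=1, keepdim=True)
    radii = radius * torch.rand(n_points, 1)
    return centre + directions * radii        # (n_points, 3) geographic coordinates
```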
10. The method of claim 7, wherein projecting keypoints within each geographic region into the second feature map at the at least one second resolution at each view angle results in a sampled point feature at the at least one second resolution for each keypoint, comprising:
obtaining geographic position coordinates of a current key point in a current processing geographic area range;
Identifying at least one target view angle capable of shooting a current key point according to the geographic position coordinates;
and obtaining the sampling point feature of the current key point under at least one second resolution according to the current projection position of the current key point in the second feature map under at least one second resolution under each target view angle.
11. The method of claim 10, wherein identifying at least one target perspective from which a current keypoint can be captured based on the geographic location coordinates comprises:
acquiring the projection position of the current key point under each view angle according to the geographic position coordinates and the camera projection matrix of each view angle;
and if the projection position of the current key point under the current view angle is positioned in the image range of the current view angle, determining the current view angle as the target view angle.
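The visibility test of claims 10 and 11 can be sketched as follows; the 3x4 projection-matrix convention and homogeneous-coordinate handling are assumptions about how the camera projection matrix is applied.

```python
# Minimal sketch of claims 10-11: project a key point with each camera's
# projection matrix and keep the view angles whose image range contains it.
# The 3x4 projection-matrix convention is an assumption.
import torch

def visible_views(point_xyz, proj_matrices, image_hw):
    """point_xyz: (3,) geographic coordinate; proj_matrices: list of (3, 4) tensors."""
    h, w = image_hw
    targets = []
    for view_idx, P in enumerate(proj_matrices):
        uvw = P @ torch.cat([point_xyz, torch.ones(1)])   # homogeneous projection
        if uvw[2] <= 0:                                   # behind the camera
            continue
        u, v = (uvw[0] / uvw[2]).item(), (uvw[1] / uvw[2]).item()
        if 0 <= u < w and 0 <= v < h:                     # inside the image range
            targets.append((view_idx, u, v))              # target view angle and projection
    return targets
```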
12. The method of claim 10, wherein deriving the sample point feature of the current keypoint at the at least one second resolution from the current projection position in the second feature map of the current keypoint at the at least one second resolution at each target view angle comprises:
if the current projection position of the current key point in the current second feature map under the current second resolution of the current target view angle hits the current feature point in the current second feature map, taking the feature of the current feature point as an alternative feature of the current key point under the current second resolution;
If the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, interpolating to obtain the feature at the current projection position, and taking the feature as an alternative feature of the current key point under the current second resolution;
and obtaining the sampling point characteristics of the current key point under the current second resolution according to the alternative characteristics of the current key point obtained under each target view angle.
13. The method of claim 12, wherein obtaining the sample point feature of the current keypoint at the current second resolution from the candidate features of the current keypoint obtained at each target view angle comprises:
if the number of the obtained alternative features is a plurality of, pooling processing is carried out on each alternative feature to obtain the sampling point feature of the current key point under the current second resolution.
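A minimal sketch of claims 12 and 13 is given below: the feature at each projected position is read (with bilinear interpolation when the projection falls between feature points), and the candidate features from all target view angles are pooled. The use of grid_sample and of max pooling are assumptions about the interpolation and pooling operations.

```python
# Minimal sketch of claims 12-13: read the feature at each projected position
# (bilinear interpolation when it falls between feature points) and max-pool
# the candidate features from all target view angles.
import torch
import torch.nn.functional as F

def sample_point_feature(feature_maps, projections, image_hw):
    """feature_maps: dict view_idx -> (1, C, Hf, Wf); projections: [(view_idx, u, v)]."""
    h, w = image_hw
    candidates = []
    for view_idx, u, v in projections:
        fmap = feature_maps[view_idx]
        # Normalise the pixel position to [-1, 1] for grid_sample.
        gx, gy = 2.0 * u / (w - 1) - 1.0, 2.0 * v / (h - 1) - 1.0
        grid = torch.tensor([[[[gx, gy]]]], dtype=fmap.dtype)   # (1, 1, 1, 2)
        feat = F.grid_sample(fmap, grid, align_corners=True)    # (1, C, 1, 1), bilinear by default
        candidates.append(feat.flatten())                       # alternative (candidate) feature
    if not candidates:
        return None
    return torch.stack(candidates).max(dim=0).values            # pooled sampling point feature
```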
14. The method of claim 1, wherein a set number of decoding modules are connected in series in the decoding module set, and the number of serially connected decoding modules matches the number of second resolutions.
15. The method of claim 14, wherein inputting the sample point features at the first top view and at least one second resolution together into the set of decoding modules comprises:
Inputting the first top view to a first decoding module in the set of decoding modules;
and respectively inputting the characteristics of each sampling point under each second resolution into different decoding modules along the serial connection direction of the decoding modules according to the sequence from low resolution to high resolution.
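The wiring of the decoding module set in claims 14 and 15 can be sketched as follows; the names `decoders` and `sample_feats_low_to_high` are illustrative placeholders.

```python
# Minimal sketch of the cascaded decoding module set in claims 14-15: the first
# top view enters the first decoding module, and the sampling point features are
# routed to successive modules in order of increasing resolution.
def run_decoder_set(decoders, first_top_view, sample_feats_low_to_high):
    top_view = first_top_view
    for decoder, feats in zip(decoders, sample_feats_low_to_high):
        top_view = decoder(top_view, feats)   # each module fuses and outputs a new top view
    return top_view                           # output of the last module = second top view
```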
16. The method of claim 15, wherein, by each decoding module, a new top view is obtained by merging and outputting the top view according to the current input top view and the features of the sampling points at the current input target second resolution, and the method comprises:
the current input top view is subjected to scale adjustment according to the target second resolution through each decoding module, and an adjusted top view is obtained;
generating a first weight graph according to the adjusted top view;
weighting each sampling point characteristic under the second resolution of the currently input target according to the first weight graph to obtain a sampling point weighting characteristic;
and fusing the adjusted top view with the sampling point weighting characteristics to obtain a new top view and outputting the new top view.
17. The method of claim 16, wherein scaling the current input top view by each decoding module according to the target second resolution to obtain an adjusted top view comprises:
and convolving and interpolating the currently input top view according to the target second resolution to obtain an adjusted top view.
18. The method of claim 16, wherein generating a first weight map from the adjusted top view comprises:
and sequentially inputting the adjusted top views to a first target fully-connected network and a first logistic regression network to generate a first weight graph.
19. The method of claim 16, further comprising, after weighting each sample feature at the target second resolution of the current input according to the first weight map to obtain a sample weighted feature:
and adjusting the number of the characteristic channels of the weighted characteristics of the sampling points according to the number of the characteristic channels of the adjusted top view.
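For claims 16 to 19, a minimal sketch is given below: the incoming top view is convolved and interpolated to the target second resolution, a first weight map is produced by a linear layer followed by a sigmoid (standing in for the "logistic regression network"), the sampling point features are weighted by it, and their channel count is matched to the adjusted top view. The layer sizes and the choice of bilinear interpolation are assumptions.

```python
# Minimal sketch of claims 16-19. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopViewAdjust(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.weight_fc = nn.Linear(ch, 1)      # first weight map head
        self.channel_fc = nn.Linear(ch, ch)    # feature-channel adjustment

    def forward(self, top_view, sample_feats, target_hw):
        # top_view: (1, C, H, W); sample_feats: (1, C, Ht, Wt) at the target resolution.
        adjusted = F.interpolate(self.conv(top_view), size=target_hw,
                                 mode='bilinear', align_corners=False)
        w = torch.sigmoid(self.weight_fc(adjusted.permute(0, 2, 3, 1)))   # (1, Ht, Wt, 1)
        weighted = sample_feats * w.permute(0, 3, 1, 2)                   # weighted sampling point features
        weighted = self.channel_fc(weighted.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return adjusted, weighted
```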
20. The method of claim 19, wherein fusing the adjusted top view with the sample point weighting features to obtain a new top view and outputting, comprising:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view;
generating a second key name parameter and a second key value parameter corresponding to the sampling point weighting characteristic;
generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter;
And according to the second weight graph and the third weight graph, carrying out weighted summation on the first key value parameter and the second key value parameter, adopting a preset activation function to process the weighted summation result, obtaining a new top view and outputting the new top view.
21. The method of claim 20, wherein generating the first key name parameter and the first key value parameter corresponding to the adjusted top view comprises:
sequentially inputting the adjusted top view to a second target fully-connected network, a first Bayesian network and a first activation network to generate a first key name parameter;
sequentially inputting the adjusted top views to a third target fully-connected network and a second Bayesian network to generate a first key value parameter;
generating a second key name parameter and a second key value parameter corresponding to the sampling point weighting feature, including:
sequentially inputting the weighted characteristics of the sampling points into a fourth target full-connection network, a third Bayesian network and a second activation network to generate a second key name parameter;
and sequentially inputting the sampling point weighting characteristics into a fifth target fully-connected network and a fourth Bayesian network to generate a second key value parameter.
22. The method of claim 20, wherein generating a second weight map and a third weight map from the first key name parameter and the second key name parameter comprises:
Performing characteristic splicing on the first key name parameter and the second key name parameter to obtain a spliced key name parameter;
sequentially inputting the splicing key name parameters into a sixth target fully-connected network and a second logistic regression network to generate a combined weight graph;
and respectively extracting a second weight graph and a third weight graph from the combined weight graph.
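A minimal sketch of the fusion in claims 20 to 22 follows: key-name/key-value pairs are produced for the adjusted top view and the weighted sampling point features, the concatenated key names yield a joint weight map split into a second and a third weight map, and the key values are combined by a weighted sum followed by an activation. The 1x1 convolutions stand in for per-cell fully connected networks, BatchNorm2d stands in for the normalisation layers referred to in the translated claims, and the softmax split of the joint weight map is an assumption.

```python
# Minimal sketch of the gated fusion in claims 20-22. Layer choices are
# illustrative stand-ins, not the claimed networks.
import torch
import torch.nn as nn

class GatedTopViewFusion(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.key_a = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU())
        self.val_a = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))
        self.key_b = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU())
        self.val_b = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))
        self.weight_head = nn.Conv2d(2 * ch, 2, 1)   # joint weight map, two channels
        self.act = nn.ReLU()

    def forward(self, adjusted_top_view, weighted_sample_feats):
        ka, va = self.key_a(adjusted_top_view), self.val_a(adjusted_top_view)
        kb, vb = self.key_b(weighted_sample_feats), self.val_b(weighted_sample_feats)
        weights = torch.softmax(self.weight_head(torch.cat([ka, kb], dim=1)), dim=1)
        w2, w3 = weights[:, :1], weights[:, 1:]      # second and third weight maps
        return self.act(w2 * va + w3 * vb)           # new top view output by the decoding module
```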
23. The method of any of claims 1-22, further comprising, after obtaining the second top view through fusion according to the first top view and each sampling point feature:
and carrying out semantic segmentation on the second top view to obtain a category identification result of each grid unit in the second top view.
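The per-grid-cell semantic segmentation of claim 23 can be sketched as a simple classification head; the 1x1-convolution head and the class count are assumptions for illustration.

```python
# Minimal sketch of claim 23: a per-grid-cell classification head applied to the
# second top view.
import torch
import torch.nn as nn

class TopViewSegmentationHead(nn.Module):
    def __init__(self, ch=128, n_classes=10):
        super().__init__()
        self.classifier = nn.Conv2d(ch, n_classes, kernel_size=1)

    def forward(self, second_top_view):
        logits = self.classifier(second_top_view)   # (1, n_classes, H, W)
        return logits.argmax(dim=1)                 # category identification per grid cell
```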
24. An image segmentation method, comprising:
collecting a plurality of looking-around images under a plurality of visual angles through a plurality of looking-around cameras;
fusing the plurality of looking-around images to obtain a second top view by the image generation method as claimed in any one of claims 1 to 23;
and performing semantic segmentation on the second top view to obtain a category identification result of each grid cell in the second top view.
25. The method of claim 24, further comprising, after obtaining the category identification result for each grid cell in the second top view:
and generating driving decision information of the vehicle according to the category identification result of each grid cell in the second plan view.
26. An image generating apparatus comprising:
the multi-scale feature map acquisition module is used for generating multi-scale feature maps corresponding to each view angle respectively according to a plurality of view-around images acquired under the multi-view angles, wherein the multi-scale feature maps comprise a first feature map under a first resolution and at least one second feature map under a second resolution, and the second resolution is higher than the first resolution;
the first top view generation module is used for performing cross-view conversion on the first feature map under each view angle to generate a first top view;
the sampling module is used for sampling, in the second feature map under at least one second resolution under each view angle, according to the geographical position range of each grid unit in the first top view and the mapping relation between different view angles and the geographical position range, to obtain sampling point features of each grid unit in the first top view under at least one second resolution;
the second top view fusion module is used for fusing the first top view and the characteristics of each sampling point to obtain a second top view;
wherein, the second top view fuses the module, includes:
the joint input unit is used for inputting the characteristics of each sampling point under the first top view and at least one second resolution into the decoding module set together;
The serial output unit is used for obtaining a new top view through fusion according to the top view input currently and the characteristics of each sampling point under the target second resolution input currently through each decoding module in the decoding module set;
and the second top view acquisition unit is used for acquiring the output top view of the last decoding module as a second top view.
27. The apparatus of claim 26, wherein the multi-scale feature map acquisition module is configured to:
acquiring a plurality of panoramic images respectively acquired under multiple visual angles;
and respectively carrying out multi-scale feature extraction on the looking-around images under each view angle, and obtaining feature images of each looking-around image under a plurality of resolutions as multi-scale feature images respectively corresponding to each view angle.
28. The apparatus of claim 26, the first top view generation module comprising:
the global feature generation unit is used for generating a first global feature and a second global feature which respectively correspond to the first feature map under each view angle;
and the iteration generating unit is used for iteratively generating a first top view by adopting a multi-head attention mechanism according to each first global feature, each second global feature and a position coding value for describing the position relationship between the image space and the top view space.
29. The apparatus of claim 28, wherein the global feature generation unit is configured to:
generating first global features corresponding to the first feature diagrams under each view through a first full-connection network;
and generating second global features corresponding to the first feature diagrams under each view through a second fully connected network.
30. The apparatus of claim 28, wherein the iteration generating unit comprises:
the target key name determining subunit is used for determining each target key name parameter applied to the multi-head attention network according to the first global feature corresponding to each view angle respectively, the camera coding value corresponding to each view angle respectively and the preset pixel position coding value;
a target key value determining subunit, configured to determine each target key value parameter applied to the multi-head attention network according to a second global feature corresponding to each view angle;
and the multi-head attention iteration unit is used for iteratively generating a first top view by adopting a multi-head attention network according to each target key name parameter, each target key value parameter and a grid unit position coding value in a preset top view space.
31. The apparatus of claim 30, wherein the multi-headed attention iteration unit is configured to:
Under each iteration round, acquiring a first top view obtained by previous round iteration as a historical top view;
calculating to obtain target query parameters applied to the multi-head attention network according to the historical top view and grid cell position coding values in the top view space;
and calculating to obtain a first top view under the current iteration round according to each target key name parameter, each target key value parameter and the target query parameter by adopting a multi-head attention network.
32. The apparatus of claim 26, wherein the sampling module comprises:
the key point selection unit is used for acquiring geographic area ranges corresponding to each grid unit in the first top view respectively, and selecting a plurality of key points in each geographic area range respectively;
the sampling point feature acquisition unit is used for projecting each key point in each geographical area range to a second feature map under at least one second resolution under each view angle to obtain the sampling point feature of each key point under at least one second resolution;
and the sampling point feature summarizing unit is used for summarizing the sampling point features of each key point belonging to the same grid unit under the same second resolution to obtain the sampling point features of each grid unit under at least one second resolution in the first top view.
33. The apparatus of claim 32, wherein the keypoint selection unit is configured to:
acquiring a plane rectangular position range of each grid unit in the first top view;
and forming a cube region range corresponding to each grid unit according to the position range of each plane rectangle and the preset height value.
34. The apparatus of claim 33, wherein the keypoint selection unit is configured to:
and selecting a preset number of spherical neighborhood points as key points in each geographical area range.
35. The apparatus of claim 32, wherein the sampling point feature acquisition unit comprises:
the geographic position coordinate acquisition subunit is used for acquiring geographic position coordinates of the current key points in the current processing geographic area range;
a target view angle identification subunit, configured to identify, according to the geographic location coordinates, at least one target view angle from which a current key point can be captured;
and the sampling point characteristic obtaining subunit is used for obtaining the sampling point characteristic of the current key point under at least one second resolution according to the current projection position in the second characteristic diagram of the current key point under at least one second resolution under each target view angle.
36. The apparatus of claim 35, wherein the target perspective recognition subunit is configured to:
acquiring the projection position of the current key point under each view angle according to the geographic position coordinates and the camera projection matrix of each view angle;
and if the projection position of the current key point under the current view angle is positioned in the image range of the current view angle, determining the current view angle as the target view angle.
37. The apparatus of claim 35, wherein the sample point feature acquisition subunit is configured to:
if the current projection position of the current key point in the current second feature map under the current second resolution of the current target view angle hits the current feature point in the current second feature map, taking the feature of the current feature point as an alternative feature of the current key point under the current second resolution;
if the current projection position of the current key point in the current second feature map does not hit any feature point in the current second feature map, interpolating to obtain the feature at the current projection position, and taking the feature as an alternative feature of the current key point under the current second resolution;
and obtaining the sampling point characteristics of the current key point under the current second resolution according to the alternative characteristics of the current key point obtained under each target view angle.
38. The apparatus of claim 37, wherein the sample point feature acquisition subunit is further configured to:
if the number of the obtained alternative features is a plurality of, pooling processing is carried out on each alternative feature to obtain the sampling point feature of the current key point under the current second resolution.
39. The apparatus of claim 26, wherein a set number of decoding modules are connected in series in the decoding module set, and the number of serially connected decoding modules matches the number of second resolutions.
40. The apparatus of claim 39, wherein the joint input unit is configured to:
inputting the first top view to a first decoding module in the set of decoding modules;
and respectively inputting the characteristics of each sampling point under each second resolution into different decoding modules along the serial connection direction of the decoding modules according to the sequence from low resolution to high resolution.
41. The apparatus of claim 40, wherein the series output unit comprises:
the adjustment output subunit is used for performing scale adjustment on the current input top view according to the target second resolution through each decoding module to obtain an adjusted top view;
the first weight map generation subunit is used for generating a first weight map according to the adjusted top view;
The sampling point weighting characteristic obtaining subunit is used for carrying out weighting processing on each sampling point characteristic under the second resolution of the currently input target according to the first weight graph to obtain a sampling point weighting characteristic;
and the top view output subunit is used for fusing the adjusted top view with the sampling point weighting characteristics to obtain a new top view and outputting the new top view.
42. The apparatus of claim 41, wherein the adjustment output subunit is configured to:
and convolving and interpolating the currently input top view according to the target second resolution to obtain an adjusted top view.
43. The apparatus of claim 41, wherein the first weight map generation subunit is configured to:
and sequentially inputting the adjusted top views to a first target fully-connected network and a first logistic regression network to generate a first weight graph.
44. The apparatus of claim 41, further comprising a channel number adjusting unit configured to:
and after the sampling point characteristics under the current input target second resolution are weighted according to the first weight graph to obtain sampling point weighted characteristics, the number of characteristic channels of the sampling point weighted characteristics is adjusted according to the number of characteristic channels of the adjusted top view.
45. The apparatus of claim 44, wherein the top view output subunit is configured to:
generating a first key name parameter and a first key value parameter corresponding to the adjusted top view;
generating a second key name parameter and a second key value parameter corresponding to the sampling point weighting characteristic;
generating a second weight map and a third weight map according to the first key name parameter and the second key name parameter;
and according to the second weight graph and the third weight graph, carrying out weighted summation on the first key value parameter and the second key value parameter, adopting a preset activation function to process the weighted summation result, obtaining a new top view and outputting the new top view.
46. An image segmentation apparatus comprising:
the looking-around image acquisition module is used for acquiring a plurality of looking-around images under multiple visual angles through a plurality of looking-around cameras;
a fusion module, configured to fuse the plurality of looking-around images to obtain a second top view through the image generation method according to any one of claims 1 to 23;
the identification module is used for carrying out semantic segmentation on the second top view and obtaining a category identification result of each grid unit in the second top view.
47. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions for execution by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-23.
48. An in-vehicle terminal, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions for execution by the at least one processor to enable the at least one processor to perform the method of any one of claims 24-25.
49. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image generation method according to any one of claims 1-23 or to perform the image segmentation method according to any one of claims 24-25.
CN202310010749.3A 2023-01-05 2023-01-05 Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium Active CN115909255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310010749.3A CN115909255B (en) 2023-01-05 2023-01-05 Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium

Publications (2)

Publication Number Publication Date
CN115909255A CN115909255A (en) 2023-04-04
CN115909255B true CN115909255B (en) 2023-06-06

Family

ID=86471232

Country Status (1)

Country Link
CN (1) CN115909255B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986234A (en) * 2020-08-20 2020-11-24 各珍珍 Self-adaptive resolution ratio livestock video information processing method based on artificial intelligence
CN113283525A (en) * 2021-06-07 2021-08-20 郑健青 Image matching method based on deep learning
CN114913498A (en) * 2022-05-27 2022-08-16 南京信息工程大学 Parallel multi-scale feature aggregation lane line detection method based on key point estimation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10861176B2 (en) * 2018-11-27 2020-12-08 GM Global Technology Operations LLC Systems and methods for enhanced distance estimation by a mono-camera using radar and motion data
CN112750076B (en) * 2020-04-13 2022-11-15 奕目(上海)科技有限公司 Light field multi-view image super-resolution reconstruction method based on deep learning
WO2022155899A1 (en) * 2021-01-22 2022-07-28 深圳市大疆创新科技有限公司 Target detection method and apparatus, movable platform, and storage medium
CN113570608B (en) * 2021-06-30 2023-07-21 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
CN115187454A (en) * 2022-05-30 2022-10-14 元潼(北京)技术有限公司 Multi-view image super-resolution reconstruction method and device based on meta-imaging
CN115203460A (en) * 2022-07-05 2022-10-18 中山大学·深圳 Deep learning-based pixel-level cross-view-angle image positioning method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant