CN114677507A - Street view image segmentation method and system based on bidirectional attention network - Google Patents
- Publication number
- CN114677507A (application CN202210236443.5A)
- Authority
- CN
- China
- Prior art keywords
- street view
- network
- image
- street
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a street view image segmentation method and system based on a bidirectional attention network, belonging to the field of image segmentation processing. Urban street image data are first collected and preprocessed to construct a training data set. A bidirectional attention network segmentation model is then established which takes a street view image as input and outputs the corresponding prediction map; the model comprises a semantic information extraction network with an attention mechanism, a spatial information extraction network, and a multi-scale fusion module. The prediction map is an image in which different street view features are marked with different colors and street view features of the same type are marked with the same color. The model is trained with the training data set; street view images to be segmented are then collected and input into the model to output prediction maps; and each prediction map is superimposed on the original street view image to be segmented to generate the final street view segmentation overlay image. The method can improve the segmentation precision of street view images and help ensure the safe driving of autonomous vehicles.
Description
Technical Field
The invention relates to the field of image segmentation processing, in particular to a street view image segmentation method and system based on a bidirectional attention network.
Background
Image segmentation is an image processing method that divides an image into a number of mutually disjoint regions according to features of the image such as color, texture and shape, so that the features within the same region show consistency or similarity while different regions show obvious differences. Street view image segmentation for road scenes is one of the core technologies in the field of automatic driving.
Semantic segmentation for road scenes assigns each pixel in a collected road scene image to its corresponding category, thereby classifying the street view image at the pixel level. In practice, the accuracy of street view image segmentation is affected by the driving environment: first, the dissimilarity among target objects of the same class and the similarity among target objects of different classes must be overcome; second, the complexity of the scene containing the segmented objects must be taken into account; and finally, external factors such as illumination, shooting conditions, shooting equipment and shooting distance can cause large variations in how a target object appears in the picture, which affects the accuracy of image segmentation. These factors greatly increase the difficulty of image segmentation, so the segmentation precision on complex street view images is low, and an autonomous vehicle cannot acquire accurate road condition information from the street view features, which hinders the realization of automatic driving technology.
Therefore, how to improve the segmentation precision of street view images, so that accurate road condition information can be provided to an autonomous vehicle based on the segmented street view features and the vehicle can drive safely, is a problem to be solved urgently.
Disclosure of Invention
The invention aims to provide a street view image segmentation method and system based on a bidirectional attention network that improve the accuracy of street view image segmentation, thereby providing accurate road condition information to an autonomous vehicle and ensuring that it can drive safely. This solves the problem in the prior art that low segmentation accuracy on street view images prevents an autonomous vehicle from acquiring accurate road condition information and makes driving unsafe.
In order to achieve the purpose, the invention provides the following scheme:
on one hand, the invention provides a street view image segmentation method based on a bidirectional attention network, which comprises the following steps:
acquiring urban street image data, and constructing a training data set after preprocessing;
establishing a bidirectional attention network segmentation model; the bidirectional attention network segmentation model comprises a multi-scale fusion module, a semantic information extraction network with an attention mechanism and a spatial information extraction network with an attention mechanism, wherein the multi-scale fusion module is used for fusing the semantic information extracted by the semantic information extraction network and the spatial information extracted by the spatial information extraction network; the bidirectional attention network segmentation model takes a street view image as input and takes the prediction map corresponding to the street view image as output; the prediction map is an image in which different street view features are marked with different colors, and street view features of the same type are marked with the same color;
Training the bidirectional attention network segmentation model by using the training data set to obtain a well-trained bidirectional attention network segmentation model;
collecting street view images to be segmented, inputting the street view images to be segmented into a trained bidirectional attention network segmentation model, and obtaining a prediction map corresponding to the street view images to be segmented;
and superimposing the prediction map on the street view image to be segmented to generate the final street view segmentation overlay image.
Optionally, the acquiring image data of urban streets, and constructing a training data set after preprocessing includes:
acquiring an urban street video by using a vehicle-mounted camera, and extracting a plurality of urban street images at different moments from the urban street video data according to a preset frequency;
determining all street view feature categories to be semantically segmented under the urban street scene according to the extracted urban street images;
and marking the street view features in each city street image with the open source toolkit VoTT according to the street view feature categories to be semantically segmented.
Optionally, the establishing a bidirectional attention network segmentation model specifically includes:
building a ResNet18 neural network as the semantic information extraction network using the PyTorch framework, and pre-training the ResNet18 neural network on the open source data set ImageNet to obtain and store the optimal parameters of the ResNet18 neural network;
replacing the last FC layer of an existing AlexNet network with a Sigmoid layer to obtain the spatial information extraction network, the Sigmoid layer being used to output the spatial information and semantic information;
converting the image features input into the attention module into three groups of vectors Q, K and V by linear transformation, and computing the attention weight matrix of the vectors Q and K; after the attention weight matrix is normalized with a ReLU function, weighting the vector V to complete the attention weighting;
and computing the cross attention between the spatial information and the semantic information, normalizing the cross attention as a weight applied to the semantic information, upsampling the weighted semantic information to the same size as the spatial information, and fusing the semantic information and spatial information by accumulation to obtain the multi-scale fusion module.
Optionally, after the semantic information extraction network and the spatial information extraction network are obtained, a Dice Loss function is used to train the semantic information extraction network, and a cross entropy Loss function is used to train the spatial information extraction network.
Optionally, the two-class form of the Dice Loss function is expressed as:

$$L_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$

wherein X and Y respectively represent the input and output of the network, |X ∩ Y| represents the intersection between X and Y, and |X| and |Y| respectively represent the number of pixel points in the sets X and Y;

the two-class form of the cross entropy loss function is defined as:

$$L_{CE} = -\sum_{i}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$$

wherein p represents the predicted probability of each street view feature class, $y_i$ represents the label of sample i (1 for the positive class and 0 for the negative class), and $p_i$ represents the probability that sample i is predicted as the positive class;

the final loss function is then expressed as:

$$L(X; W) = L_y(X; W) + \sum_{n=1}^{N} L_{s_n}(X; W)$$

wherein $L_y(X; W)$ represents the loss function of the semantic information, $L_{s_n}(X; W)$ represents the loss function of the n-th layer of spatial information, N represents the number of layers of spatial information, and W represents the parameters of the segmentation network.
Optionally, after the step of acquiring the video of the city street by using the vehicle-mounted camera and extracting a plurality of city street images at different times from the video data of the city street according to the preset frequency, the method further includes:
and expanding each extracted urban street image by rotation through a random angle, horizontal or vertical flipping, and random cropping.
Optionally, the street view feature categories to be semantically segmented include highways, sidewalks, parking lots, railways, people, cars, trucks, buses, trains, motorcycles, bicycles, caravans, trailers, buildings, walls, fences, guardrails, bridges, tunnels, poles, traffic signs, traffic lights, vegetation, sky, and others.
Optionally, the annotation format is the JSON format of the COCO data set.
Optionally, before the step of training the bidirectional attention network segmentation model by using the training data set, the method further includes:
and performing data enhancement processing on the training data set, including flipping transformation, color jittering, translation transformation and contrast transformation.
On the other hand, the invention also provides a street view image segmentation system based on the bidirectional attention network, which comprises the following components:
the training data set construction module is used for acquiring image data of urban streets and constructing a training data set after preprocessing;
the model establishing module is used for establishing a bidirectional attention network segmentation model; the bidirectional attention network segmentation model comprises a multi-scale fusion module, a semantic information extraction network with an attention mechanism and a spatial information extraction network with an attention mechanism, wherein the multi-scale fusion module is used for fusing the semantic information extracted by the semantic information extraction network and the spatial information extracted by the spatial information extraction network; the bidirectional attention network segmentation model takes a street view image as input and takes the prediction map corresponding to the street view image as output; the prediction map is an image in which different street view features are marked with different colors, and street view features of the same type are marked with the same color;
The model training module is used for training the bidirectional attention network segmentation model by utilizing the training data set to obtain a trained bidirectional attention network segmentation model;
the model application module is used for collecting street view images to be segmented and inputting them into the trained bidirectional attention network segmentation model to obtain the prediction map corresponding to each street view image to be segmented;
and the prediction map and original image superposition module is used for superimposing the prediction map on the street view image to be segmented to generate the final street view segmentation overlay image.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a street view image segmentation method and system based on a bidirectional attention network. A segmentation model based on a bidirectional attention network is constructed, in which a semantic information extraction network and a spatial information extraction network extract the semantic information and spatial information in a street view image separately and in a targeted manner; the attention mechanism ensures the accuracy of the semantic and spatial information extracted by the model and further improves the segmentation precision of the bidirectional attention network segmentation model. The multi-scale fusion module then fuses the semantic information and spatial information, and a prediction map is output in which each type of street view feature in the street view image is marked with its own color. Finally, the prediction map is superimposed and fused with the original image, yielding a better segmentation effect. The method improves the segmentation accuracy and effect on street view images, provides accurate road condition information to an autonomous vehicle so that it can accurately identify people and objects in complex street scenes, ensures safe driving, and solves the problem in the prior art that low segmentation accuracy on street view images prevents an autonomous vehicle from obtaining accurate road condition information and makes driving unsafe.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The following drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a flowchart of a street view image segmentation method according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a street view image segmentation method according to embodiment 1 of the present invention;
fig. 3 is a schematic structural diagram of a bidirectional attention network segmentation model provided in embodiment 1 of the present invention;
fig. 4 is a schematic structural diagram of an attention mechanism provided in embodiment 1 of the present invention;
fig. 5 is a schematic structural diagram of a multi-scale fusion module provided in embodiment 1 of the present invention;
fig. 6 is a block diagram of a street view image segmentation system according to embodiment 2 of the present invention.
The reference numbers illustrate:
M1 - training data set construction module; M2 - model establishing module; M3 - model training module; M4 - model application module; M5 - prediction map and original image superposition module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As used in this disclosure and in the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In general, the terms "comprises" and "comprising" indicate only that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and the method or apparatus may also comprise other steps or elements.
Although the present invention makes various references to certain modules in a system according to embodiments of the present invention, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative and different aspects of the systems and methods may use different modules.
Flowcharts are used in the present disclosure to illustrate the operations performed by the system according to embodiments of the present invention. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Meanwhile, other operations may be added to or removed from these processes.
In deep learning, image segmentation can be formulated as pixel-level classification with semantic labels (semantic segmentation) or as segmentation of individual object instances (instance segmentation). Semantic segmentation performs pixel-level labeling of all image pixels with a set of object classes (e.g., people, cars, trees, sky). This approach is therefore generally more difficult than image classification, which predicts a single label for the entire street view image. Instance segmentation further expands the scope of semantic segmentation by detecting and delineating each object of interest in the image.
Existing image segmentation methods mainly comprise threshold-based segmentation, edge-based segmentation, region-based segmentation, unsupervised clustering, deep learning, and other segmentation methods based on specific theories. The threshold method, the edge detection method, the region growing method and the clustering method each have their advantages, but they require considerable human-computer interaction to complete target extraction during segmentation, and they suffer from poor noise and interference resistance, weak self-learning capability and poor segmentation accuracy. Therefore, these conventional image segmentation methods are not suitable for the field of unmanned driving and cannot meet its requirements on accuracy and real-time performance.
The invention aims to provide a street view image segmentation method and system based on a bidirectional attention network, which can improve the accuracy of street view image segmentation, thereby providing accurate road condition information for an automatic driving automobile, further ensuring that the automatic driving automobile can safely drive, and solving the problems that the automatic driving automobile cannot acquire the accurate road condition information and is unsafe to drive due to low segmentation accuracy of the street view image in the prior art.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
As shown in fig. 1 and fig. 2, the present embodiment provides a street view image segmentation method based on a bidirectional attention network, which specifically includes the following steps:
and step S1, collecting image data of urban streets, and constructing a training data set after preprocessing. The method specifically comprises the following steps:
s1.1, collecting urban street videos by using a vehicle-mounted camera, and extracting a plurality of urban street images at different moments from the urban street video data according to a preset frequency.
In this embodiment, after the step of acquiring the video of the city street by using the vehicle-mounted camera and extracting a plurality of city street images at different times from the video data of the city street according to the preset frequency, the method may further include the step S1.2:
And S1.2, expanding each extracted urban street image by rotation through a random angle, horizontal or vertical flipping, and random cropping.
And S1.3, determining all street view feature categories to be semantically segmented under the scene of the city street according to the expanded image of the city street.
In this embodiment, the street view feature categories to be semantically segmented include highways, sidewalks, parking lots, railways, people, cars, trucks, buses, trains, motorcycles, bicycles, caravans, trailers, buildings, walls, fences, guardrails, bridges, tunnels, poles, traffic signs, traffic lights, vegetation, sky, and others, and the categories can be set according to the actual conditions of different streets.
And S1.4, according to the street view feature categories to be semantically segmented, marking the street view features in each city street image with the open source toolkit VoTT to obtain the training data set.
In this embodiment, the training data set includes a training set, a verification set and a test set, which are used respectively for training the bidirectional attention network segmentation model and for verifying and testing its segmentation effect. Labeling the street view feature categories produces the real labels (ground truth) of the street view images for network training and for testing the segmentation effect.
In this embodiment, when the street view features are labeled, the annotation format adopts the JSON format of the COCO data set; the Pascal VOC data set format may also be adopted.
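For illustration, a minimal COCO-style annotation for one street view image might look like the sketch below, expressed as a Python dictionary before serialization to JSON; all field values here are hypothetical.

```python
# Hypothetical COCO-style annotation for a single street view image;
# file names, sizes and polygon coordinates are made up for illustration.
annotation = {
    "images": [{"id": 1, "file_name": "street_0001.jpg",
                "width": 2048, "height": 1024}],
    "categories": [{"id": 1, "name": "road"}, {"id": 2, "name": "person"}],
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "segmentation": [[0, 900, 2048, 900, 2048, 1024, 0, 1024]],  # polygon
        "area": 253952.0,
        "iscrowd": 0,
    }],
}
```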
After the training data set is obtained, the method can further comprise the following steps of S1.5:
and S1.5, performing data enhancement processing of turning transformation, color dithering, translation transformation and contrast transformation on the training data set. By carrying out data cleaning and data enhancement processing on the training data set, the robustness of the training data can be effectively improved, and the segmentation precision of the bidirectional attention network segmentation model is further improved.
And step S2, establishing a bidirectional attention network segmentation model.
In this embodiment, the bidirectional attention network segmentation model includes a multi-scale fusion module, a semantic information extraction network with an attention mechanism, and a spatial information extraction network with an attention mechanism, where the multi-scale fusion module fuses the semantic information extracted by the semantic information extraction network and the spatial information extracted by the spatial information extraction network. The bidirectional attention network segmentation model takes a street view image as input and the prediction map corresponding to the street view image as output; the prediction map is an image in which different street view features are marked with different colors, and street view features of the same type are marked with the same color.
The bidirectional attention network segmentation model established by the invention is shown in fig. 3, and the construction process of the bidirectional attention network segmentation model specifically comprises the following steps:
and S2.1, constructing a semantic information extraction network. The method specifically comprises the following steps:
a ResNet18 neural network is built by adopting a Pythroch frame to serve as a semantic information extraction network, pre-training is carried out on the ResNet18 neural network by utilizing an open source data set ImgeNet, and the optimal parameters of the ResNet18 neural network are obtained and stored.
And S2.2, constructing a spatial information extraction network. The method specifically comprises the following steps:
Based on an existing AlexNet network, the last fully-connected (FC) layer is replaced with a Sigmoid layer to obtain the spatial information extraction network; the Sigmoid layer is used to output the spatial information and semantic information.
The fully-connected FC layer generates street view feature category probabilities, whereas the task of the invention is to generate a probability map (mask) after street view image segmentation. The FC layer therefore needs to be removed: it can be deleted directly, and the Sigmoid layer serves as the output layer in its place.
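A sketch of the modified AlexNet is given below; since the text does not state how the convolutional features are projected back to a single-channel map, the 1x1 convolution head here is an assumption.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Keep AlexNet's convolutional layers, drop the FC classifier, and end in a
# Sigmoid output layer producing a probability map. The 1x1 projection to a
# single channel is an illustrative assumption.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
spatial_net = nn.Sequential(
    alexnet.features,                  # convolutional feature extractor
    nn.Conv2d(256, 1, kernel_size=1),  # assumed projection to one channel
    nn.Sigmoid(),                      # Sigmoid layer replacing the FC layer
)

x = torch.randn(1, 3, 512, 1024)
mask = spatial_net(x)                  # probability map with values in [0, 1]
```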
And S2.3, constructing the attention mechanism. The attention mechanism is built from the attention module, and its construction specifically comprises the following steps:
The image features input into the attention module are converted into three groups of vectors Q, K and V by linear transformation, and the attention weight matrix of the vectors Q and K is computed; after the attention weight matrix is normalized with a ReLU function, the vector V is weighted to complete the attention weighting. In the attention module, the whitened pairwise terms W_k and W_q represent the relationship between two pixels, and the unary term W_v represents the saliency of each pixel, as shown in Fig. 4.
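The following is a minimal sketch of such an attention module. It uses 1x1 convolutions as the linear transformations for Q, K and V and follows the text in normalizing the Q-K weight matrix with ReLU (rather than the more common softmax); the channel sizes, the sum-to-one rescaling and the residual connection are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Q/K/V attention with ReLU-normalized weights, per the description."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # linear transform -> Q (W_q)
        self.k = nn.Conv2d(channels, channels, 1)  # linear transform -> K (W_k)
        self.v = nn.Conv2d(channels, channels, 1)  # linear transform -> V (W_v)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (b, hw, c)
        k = self.k(x).flatten(2)                   # (b, c, hw)
        v = self.v(x).flatten(2).transpose(1, 2)   # (b, hw, c)
        attn = F.relu(torch.bmm(q, k))             # ReLU-normalized Q-K weights
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-6)  # assumed rescaling
        out = torch.bmm(attn, v)                   # weight V with the attention
        return out.transpose(1, 2).reshape(b, c, h, w) + x     # assumed residual

y = Attention(512)(torch.randn(1, 512, 16, 32))    # usage example
```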
And S2.4, constructing a multi-scale fusion module. The method specifically comprises the following steps:
The cross attention between the spatial information and the semantic information is computed and normalized as a weight applied to the semantic information; the weighted semantic information is upsampled to the same size as the spatial information, and the semantic information and spatial information are fused by accumulation to obtain the multi-scale fusion module. The structure of the multi-scale fusion module is shown in Fig. 5.
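A sketch of such a fusion module appears below. The text does not define the exact form of the cross attention, so this sketch assumes a channel-attention formulation (pooled descriptors of both branches passed through a 1x1 convolution and a sigmoid) and equal channel counts in the two branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Cross-attention fusion of spatial and semantic features (a sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.cross = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, spatial, semantic):
        # Cross attention from pooled descriptors of both branches,
        # normalized with a sigmoid (an assumed formulation).
        s_desc = F.adaptive_avg_pool2d(spatial, 1)
        m_desc = F.adaptive_avg_pool2d(semantic, 1)
        w = torch.sigmoid(self.cross(torch.cat([s_desc, m_desc], dim=1)))
        weighted = semantic * w                      # weight the semantic features
        up = F.interpolate(weighted, size=spatial.shape[2:],
                           mode="bilinear", align_corners=False)  # upsample
        return spatial + up                          # accumulation fusion

fuse = FusionModule(128)
out = fuse(torch.randn(1, 128, 64, 128), torch.randn(1, 128, 16, 32))
```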
The method combines a self-attention structure with a convolutional neural network to construct the semantic information extraction network, preserving the convolutional network's inductive bias on images while the self-attention mechanism captures long-range dependencies. Next, a channel attention mechanism is used to construct the spatial information extraction network, which extracts spatial information independently. Finally, the multi-scale fusion module fuses the spatial information and semantic information, combining global semantics with detailed features and ensuring the high segmentation precision and segmentation effect of the bidirectional attention network segmentation model.
And S3, training the bidirectional attention network segmentation model by using the training data set to obtain the trained bidirectional attention network segmentation model.
After a bidirectional attention network segmentation model is established, the bidirectional attention network segmentation model is trained by adopting a multi-scale fusion training strategy, and the multi-scale fusion training strategy specifically comprises the following steps:
s3.1, training the semantic information extraction network by adopting a Dice Loss function as a main Loss function; meanwhile, a cross entropy loss function is adopted as a secondary loss function, and the spatial information extraction network is trained.
The Dice Loss function is used for evaluating similarity measurement between two samples, the value range is 0-1, and the larger the function value is, the higher the similarity of the two samples is. The two categories defined by the Dice Loss function are expressed as follows:
wherein, X and Y respectively represent the input and output of the network, | X # Y | represents the intersection between X and Y, | X | and | Y | respectively represent the number of pixel points in the set X and Y.
The cross-entropy function is defined as a measure of the difference between two probability distributions for a given set of random variables or events. The cross entropy function is widely used for classification tasks, and the image segmentation is used for classifying the pixel level of image features, so that the cross entropy function has a good effect when being subjected to binary classification. The cross entropy loss function is defined by two categories as:
Wherein p represents the predicted probability of each street view feature class, yiA label representing a sample i, with a positive class of 1 and a negative class of 0; p is a radical of formulaiIndicating the probability that sample i is predicted as positive.
The final Loss function obtained by combining the Dice Loss function and the cross entropy function is represented as:
wherein L isy(X; W) represents a loss function of semantic information,a loss function representing spatial information, N representing the number of layers of spatial information, and W representing a predicted value of the partition network.
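A hedged implementation of these losses is sketched below; the unweighted sum of the semantic loss and the per-layer spatial losses follows the formula above, while the smoothing and clamping constants are assumptions.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # 1 - 2|X ∩ Y| / (|X| + |Y|), computed on soft predictions
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(p, y):
    # -(y log p + (1 - y) log(1 - p)), averaged over pixels
    p = p.clamp(1e-6, 1 - 1e-6)        # avoid log(0); assumed safeguard
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

def total_loss(sem_pred, sem_gt, spatial_preds, spatial_gts):
    loss = dice_loss(sem_pred, sem_gt)              # L_y(X; W), semantic branch
    for p, y in zip(spatial_preds, spatial_gts):    # N layers of spatial info
        loss = loss + bce_loss(p, y)                # L_{s_n}(X; W)
    return loss
```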
S3.2, the network is trained with stochastic gradient descent (SGD), with the initial learning rate set to 0.0001, the momentum to 0.9, and the weight decay to 0.8.
It should be understood that specific values such as the initial learning rate, momentum and weight decay set in this embodiment are only one example set of values; they may be set to other values according to actual conditions.
And S3.3, adjusting the learning rate by adopting a Warmup learning rate adjustment strategy, and resetting the learning rate every 10 rounds (epoch). By dynamically adjusting the learning rate, the method can better adapt to the training process of network parameters, accelerate the convergence speed of the bidirectional attention network segmentation model and improve the calculation efficiency of the neural network algorithm.
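In code, the optimizer and schedule might look like the sketch below. The stated values (learning rate 0.0001, momentum 0.9, weight decay 0.8) come from the text; the linear shape of the warmup curve and the placeholder model are assumptions, since the text names the Warmup strategy and the 10-epoch reset but not the exact curve.

```python
import torch

model = torch.nn.Conv2d(3, 25, kernel_size=1)   # placeholder for the network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=0.8)

def warmup_lr(epoch: int, base_lr: float = 1e-4,
              warmup_epochs: int = 5, period: int = 10) -> float:
    e = epoch % period                  # reset the learning rate every 10 epochs
    if e < warmup_epochs:               # assumed linear warmup phase
        return base_lr * (e + 1) / warmup_epochs
    return base_lr

# Before each epoch:
#   for g in optimizer.param_groups: g["lr"] = warmup_lr(epoch)
```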
After a multi-scale fusion training strategy is determined, training a bidirectional attention network segmentation model by using a training data set, which specifically comprises the following steps:
and S3.4, dividing the training data set into a training set, a verification set and a test set according to a preset proportion.
The preset ratio in this embodiment is 7:2:1, i.e. the training data set is divided into a training set, a verification set and a test set in the ratio 7:2:1; the preset ratio can be set according to actual conditions.
Step S3.5, the batch size is preferably 32, i.e. 32 street view pictures are input into the model for each training step, and once 32 pictures have passed through the model, the training of one batch is considered complete. When all street view picture samples in the training data set have been trained once, the training of one epoch is considered complete. The number of epochs is set to 5000, which is a preferred value and may also be set to other values.
And S3.6, when the maximum iteration number is reached, i.e. epoch = 5000, training is stopped, and the training parameters of the bidirectional attention network segmentation model at this point are stored to obtain the trained bidirectional attention network segmentation model. Steps S3.4 to S3.6 are sketched in code after this paragraph.
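The sketch below uses dummy tensors standing in for the annotated street view data and a placeholder model; only the 7:2:1 split, the batch size of 32 and the 5000-epoch limit come from the text.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

full_dataset = TensorDataset(torch.randn(100, 3, 64, 64),          # dummy images
                             torch.randint(0, 25, (100, 64, 64)))  # dummy labels
n = len(full_dataset)
n_train, n_val = int(0.7 * n), int(0.2 * n)
train_set, val_set, test_set = random_split(                       # 7:2:1 split
    full_dataset, [n_train, n_val, n - n_train - n_val])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = torch.nn.Conv2d(3, 25, kernel_size=1)                      # placeholder
for epoch in range(5000):                 # stop at the maximum iteration number
    for images, labels in train_loader:   # one batch = 32 street view pictures
        pass                              # forward, loss, backward, step go here
torch.save(model.state_dict(), "bianet_final.pth")  # store training parameters
```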
By establishing the bidirectional attention network segmentation model, which simplifies a complex end-to-end network structure into a semantic information extraction network and a spatial information extraction network with attention mechanisms, the model structure becomes simpler, the training process is effectively simplified, the real-time inference speed and segmentation precision of the model are improved, and its adaptability to complex street scenes is increased.
It should be noted that, in addition to the bidirectional attention network segmentation model adopted in the present application, existing segmentation models based on an FCN network, a DeepLab v1 network, a U-Net network or a SegNet network may be adopted. The FCN segmentation model uses pooling layers in a convolutional neural network, which enlarge the receptive field while fusing features, but its drawback is that continuous down-sampling loses image feature details, which greatly affects the segmentation result. The DeepLab v1 segmentation model applies dilated (atrous) convolution to the VGG16 network: the fully-connected layers of VGG16 are converted into convolutional layers, and all convolutional layers after the fourth and fifth pooling layers are adjusted to dilated convolutions with different dilation rates, restoring the receptive field to the size of the original image and thereby improving segmentation accuracy. The U-Net segmentation model has distinct but mutually matched encoding and decoding structures, which improve image feature detail and recover the segmentation result; its drawback is that it can only process 2D images, while the V-Net network proposed on its basis can handle 3D scenes. The SegNet segmentation model adopts a VGG16 network to output dense feature maps and recovers the segmented dense map from sparse images by convolution computation. These segmentation models based on different networks can replace the bidirectional attention network segmentation model in the present application; they have different image segmentation effects and can be selected according to actual requirements.
And step S4, collecting street view images to be segmented, inputting the street view images to be segmented into the trained bidirectional attention network segmentation model, and obtaining the prediction map corresponding to each street view image to be segmented.
In this embodiment, after the trained bidirectional attention network segmentation model is obtained, street view images to be segmented are collected in real time. Each street view image to be segmented can be adjusted to a preset size so that it can be input directly into the trained bidirectional attention network segmentation model, which then outputs the corresponding prediction map. The prediction map in effect predicts the category of each street view feature, segmenting and labeling the street view features of each category within the whole street view image; this is mainly embodied as framing different street view features in different colors in the prediction map.
And step S5, superimposing the prediction map on the street view image to be segmented to generate the final street view segmentation overlay image.
In this embodiment, the final street view segmentation overlay image is presented by superimposing the prediction map on the original street view image to be segmented. On the basis of the original street view image, the different types of street view features framed in different colors display the people and objects in the real street scene more stereoscopically and accurately. Once the autonomous vehicle accurately identifies the characteristics of these people and objects, it can drive safely and stably, which benefits the development of unmanned driving technology.
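A sketch of this overlay step is given below: the class-index prediction map is colorized with a per-class palette (same class, same color) and alpha-blended over the original image. The palette values, class count and blending ratio are illustrative assumptions.

```python
import numpy as np
import cv2

def overlay(image_bgr: np.ndarray, pred: np.ndarray, alpha: float = 0.5):
    """Blend a (H, W) class-index map over a (H, W, 3) street view image."""
    palette = np.random.RandomState(0).randint(
        0, 256, size=(25, 3), dtype=np.uint8)   # one fixed color per class
    color_mask = palette[pred]                   # same class -> same color
    return cv2.addWeighted(image_bgr, 1 - alpha, color_mask, alpha, 0)

image = np.zeros((1024, 2048, 3), dtype=np.uint8)   # stand-in original image
pred = np.random.randint(0, 25, (1024, 2048))       # stand-in prediction map
result = overlay(image, pred)                       # final segmentation overlay
```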
Example 2
As shown in fig. 6, the present embodiment provides a street view image segmentation system based on a bidirectional attention network, which corresponds to the street view image segmentation method in embodiment 1, and the system specifically includes:
the training data set construction module M1 is used for acquiring urban street image data, and constructing a training data set after preprocessing;
the model establishing module M2 is used for establishing a bidirectional attention network segmentation model; the bidirectional attention network segmentation model comprises a multi-scale fusion module, a semantic information extraction network with an attention mechanism and a spatial information extraction network with an attention mechanism, wherein the multi-scale fusion module is used for fusing the semantic information extracted by the semantic information extraction network and the spatial information extracted by the spatial information extraction network; the bidirectional attention network segmentation model takes a street view image as input and takes the prediction map corresponding to the street view image as output; the prediction map is an image in which different street view features are marked with different colors, and street view features of the same type are marked with the same color;
the model training module M3 is configured to train the bidirectional attention network segmentation model by using the training data set, so as to obtain a trained bidirectional attention network segmentation model;
The model application module M4 is used for collecting street view images to be segmented and inputting them into the trained bidirectional attention network segmentation model to obtain the prediction map corresponding to each street view image to be segmented;
and the prediction map and original image superposition module M5 is used for superimposing the prediction map on the street view image to be segmented to generate the final street view segmentation overlay image.
The invention provides a street view image segmentation method and system based on a bidirectional attention network, which comprise the following steps: first, urban street images are collected and preprocessed to construct a training data set; second, a bidirectional attention network segmentation model is established, the loss function is optimized during network training, and the model is updated and optimized to obtain the optimal model weights; finally, street view images to be segmented are acquired and segmented with the trained bidirectional attention network segmentation model, and each prediction map is superimposed on the street view image to be segmented to generate the final street view segmentation overlay image. The method can be applied to the field of unmanned driving: it improves the accuracy of street view image segmentation and provides more accurate data for the subsequent decisions of the visual perception system of an autonomous vehicle, thereby improving the system's adaptability to complex environments and the accuracy and speed of segmenting pedestrians, vehicles, traffic signs and the like, and ensuring the safe driving of the autonomous vehicle.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications, as well as other embodiments, are intended to be included within the scope of this invention as defined in the claims and their equivalents.
Claims (10)
1. A street view image segmentation method based on a bidirectional attention network is characterized by comprising the following steps:
acquiring image data of urban streets, and constructing a training data set after preprocessing;
establishing a bidirectional attention network segmentation model; the bidirectional attention network segmentation model comprises a multi-scale fusion module, a semantic information extraction network with an attention mechanism and a spatial information extraction network with an attention mechanism, wherein the multi-scale fusion module is used for fusing the semantic information extracted by the semantic information extraction network and the spatial information extracted by the spatial information extraction network; the bidirectional attention network segmentation model takes a street view image as input and takes the prediction map corresponding to the street view image as output; the prediction map is an image in which different street view features are marked with different colors, and street view features of the same type are marked with the same color;
training the bidirectional attention network segmentation model by using the training data set to obtain a well-trained bidirectional attention network segmentation model;
collecting street view images to be segmented, inputting the street view images to be segmented into a trained bidirectional attention network segmentation model, and obtaining a prediction map corresponding to the street view images to be segmented;
and superimposing the prediction map on the street view image to be segmented to generate the final street view segmentation overlay image.
2. The streetscape image segmentation method according to claim 1, wherein the collecting city street image data, the preprocessing, and then constructing the training data set specifically include:
acquiring an urban street video by using a vehicle-mounted camera, and extracting a plurality of urban street images at different moments from the urban street video data according to a preset frequency;
determining all street view feature categories to be semantically segmented under the urban street scene according to the extracted urban street images;
and marking the street view features in each city street image with the open source toolkit VoTT according to the street view feature categories to be semantically segmented.
3. The streetscape image segmentation method according to claim 1, wherein the establishing of the bidirectional attention network segmentation model specifically comprises:
building a ResNet18 neural network as the semantic information extraction network using the PyTorch framework, and pre-training the ResNet18 neural network on the open source data set ImageNet to obtain and store the optimal parameters of the ResNet18 neural network;
replacing the last FC layer of an existing AlexNet network with a Sigmoid layer to obtain the spatial information extraction network, the Sigmoid layer being used to output the spatial information and semantic information;
converting the input features into three groups of vectors Q, K and V by linear transformation, and computing the attention weight matrix of the vectors Q and K; after the attention weight matrix is normalized with a ReLU function, weighting the vector V to complete the attention weighting;
and computing the cross attention between the spatial information and the semantic information, normalizing the cross attention as a weight applied to the semantic information, upsampling the weighted semantic information to the same size as the spatial information, and fusing the semantic information and spatial information by accumulation to obtain the multi-scale fusion module.
4. The streetscape image segmentation method according to claim 3, wherein after the semantic information extraction network and the spatial information extraction network are obtained, the semantic information extraction network is trained by a Dice Loss function, and the spatial information extraction network is trained by a cross entropy Loss function.
5. The street view image segmentation method according to claim 4, wherein:
the two-class form of the Dice Loss function is expressed as:

$$L_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$

wherein X and Y respectively represent the input and output of the network, |X ∩ Y| represents the intersection between X and Y, and |X| and |Y| respectively represent the number of pixel points in the sets X and Y;

the two-class form of the cross entropy loss function is defined as:

$$L_{CE} = -\sum_{i}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$$

wherein p represents the predicted probability of each street view feature class, $y_i$ represents the label of sample i (1 for the positive class and 0 for the negative class), and $p_i$ represents the probability that sample i is predicted as the positive class;

the final loss function is then expressed as:

$$L(X; W) = L_y(X; W) + \sum_{n=1}^{N} L_{s_n}(X; W)$$
6. The streetscape image segmentation method according to claim 2, wherein after the step of acquiring the city street video by using the vehicle-mounted camera and extracting a plurality of city street images at different times from the city street video data according to a preset frequency, the method further comprises:
and expanding each extracted urban street image by rotation through a random angle, horizontal or vertical flipping, and random cropping.
7. The streetscape image segmentation method of claim 2, wherein the streetscape feature classes to be semantically segmented include highways, sidewalks, parking lots, railways, people, cars, trucks, buses, trains, motorcycles, bicycles, caravans, trailers, buildings, walls, fences, guardrails, bridges, tunnels, poles, traffic signs, traffic lights, vegetation, sky, and others.
8. The street view image segmentation method as claimed in claim 2, wherein the annotation format is the JSON format of the COCO data set.
9. The streetscape image segmentation method according to any one of claims 1 to 8, further comprising, before the step of training the bi-directional attention network segmentation model using the training data set:
and performing data enhancement processing on the training data set, including flipping transformation, color jittering, translation transformation and contrast transformation.
10. A street view image segmentation system based on a bidirectional attention network, comprising:
the training data set construction module is used for acquiring image data of urban streets and constructing a training data set after preprocessing;
the model establishing module is used for establishing a bidirectional attention network segmentation model; the bidirectional attention network segmentation model comprises a multi-scale fusion module, a semantic information extraction network with an attention mechanism and a spatial information extraction network with an attention mechanism, wherein the multi-scale fusion module is used for fusing the semantic information extracted by the semantic information extraction network and the spatial information extracted by the spatial information extraction network; the bidirectional attention network segmentation model takes a street view image as input and takes the prediction map corresponding to the street view image as output; the prediction map is an image in which different street view features are marked with different colors, and street view features of the same type are marked with the same color;
The model training module is used for training the bidirectional attention network segmentation model by utilizing the training data set to obtain a trained bidirectional attention network segmentation model;
the model application module is used for collecting street view images to be segmented and inputting them into the trained bidirectional attention network segmentation model to obtain the prediction map corresponding to each street view image to be segmented;
and the prediction map and original image superposition module is used for superimposing the prediction map on the street view image to be segmented to generate the final street view segmentation overlay image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210236443.5A (CN114677507A) | 2022-03-11 | 2022-03-11 | Street view image segmentation method and system based on bidirectional attention network
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210236443.5A (CN114677507A) | 2022-03-11 | 2022-03-11 | Street view image segmentation method and system based on bidirectional attention network
Publications (1)
Publication Number | Publication Date |
---|---|
CN114677507A (en) | 2022-06-28
Family
ID=82071665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210236443.5A (Pending) | Street view image segmentation method and system based on bidirectional attention network | 2022-03-11 | 2022-03-11
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114677507A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114998744A (en) * | 2022-07-18 | 2022-09-02 | 中国农业大学 | Agricultural machinery track field segmentation method based on motion and vision dual-feature fusion |
CN115170970A (en) * | 2022-08-02 | 2022-10-11 | 重庆市设计院有限公司 | Method for detecting damage of urban street landscape |
CN115170970B (en) * | 2022-08-02 | 2024-04-23 | 重庆市设计院有限公司 | Method for detecting urban street landscape damage |
CN115100491A (en) * | 2022-08-25 | 2022-09-23 | 山东省凯麟环保设备股份有限公司 | Abnormal robust segmentation method and system for complex automatic driving scene |
US11954917B2 (en) | 2022-08-25 | 2024-04-09 | Shandong Kailin Environmental Protection Equipment Co., Ltd. | Method of segmenting abnormal robust for complex autonomous driving scenes and system thereof |
CN115761519A (en) * | 2022-09-21 | 2023-03-07 | 清华大学 | Index prediction method, index prediction device, index prediction apparatus, storage medium, and program product |
CN116168132B (en) * | 2022-12-12 | 2023-12-22 | 北京百度网讯科技有限公司 | Street view reconstruction model acquisition method, device, equipment and medium |
CN116168132A (en) * | 2022-12-12 | 2023-05-26 | 北京百度网讯科技有限公司 | Street view reconstruction model acquisition method, device, equipment and medium |
CN115601605A (en) * | 2022-12-13 | 2023-01-13 | 齐鲁空天信息研究院(Cn) | Surface feature classification method, device, equipment, medium and computer program product |
CN116484802B (en) * | 2023-06-20 | 2023-09-05 | 苏州浪潮智能科技有限公司 | Character string color marking method, device, computer equipment and storage medium |
CN116484802A (en) * | 2023-06-20 | 2023-07-25 | 苏州浪潮智能科技有限公司 | Character string color marking method, device, computer equipment and storage medium |
CN116978011A (en) * | 2023-08-23 | 2023-10-31 | 广州新华学院 | Image semantic communication method and system for intelligent target recognition |
CN116978011B (en) * | 2023-08-23 | 2024-03-15 | 广州新华学院 | Image semantic communication method and system for intelligent target recognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114677507A (en) | Street view image segmentation method and system based on bidirectional attention network | |
CN111368687B (en) | Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation | |
CN109977812B (en) | Vehicle-mounted video target detection method based on deep learning | |
CN110263786B (en) | Road multi-target identification system and method based on feature dimension fusion | |
CN113052159B (en) | Image recognition method, device, equipment and computer storage medium | |
CN112070049B (en) | Semantic segmentation method under automatic driving scene based on BiSeNet | |
CN115985104A (en) | Traffic flow prediction device, prediction method and prediction model construction method | |
CN114782949B (en) | Traffic scene semantic segmentation method for boundary guide context aggregation | |
CN116630702A (en) | Pavement adhesion coefficient prediction method based on semantic segmentation network | |
CN114898243A (en) | Traffic scene analysis method and device based on video stream | |
CN114049532A (en) | Risk road scene identification method based on multi-stage attention deep learning | |
CN117710764A (en) | Training method, device and medium for multi-task perception network | |
CN106529391B (en) | A kind of speed limit road traffic sign detection of robust and recognition methods | |
CN115115915A (en) | Zebra crossing detection method and system based on intelligent intersection | |
CN110852157A (en) | Deep learning track line detection method based on binarization network | |
CN112634289B (en) | Rapid feasible domain segmentation method based on asymmetric void convolution | |
CN112785610B (en) | Lane line semantic segmentation method integrating low-level features | |
CN111160282B (en) | Traffic light detection method based on binary Yolov3 network | |
CN117058641A (en) | Panoramic driving perception method based on deep learning | |
CN116704194A (en) | Street view image segmentation algorithm based on BiSeNet network and attention mechanism | |
CN116310748A (en) | Automatic driving scene recovery and automatic driving prototype testing method and system | |
CN115797884A (en) | Vehicle weight identification method based on human-like visual attention weighting | |
CN115661786A (en) | Small rail obstacle target detection method for area pre-search | |
CN112800920B (en) | Bus active safety early warning method based on multi-mode knowledge reasoning | |
CN114550094A (en) | Method and system for flow statistics and manned judgment of tricycle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |