CN115063446A - City street view instance segmentation method for a driving assistance system - Google Patents

City street view instance segmentation method for a driving assistance system

Info

Publication number
CN115063446A
Authority
CN
China
Prior art keywords
feature
network
layer
street view
city street
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210517170.1A
Other languages
Chinese (zh)
Other versions
CN115063446B (en)
Inventor
林珊玲
赵敬伟
林志贤
郭太良
叶芸
张永爱
林坚普
梅婷
王利翔
吴宇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202210517170.1A
Publication of CN115063446A
Application granted
Publication of CN115063446B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a city street view instance segmentation method for a driving assistance system. On the residual network of a city street view instance segmentation model, three zigzag hybrid dilated convolution structures with different dilation-rate combinations are fused into the C4 and C5 feature layers to obtain the preliminary feature extraction network; a bottom feature layer P2 and a feature information fusion network N2-N6 are added to the feature pyramid to obtain more low-level feature information; an attention network is inserted before the prediction network to enhance useful features and weaken the influence of irrelevant features; the model is then trained end to end and used for early-warning processing. The city street view instance segmentation model comprises a feature enhancement network, a preliminary feature extraction network, a feature pyramid enhancement network, a mask acquisition network, an attention network and a prediction network connected in sequence. With this technical scheme, the method completes city street view instance segmentation while segmenting small targets and occluded objects more accurately.

Description

City street view instance segmentation method for a driving assistance system
Technical Field
The invention relates to the technical field of image processing, and in particular to a city street view instance segmentation method for a driving assistance system.
Background
City street view detection is a core algorithm of intelligent driver assistance systems. The accuracy and real-time performance of pedestrian and vehicle detection directly affect the assistance system, and with it vehicle safety: erroneous detections mislead the system's judgment and create safety hazards. Instance segmentation separates the pixel regions of the individual objects in a scene image and determines the category to which each pixel region belongs. It is the computer vision task closest to real human visual perception and has particularly high application value in autonomous driving, for example detecting lane lines, pedestrians, vehicles and obstacles through instance segmentation to guide the vehicle.
Researchers worldwide have proposed many schemes ranging from image processing to instance segmentation. Most of the strong algorithms among them start from a two-stage object detection pipeline: features are selected on a shared feature layer through region proposals, and the objects in the image are then classified and recognized. In the past two years, several competitive algorithms have instead started from single-stage detection and dispensed with the region proposal network (RPN) used in two-stage detection.
YOLACT is currently one of the better single-stage algorithms, with high speed and real-time performance. In complex scenes (for example with occlusion or small target objects), however, the way it combines object detection and recognition with semantic segmentation limits its accuracy, and the results are not fully satisfactory.
Disclosure of Invention
In view of this, the present invention provides a city street view instance segmentation method for a driving assistance system that completes city street view instance segmentation while segmenting small targets and occluded objects more accurately.
In order to achieve this purpose, the invention adopts the following technical scheme. A city street view instance segmentation method for a driving assistance system: on the residual network of a city street view instance segmentation model, three zigzag hybrid dilated convolution structures with different dilation-rate combinations are fused into the C4 and C5 feature layers to obtain the preliminary feature extraction network; a bottom feature layer P2 and a feature information fusion network N2-N6 are added to the feature pyramid to obtain more low-level feature information; an attention network is inserted before the prediction network to enhance useful features and weaken the influence of irrelevant features; the model is then trained end to end and used for early-warning processing. The city street view instance segmentation model comprises a feature enhancement network, a preliminary feature extraction network, a feature pyramid enhancement network, a mask acquisition network, an attention network and a prediction network connected in sequence.
In a preferred embodiment: fusing the three zigzag hybrid dilated convolution structures with different dilation-rate combinations into the C4 and C5 feature layers specifically comprises the following. C4 contains 23 blocks: the first block uses a 3×3 convolution kernel with stride 2 to convolve the feature map produced by C3, halving its width and height; next come seven groups of three blocks each, with 3×3 kernels, stride 1 and dilation rates 1, 2 and 3; the resulting feature information is then passed into a block with a 3×3 kernel, stride 2 and dilation rate 2. C5 contains 3 blocks: the first block uses a 3×3 convolution kernel with stride 2 to convolve the feature map produced by C4, halving its width and height; the resulting feature information is then passed into two blocks with 3×3 kernels, stride 1 and dilation rates 2 and 3, respectively.
In a preferred embodiment: the preliminary feature extraction network is based on the deep residual network ResNet-101.
In a preferred embodiment: the attention network consists of two branches; the first branch is the input feature map itself, and the second branch extracts per-channel weights for the feature map. The input feature map is first globally pooled, then passed through two 1×1 convolutions with ReLU activation, and then normalized by a sigmoid function that maps the feature values into (0, 1); finally the corresponding channels of the two branches are multiplied to obtain the output feature map of the attention network.
In a preferred embodiment: end-to-end training means labeling the pixels of the configured categories in the images and the positions of the target instances of those categories, taking the labeled data as the training set, and feeding it into the city street view instance segmentation model for learning and training, which yields the trained city street view instance segmentation model.
In a preferred embodiment: the early-warning processing comprises feeding the image or video data to be detected into the trained city street view instance segmentation model, obtaining the category, position and segmentation result of the relevant instances in the image, judging the road condition from these results, and issuing early-warning information.
In a preferred embodiment: feature enhancement preprocesses the input image or video data, specifically by applying random Gaussian denoising, random contrast enhancement and random cropping to the input image data.
In a preferred embodiment: the feature pyramid enhancement network comprises the feature pyramid network FPN, the bottom feature layer P2 and the bottom-up feature information fusion network N2-N6. The feature pyramid network upsamples the topmost feature map C5 of the preliminary feature extraction network and adds it to the second-highest feature map C4 to form pyramid layer P4; each layer of the feature pyramid is built in this way from top to bottom. The newly added lowest feature map P2 is downsampled by the feature information fusion network and added to the second-lowest feature map P3 of the pyramid path to form layer N3 of the fusion network; each fusion layer is built in this way from bottom to top.
In a preferred embodiment: the mask acquisition network extracts masks from the bottom feature map N2 of the feature pyramid enhancement network.
In a preferred embodiment: the prediction network classifies and localizes the feature information output by the attention network and then linearly combines it with the mask information output by the mask acquisition network to obtain the final image segmentation result.
Compared with the prior art, the invention has the following beneficial effects:
(1) when small targets appear in the image or occlusion is present, the method can still segment the small targets and the occluded objects effectively;
(2) on the basis of the YOLACT algorithm, the preliminary feature extraction network fuses in three zigzag hybrid dilated convolution structures with different dilation-rate combinations, which avoids losing image information to downsampling while extracting features and enlarges the receptive field without changing the convolution kernel, improving the capture of targets and the segmentation precision; on the basis of the feature pyramid network, a bottom feature layer and a feature information fusion network are added, raising the utilization of low-level feature information and folding it effectively into the top-level features, while the attention network lets the model pick up the main information in the image more efficiently.
Drawings
FIG. 1 is a schematic diagram of the city street view instance segmentation model framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the hybrid dilated convolution structure of C4 in the preferred embodiment;
FIG. 3 is a schematic diagram of the hybrid dilated convolution structure of C5 in the preferred embodiment;
FIG. 4 is a schematic diagram of the connection steps of the feature pyramid enhancement network in the preferred embodiment;
FIG. 5 is a schematic diagram of the attention network structure of the preferred embodiment.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein serves only to describe particular embodiments and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and the terms "comprises" and/or "comprising" specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
A city street view instance segmentation method for a driving assistance system, with reference to FIGS. 1 to 5: on the residual network of a city street view instance segmentation model, the C4 and C5 feature layers are fused with three zigzag hybrid dilated convolution structures with different dilation-rate combinations to obtain the preliminary feature extraction network; a bottom feature layer P2 and a feature information fusion network N2-N6 are added to the feature pyramid to obtain more low-level feature information; an attention network is inserted before the prediction network to enhance useful features and weaken the influence of irrelevant features; the model is then trained end to end and used for early-warning processing. The city street view instance segmentation model comprises a feature enhancement network, a preliminary feature extraction network, a feature pyramid enhancement network, a mask acquisition network, an attention network and a prediction network connected in sequence.
Fusing the three zigzag hybrid dilated convolution structures with different dilation-rate combinations into the C4 and C5 feature layers specifically comprises the following. C4 contains 23 blocks: the first block uses a 3×3 convolution kernel with stride 2 to convolve the feature map produced by C3, halving its width and height; next come seven groups of three blocks each, with 3×3 kernels, stride 1 and dilation rates 1, 2 and 3; the resulting feature information is then passed into a block with a 3×3 kernel, stride 2 and dilation rate 2. C5 contains 3 blocks: the first block uses a 3×3 convolution kernel with stride 2 to convolve the feature map produced by C4, halving its width and height; the resulting feature information is then passed into two blocks with 3×3 kernels, stride 1 and dilation rates 2 and 3, respectively.
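As a reading aid, the minimal PyTorch sketch below lays out the stride/dilation schedule just described. It is an illustration under assumptions rather than the patent's implementation: the patent's blocks are presumably ResNet-101 bottleneck units, whereas plain 3×3 convolution blocks with assumed BatchNorm/ReLU and channel widths are used here only to show how the 23 blocks of C4 and the 3 blocks of C5 are arranged.
```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1, dilation=1):
    # One "block": a 3x3 convolution with the given stride and dilation rate,
    # padded so that stride-1 blocks preserve the spatial size, plus BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def make_c4(in_ch=512, ch=1024):
    # 23 blocks in total: 1 stride-2 entry block, 7 groups with the zigzag
    # dilation pattern (1, 2, 3), and a final stride-2 block with dilation 2.
    blocks = [conv_block(in_ch, ch, stride=2)]          # halves H and W
    for _ in range(7):
        for rate in (1, 2, 3):
            blocks.append(conv_block(ch, ch, dilation=rate))
    blocks.append(conv_block(ch, ch, stride=2, dilation=2))
    return nn.Sequential(*blocks)

def make_c5(in_ch=1024, ch=2048):
    # 3 blocks: a stride-2 entry block, then dilation rates 2 and 3.
    return nn.Sequential(
        conv_block(in_ch, ch, stride=2),                # halves H and W
        conv_block(ch, ch, dilation=2),
        conv_block(ch, ch, dilation=3),
    )
```
Because the padding equals the dilation rate, the stride-1 blocks preserve the spatial size and only the stride-2 blocks halve the feature map; the zigzag 1-2-3 pattern is the usual hybrid-dilated-convolution remedy for the gridding artifacts of a fixed dilation rate.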
As shown in FIG. 2 and FIG. 3, this avoids losing much image information to downsampling while obtaining image features, and yields a larger receptive field and denser image information without changing the convolution kernel. The network convolves the input images layer by layer from bottom to top to obtain feature maps C1-C5 of different sizes.
The preliminary feature extraction network is based on the deep residual network ResNet-101.
The attention network consists of two branches; the first branch is the input feature map itself, and the second branch extracts per-channel weights for the feature map. The input feature map is first globally pooled, then passed through two 1×1 convolutions with ReLU activation, and then normalized by a sigmoid function that maps the feature values into (0, 1); finally the corresponding channels of the two branches are multiplied to obtain the output feature map of the attention network.
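Concretely, this two-branch design matches the familiar squeeze-and-excitation pattern. The sketch below is a hedged reading of the text: the channel-reduction ratio of the two 1×1 convolutions and the placement of the ReLU between them are assumptions, since the patent fixes only the sequence of global pooling, two 1×1 convolutions with ReLU, sigmoid normalization, and channel-wise multiplication.
```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Branch 1 is the input feature map itself; branch 2 squeezes it into
    per-channel weights via global pooling, two 1x1 convolutions and a
    sigmoid, and the two branches are then multiplied channel by channel."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.weights = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global pooling
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),                        # ReLU activation
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                 # map weights into (0, 1)
        )

    def forward(self, x):
        return x * self.weights(x)   # (N, C, H, W) * (N, C, 1, 1) broadcast
```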
End-to-end training means labeling the pixels of the configured categories in the images and the positions of the target instances of those categories, taking the labeled data as the training set, and feeding it into the city street view instance segmentation model for learning and training, which yields the trained model. The invention uses the high-quality, pixel-level annotated urban street scene dataset Cityscapes as the training set for end-to-end training of the city street view instance segmentation model. The dataset mainly contains street scenes from 50 different cities, with 5000 high-quality pixel-level annotated images of driving scenes in urban environments, of which 2975 are used for training, 500 for validation and 1525 for testing, covering 19 categories such as people and vehicles.
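The patent does not name a data pipeline; for orientation, one common way to load exactly this split is torchvision's Cityscapes dataset class, as in the placeholder sketch below (the root path is assumed, and any downstream training loop is out of scope here).
```python
from torchvision.datasets import Cityscapes

# Fine pixel-level instance annotations; "train"/"val"/"test" correspond to
# the 2975/500/1525 split described above. The root path is a placeholder.
train_set = Cityscapes("data/cityscapes", split="train",
                       mode="fine", target_type="instance")
val_set = Cityscapes("data/cityscapes", split="val",
                     mode="fine", target_type="instance")

img, inst_mask = train_set[0]   # PIL image and its instance-id mask
```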
The early-warning processing comprises feeding the image or video data to be detected into the trained city street view instance segmentation model, obtaining the category, position and segmentation result of the relevant instances in the image, judging the road condition from these results, and issuing early-warning information. The method is evaluated on the test split of the high-quality annotated Cityscapes dataset: test images are fed into the trained city street view instance segmentation model to obtain the category, position and segmentation result of the required instances, and the road condition is then judged from these instances so that warnings can be issued for pedestrians, vehicles and similar situations.
Feature enhancement preprocesses the input image or video data, specifically by applying random Gaussian denoising, random contrast enhancement and random cropping to the input image data, so as to make the relevant information easier to detect, strengthen the local detail features of the captured images, and help the network capture deep image features.
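A minimal sketch of such a preprocessing step follows, assuming OpenCV and NumPy. The application probabilities, the 3×3 Gaussian kernel (blurring used as a simple denoiser), the contrast gain range and the crop size are illustrative choices, not values from the patent; for instance segmentation the same crop must also be applied to the label masks.
```python
import random
import cv2
import numpy as np

def feature_enhance(image: np.ndarray, crop_hw=(512, 1024)) -> np.ndarray:
    """Random Gaussian denoising, random contrast enhancement and random
    cropping of one BGR uint8 image."""
    if random.random() < 0.5:                      # random Gaussian denoising
        image = cv2.GaussianBlur(image, (3, 3), 0)
    if random.random() < 0.5:                      # random contrast enhancement
        alpha = random.uniform(1.0, 1.5)           # contrast gain
        mean = image.mean()
        image = np.clip(alpha * (image.astype(np.float32) - mean) + mean,
                        0, 255).astype(np.uint8)
    h, w = image.shape[:2]                         # random crop
    ch, cw = min(crop_hw[0], h), min(crop_hw[1], w)
    y, x = random.randint(0, h - ch), random.randint(0, w - cw)
    return image[y:y + ch, x:x + cw]
```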
The feature pyramid enhancement network comprises the feature pyramid network FPN, the bottom feature layer P2 and the bottom-up feature information fusion network N2-N6. The feature pyramid network upsamples the topmost feature map C5 (which becomes P5) of the preliminary feature extraction network and adds it to the second-highest feature map C4 to form pyramid layer P4; each layer of the feature pyramid is built in this way from top to bottom. Low-level semantic information contains more image detail, and making greater use of it improves the recognition of small targets, which is why the P2 feature layer is added. The newly added lowest feature map P2 (which becomes N2) is downsampled by the feature information fusion network and added to the second-lowest feature map P3 of the pyramid path to form layer N3 of the fusion network; each fusion layer is built in this way from bottom to top, yielding the feature maps N2-N6, as shown in FIG. 4 and in the wiring sketch below.
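The sketch shows one way to wire the top-down pyramid P2-P5 and the bottom-up fusion path N2-N6 together in PyTorch. The 1×1 lateral convolutions, nearest-neighbor upsampling, stride-2 3×3 downsampling convolutions, the common channel width, and deriving N6 by downsampling N5 are all assumptions filling in details the text leaves open.
```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidEnhancement(nn.Module):
    def __init__(self, c_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        # 1x1 lateral convolutions bring C2..C5 to a common channel width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in c_channels)
        # Stride-2 3x3 convolutions downsample along the bottom-up path.
        self.down = nn.ModuleList(
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1) for _ in range(4))

    def forward(self, c2, c3, c4, c5):
        # Top-down pyramid: upsample the higher level, add the lateral map.
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2)
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2)
        # Bottom-up fusion: downsample the lower level, add the pyramid map.
        n2 = p2
        n3 = p3 + self.down[0](n2)
        n4 = p4 + self.down[1](n3)
        n5 = p5 + self.down[2](n4)
        n6 = self.down[3](n5)       # extra top level, giving N2..N6
        return n2, n3, n4, n5, n6
```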
The mask acquisition network extracts masks from the bottom feature map N2 of the feature pyramid enhancement network.
The prediction network classifies and localizes the feature information output by the attention network and then linearly combines it with the mask information output by the mask acquisition network to obtain the final image segmentation result.
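This linear combination of prediction outputs with the mask information reads like the YOLACT-style assembly of prototype masks with per-instance coefficients; under that assumption, the step might look like this sketch:
```python
import torch

def assemble_masks(prototypes: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """prototypes: (k, H, W) masks from the mask acquisition network.
    coeffs: (n, k) per-instance coefficients from the prediction network.
    Returns (n, H, W) instance masks."""
    k, h, w = prototypes.shape
    masks = coeffs @ prototypes.reshape(k, h * w)   # the linear combination
    return torch.sigmoid(masks).reshape(-1, h, w)
```
Cropping each assembled mask to its predicted box and thresholding, as YOLACT does, would follow this step.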
The mask acquisition network and the prediction network do not use the RPN and feature-realignment steps of two-stage object detection; no spatial semantic information is lost in the mask acquisition network, and segmentation is performed with features that carry global semantic information, which makes the final segmentation result more robust. The overall computational cost is comparatively low, and the efficient real-time character of SSD-style single-shot detection is retained.

Claims (10)

1. A city street view instance segmentation method for a driving assistance system, characterized in that, on the residual network of a city street view instance segmentation model, the C4 and C5 feature layers are fused with three zigzag hybrid dilated convolution structures with different dilation-rate combinations to obtain a preliminary feature extraction network; a bottom feature layer P2 and a feature information fusion network N2-N6 are added to the feature pyramid to obtain more low-level feature information; an attention network is added before the prediction network to enhance useful features and weaken the influence of irrelevant features; and the model is then trained end to end and used for early-warning processing; the city street view instance segmentation model comprises a feature enhancement network, a preliminary feature extraction network, a feature pyramid enhancement network, a mask acquisition network, an attention network and a prediction network connected in sequence.
2. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: fusing the three zigzag hybrid dilated convolution structures with different dilation-rate combinations into the C4 and C5 feature layers specifically comprises: C4 contains 23 blocks; the first block uses a 3×3 convolution kernel with stride 2 to convolve the feature map produced by C3, halving its width and height; next come seven groups of three blocks each, with 3×3 kernels, stride 1 and dilation rates 1, 2 and 3; the resulting feature information is then passed into a block with a 3×3 kernel, stride 2 and dilation rate 2; C5 contains 3 blocks; the first block uses a 3×3 convolution kernel with stride 2 to convolve the feature map produced by C4, halving its width and height; the resulting feature information is then passed into two blocks with 3×3 kernels, stride 1 and dilation rates 2 and 3, respectively.
3. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the preliminary feature extraction network is based on the deep residual network ResNet-101.
4. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the attention network consists of two branches; the first branch is the input feature map itself, and the second branch extracts per-channel weights for the feature map; the input feature map is first globally pooled, then passed through two 1×1 convolutions with ReLU activation, and then normalized by a sigmoid function that maps the feature values into (0, 1); finally the corresponding channels of the two branches are multiplied to obtain the output feature map of the attention network.
5. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the end-to-end training labels the pixels of the configured categories in the images and the positions of the target instances of those categories, takes the labeled data as the training set, and feeds it into the city street view instance segmentation model for learning and training, yielding the trained city street view instance segmentation model.
6. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the early-warning processing comprises feeding the image or video data to be detected into the trained city street view instance segmentation model, obtaining the category, position and segmentation result of the relevant instances in the image, judging the road condition from these results, and issuing early-warning information.
7. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the feature enhancement preprocesses the input image or video data, specifically by applying random Gaussian denoising, random contrast enhancement and random cropping to the input image data.
8. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the feature pyramid enhancement network comprises the feature pyramid network FPN, the bottom feature layer P2 and the bottom-up feature information fusion network N2-N6; the feature pyramid network upsamples the topmost feature map C5 of the preliminary feature extraction network and adds it to the second-highest feature map C4 to form pyramid layer P4, each pyramid layer being built in this way from top to bottom; the newly added lowest feature map P2 is downsampled by the feature information fusion network and added to the second-lowest feature map P3 of the pyramid path to form layer N3 of the fusion network, each fusion layer being built in this way from bottom to top.
9. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the mask acquisition network extracts masks from the bottom feature map N2 of the feature pyramid enhancement network.
10. The city street view instance segmentation method of the driving assistance system according to claim 1, wherein: the prediction network classifies and localizes the feature information output by the attention network and then linearly combines it with the mask information output by the mask acquisition network to obtain the final image segmentation result.
CN202210517170.1A 2022-05-12 2022-05-12 Urban street view instance segmentation method for a driving assistance system Active CN115063446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210517170.1A CN115063446B (en) 2022-05-12 2022-05-12 Urban street view instance segmentation method for a driving assistance system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210517170.1A CN115063446B (en) 2022-05-12 2022-05-12 Urban street view instance segmentation method for a driving assistance system

Publications (2)

Publication Number Publication Date
CN115063446A (en) 2022-09-16
CN115063446B (en) 2024-10-11

Family

ID=83198395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210517170.1A Active CN115063446B (en) 2022-05-12 2022-05-12 Urban street view instance segmentation method for a driving assistance system

Country Status (1)

Country Link
CN (1) CN115063446B (en)

Citations (6)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190057507A1 (en) * 2017-08-18 2019-02-21 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110348445A (en) * 2019-06-06 2019-10-18 华中科技大学 A kind of example dividing method merging empty convolution sum marginal information
CN113642390A (en) * 2021-07-06 2021-11-12 西安理工大学 Street view image semantic segmentation method based on local attention network
CN114202550A (en) * 2021-11-24 2022-03-18 重庆邮电大学 Brain tumor MRI image three-dimensional segmentation method based on RAPNet network
CN114359297A (en) * 2022-01-04 2022-04-15 浙江大学 Attention pyramid-based multi-resolution semantic segmentation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡云卿; 潘文波; 侯志超; 金伟正; 于欢: "Research on DSC-MB-PSPNet semantic segmentation technology for street-view autonomous driving" (面向街景自动驾驶的DSC-MB-PSPNet语义分割技术研究), Control and Information Technology (控制与信息技术), no. 04, 31 December 2020 (2020-12-31), pages 4-12 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934690A (en) * 2024-03-25 2024-04-26 全屋优品科技(深圳)有限公司 Household soft management method, device, equipment and storage medium
CN117934690B (en) * 2024-03-25 2024-06-07 全屋优品科技(深圳)有限公司 Household soft management method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115063446B (en) 2024-10-11

Similar Documents

Publication Publication Date Title
WO2022083784A1 (en) Road detection method based on internet of vehicles
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN112487862A (en) Garage pedestrian detection method based on improved EfficientDet model
CN113657409A (en) Vehicle loss detection method, device, electronic device and storage medium
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN112861619A (en) Model training method, lane line detection method, equipment and device
CN112446292B (en) 2D image salient object detection method and system
CN114092917B (en) MR-SSD-based shielded traffic sign detection method and system
CN111967396A (en) Processing method, device and equipment for obstacle detection and storage medium
CN113361528B (en) Multi-scale target detection method and system
CN111098850A (en) Automatic parking auxiliary system and automatic parking method
CN113609980A (en) Lane line sensing method and device for automatic driving vehicle
CN115063446A (en) City street view example segmentation method of driving assistance system
CN114037834B (en) Semantic segmentation method and device based on fusion of vibration signal and RGB image
CN118397485A (en) Lightweight unmanned aerial vehicle image target detection method and system
CN111881914B (en) License plate character segmentation method and system based on self-learning threshold
CN111126406B (en) Vehicle driving area identification method and device
CN117612117A (en) Roadside near weed segmentation method, system and medium based on vehicle-mounted recorder
CN117237612A (en) Method for detecting complex road scene target based on YOLOX model
CN112818858A (en) Rainy day traffic video saliency detection method based on double-channel visual mechanism
CN115690787A (en) Semantic segmentation method, image processing apparatus, and computer-readable storage medium
CN112733934B (en) Multi-mode feature fusion road scene semantic segmentation method in complex environment
TW202326624A (en) Embedded deep learning multi-scale object detection model using real-time distant region locating device and method thereof
CN113536973A (en) Traffic sign detection method based on significance
CN112906691A (en) Distance measuring method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant