CN114419058A - Image semantic segmentation model training method for traffic road scene - Google Patents

Image semantic segmentation model training method for traffic road scene

Info

Publication number
CN114419058A
Authority
CN
China
Prior art keywords
semantic segmentation
training
network
image
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210103540.7A
Other languages
Chinese (zh)
Inventor
张帆
曹松
任必为
宋君
陶海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Vion Intelligent Technology Co ltd
Original Assignee
Beijing Vion Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Vion Intelligent Technology Co ltd
Priority to CN202210103540.7A
Publication of CN114419058A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Abstract

The invention provides an image semantic segmentation model training method for traffic road scenes, which comprises the following steps: constructing a semantic segmentation basic model, adjusting the structure of its basic network to form a semantic segmentation initial model, and training the semantic segmentation initial model with a sample image training set of traffic road scenes to obtain an image semantic segmentation model. The invention solves the problem in the prior art that, because the upsampling operator of the upsampling module of an image semantic segmentation model is calculated in nearest-neighbor interpolation mode, the feature map output by the upsampling module loses a large amount of pixel information compared with the original input image, which degrades the semantic segmentation performance of the image semantic segmentation model and leads to poor accuracy of the final image semantic segmentation result.

Description

Image semantic segmentation model training method for traffic road scene
Technical Field
The invention relates to the technical field of computer vision image processing, in particular to an image semantic segmentation model training method for a traffic road scene.
Background
Image semantic segmentation is one of the core research problems in the field of computer vision; it aims to assign a label to each pixel of an input image, i.e., to perform an object classification task at the pixel level.
In traffic road scenes, image semantic segmentation technology is widely applied: by accurately analyzing and distinguishing objects such as drivable areas, pedestrians and vehicles, it makes information perception in the traffic road scene possible. In the prior art, in order to ensure good deployment adaptability between the image semantic segmentation model and the traffic road scene analysis platform, the upsampling module of the image semantic segmentation model usually performs its calculation in nearest-neighbor interpolation mode; however, this operator calculation mode causes the feature map output by the upsampling module to lose a large amount of pixel information compared with the original input image, which in turn reduces the semantic segmentation performance of the image semantic segmentation model, so that the accuracy of the final image semantic segmentation result is poor.
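For illustration only (not part of the patent), the following minimal PyTorch sketch shows why nearest-neighbor upsampling tends to produce the blocky, information-poor feature maps described above: each output value is simply a copy of the nearest input value. The tensor values are made up for the example.

```python
import torch
import torch.nn.functional as F

# A tiny 1x1x2x2 "feature map" with made-up values.
feat = torch.tensor([[[[0.0, 1.0],
                       [2.0, 3.0]]]])

# Nearest-neighbor upsampling (the mode kept for deployability): each output
# pixel copies its nearest input value, so no new detail is created and
# object boundaries become blocky.
up_nearest = F.interpolate(feat, scale_factor=2, mode="nearest")
print(up_nearest)
# tensor([[[[0., 0., 1., 1.],
#           [0., 0., 1., 1.],
#           [2., 2., 3., 3.],
#           [2., 2., 3., 3.]]]])

# Bilinear upsampling, for contrast, blends neighboring values and gives
# smoother boundaries, but is harder to port across inference frameworks.
up_bilinear = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
```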
Disclosure of Invention
The invention mainly aims to provide an image semantic segmentation model training method for traffic road scenes, so as to solve the problem in the prior art that, because the upsampling operator of the upsampling module of an image semantic segmentation model is calculated in nearest-neighbor interpolation mode, the feature map output by the upsampling module loses a large amount of pixel information compared with the original input image, which degrades the semantic segmentation performance of the image semantic segmentation model and results in poor accuracy of the final image semantic segmentation result.
In order to achieve the above object, the present invention provides a training method of an image semantic segmentation model for a traffic road scene, which comprises: step S1, constructing a semantic segmentation basic model whose basic network combines a DeepLabV3plus network and a ResnetX network, wherein the upsampling operator of the DeepLabV3plus network is calculated in nearest-neighbor interpolation mode; step S2, adjusting the structure of the basic network to form a semantic segmentation initial model, the adjustment being as follows: copying the convolution module of the DeepLabV3plus network to the bottom of the DeepLabV3plus network as a bottom-layer convolution module, passing the input end of the basic network through a convolution layer group and merging it with the output end of the DeepLabV3plus network via a skip connection, taking the merged end as the input end of the bottom-layer convolution module, and taking the output of the bottom-layer convolution module as the final semantic segmentation result; and step S3, training the semantic segmentation initial model with a sample image training set of traffic road scenes to obtain the image semantic segmentation model.
Further, the convolution layer group includes one or more convolution layers, each convolution layer has a convolution kernel size of 3 × 3, a convolution step size of 1, and a padding value of 0 or 1.
Further, the convolution layer group comprises a plurality of convolution layers, and the number of the plurality of convolution layers is more than 1 and less than or equal to 3.
Further, the fill values include a fill width value and a fill height value.
Further, the structure of the convolution module of the DeepLabV3plus network and of the bottom-layer convolution module comprises, from top to bottom: a convolutional layer, a BN layer, a ReLU layer, and a convolutional layer.
Further, the sample image training set comprises a first training set and a second training set, wherein the training images in the first training set are selected, from the Audi large-scale autonomous driving data set A2D2, from the traffic road scene images captured along the road direction by the front vehicle-mounted image capturing device and the traffic road scene images captured along the road direction by the rear vehicle-mounted image capturing device; the training images in the second training set are expressway scene images.
Further, step S3 includes: step S31, pre-training the semantic segmentation initial model with the first training set to obtain a semantic segmentation pre-training model; and step S32, adjusting the learning rate of model training and continuing to train the semantic segmentation pre-training model with the second training set to obtain the image semantic segmentation model.
Further, in step S32, the ratio of the learning rate used when training the semantic segmentation pre-training model to the learning rate used when training the semantic segmentation initial model is within [1/5, 1/2].
Further, the ratio of the number of training images in the second training set to the number of training images in the first training set is within [1/10, 1].
Further, the ResnetX network in the basic network is one of a Resnet18 network, a Resnet34 network, a Resnet50 network, a Resnet101 network, and a Resnet152 network.
By applying the technical scheme of the invention: in the basic network of the image semantic segmentation model, the upsampling operator of the DeepLabV3plus network is calculated in nearest-neighbor interpolation mode, and the feature map in the DeepLabV3plus network is upsampled multiple times; during each interpolation of the feature map, an interpolation operation is performed between two adjacent feature values, and each interpolated point simply takes the feature value nearest to it, which causes pixel information loss and jagged boundaries in the segmented image finally output by the model and thus distortion of that segmented image. With the technical scheme of the invention, the upsampling operator of the upsampling module can keep the nearest-neighbor interpolation mode unchanged, so that the image semantic segmentation model can be deployed stably, conveniently and completely on a traffic road scene analysis platform. Furthermore, the input end of the basic network is passed through a convolution layer group and merged with the output end of the DeepLabV3plus network via a skip connection: the feature map containing the complete feature information of the input image and the feature map in the basic network that has lost part of its feature information through repeated upsampling calculations are stacked to obtain a merged feature map, and the bottom-layer convolution module continues feature learning on the merged feature map and outputs a feature map matching the output dimension of the basic network. This ensures that the finally output segmented image does not lose pixel information of the original input image during semantic segmentation, that the edges of the segmented image are continuous, and that the whole segmented image is clear, thereby effectively improving the semantic segmentation effect of the image semantic segmentation model on the input image.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 illustrates a flow diagram of a method for training an image semantic segmentation model for a traffic road scene in accordance with an alternative embodiment of the present invention;
FIG. 2 is a schematic diagram of the network structure of the basic network, DeepLabV3plus + Resnet50, of a semantic segmentation basic model according to an alternative embodiment of the present invention;
FIG. 3 is a schematic diagram of the network structure of the semantic segmentation initial model obtained after the structure of the basic network of the semantic segmentation basic model in FIG. 2 is adjusted.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description, claims and drawings of the present invention are used to distinguish between similar elements and are not necessarily used to describe a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the invention described herein can be implemented in an order other than that illustrated or described herein. Furthermore, the terms "comprises," "comprising," "includes," "including," "has," "having," and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
To solve the problem in the prior art that the upsampling operator of the upsampling module of an image semantic segmentation model is calculated in nearest-neighbor interpolation mode, which causes the output feature map to lose a large amount of pixel information compared with the original input image and degrades the accuracy of the segmentation result, the invention provides a training method of an image semantic segmentation model for traffic road scenes.
FIG. 1 is a flowchart of an image semantic segmentation model training method for a traffic road scene according to an embodiment of the invention. As shown in FIG. 1, the method comprises the following steps: step S1, constructing a semantic segmentation basic model whose basic network combines a DeepLabV3plus network and a ResnetX network, wherein the upsampling operator of the DeepLabV3plus network is calculated in nearest-neighbor interpolation mode; step S2, adjusting the structure of the basic network to form a semantic segmentation initial model, the adjustment being as follows: copying the convolution module of the DeepLabV3plus network to the bottom of the DeepLabV3plus network as a bottom-layer convolution module, passing the input end of the basic network through a convolution layer group and merging it with the output end of the DeepLabV3plus network via a skip connection, taking the merged end as the input end of the bottom-layer convolution module, and taking the output of the bottom-layer convolution module as the final semantic segmentation result; and step S3, training the semantic segmentation initial model with a sample image training set of traffic road scenes to obtain the image semantic segmentation model.
As shown in FIG. 2, in the basic network of the image semantic segmentation model, the upsampling operator of the DeepLabV3plus network is calculated in nearest-neighbor interpolation mode, and the feature map in the DeepLabV3plus network is upsampled multiple times. During each interpolation of the feature map, an interpolation operation is performed between two adjacent feature values, and each interpolated point simply takes the feature value nearest to it; as a result, the segmented image finally output by the model suffers pixel information loss and jagged boundaries, and the finally output segmented image is therefore distorted. With the technical scheme of the invention, the upsampling operator of the upsampling module can keep the nearest-neighbor interpolation mode unchanged, so that the image semantic segmentation model can be deployed stably, conveniently and completely on a traffic road scene analysis platform. Further, as shown in FIG. 3, the input end of the basic network is passed through a convolution layer group and merged with the output end of the DeepLabV3plus network via a skip connection: the feature map containing the complete feature information of the input image and the feature map in the basic network that has lost part of its feature information through repeated upsampling are stacked to obtain a merged feature map, and the bottom-layer convolution module continues feature learning on the merged feature map and outputs a feature map matching the output dimension of the basic network. This ensures that the finally output segmented image does not lose pixel information of the original input image during semantic segmentation, that the edges of the segmented image are continuous, and that the whole segmented image is clear, thereby effectively improving the semantic segmentation effect of the image semantic segmentation model on the input image.
It should be noted that, in the illustrated embodiment of the present invention, as shown in FIG. 2, the upsampling operators at three positions in the basic network of the image semantic segmentation model are calculated in nearest-neighbor interpolation mode: one position is inside the ASPP structure of the DeepLabV3plus network, and the other two are located, respectively, between the ASPP structure of the DeepLabV3plus network and the convolution module, and after the convolution module. Because of this, the image semantic segmentation model built under the PyTorch deep learning framework can be smoothly converted to other deep learning frameworks.
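As a purely illustrative note on this portability point (not something the patent specifies), a network whose upsampling uses only nearest-neighbor interpolation can typically be exported with PyTorch's standard ONNX exporter; the toy model, input size and file name below are assumptions:

```python
import torch
import torch.nn as nn

# Toy stand-in for a segmentation network: a convolution followed by a
# nearest-neighbor upsample, the operator kept for deployment portability.
toy_model = nn.Sequential(
    nn.Conv2d(3, 9, kernel_size=3, padding=1),
    nn.Upsample(scale_factor=4, mode="nearest"),
)
toy_model.eval()

dummy_input = torch.randn(1, 3, 128, 128)  # assumed input resolution
torch.onnx.export(toy_model, dummy_input, "toy_segmentation.onnx", opset_version=11)
```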
In the basic network of the image semantic segmentation model of this embodiment of the present invention, the upsampling module of the DeepLabV3plus network includes one upsampling operator, which implements the upsampling operation; of course, the upsampling module may also include a plurality of upsampling operators, or other operators that can likewise complete the upsampling operation, and basic networks in such other structural forms also fall within the protection scope of the present invention.
As shown in FIG. 3, the input end of the basic network is passed through a convolution layer group and connected, via a skip connection, to the output end of the DeepLabV3plus network: the feature map obtained from the input image without any downsampling after it enters the image semantic segmentation model is concatenated with the feature map obtained through upsampling, and the bottom-layer convolution module located at the bottom of the network structure of the image semantic segmentation model learns from the stacked features and refines the output boundary information, yielding a semantic segmentation result with high pixel accuracy and improving the semantic segmentation performance of the image semantic segmentation model.
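The following PyTorch sketch is one possible reading of this adjusted structure, offered only as an illustration under stated assumptions: the DeepLabV3plus + ResnetX backbone is stubbed out as an argument, and the module names, channel counts and the number of classes (9, from the embodiment below) are assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 9  # number of segmentation elements used in the embodiment below

class ConvModule(nn.Module):
    """Conv-BN-ReLU-Conv block, mirroring the convolution module structure
    described in the patent (from top to bottom: conv, BN, ReLU, conv)."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

class AdjustedSegModel(nn.Module):
    """Semantic segmentation initial model: a DeepLabV3plus + ResnetX backbone
    (stubbed), a convolution layer group applied to the raw input, a skip
    connection that concatenates both feature maps, and a copied bottom-layer
    convolution module that produces the final segmentation result."""
    def __init__(self, backbone, backbone_out_ch=NUM_CLASSES, skip_ch=16):
        super().__init__()
        self.backbone = backbone  # DeepLabV3plus + ResnetX, nearest-neighbor upsampling inside
        # Convolution layer group on the input image: 3x3 kernel, stride 1,
        # padding 1 so the spatial size of the input is preserved.
        self.conv_layer_group = nn.Sequential(
            nn.Conv2d(3, skip_ch, kernel_size=3, stride=1, padding=1),
        )
        # Bottom-layer convolution module: a copy of the backbone's convolution
        # module, re-instantiated for the concatenated channel count.
        self.bottom_conv = ConvModule(backbone_out_ch + skip_ch, 64, NUM_CLASSES)

    def forward(self, x):
        deep_feat = self.backbone(x)          # backbone output, upsampled to input size
        skip_feat = self.conv_layer_group(x)  # full-resolution features of the raw input
        merged = torch.cat([deep_feat, skip_feat], dim=1)  # skip-connection merge
        return self.bottom_conv(merged)       # final semantic segmentation map
```

For a quick shape check, any module that maps a 3-channel image to a `backbone_out_ch`-channel map of the same spatial size (for example `nn.Conv2d(3, 9, 3, padding=1)`) can stand in for the backbone.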
The feature map referred to above takes the form of a matrix. The feature map is a three-dimensional matrix denoted W × H × m, where m represents the number of segmentation elements in the semantic information contained in the feature map, and the value of m is the same for the feature maps obtained by the different basic semantic segmentation submodels. In the preferred embodiment of the invention, the number of segmentation elements in the semantic information of the output feature map is 9, and a semantic segmentation image is obtained after post-processing; the final output of the basic network is a feature map with the same width and height as the input image but with a different number of segmentation elements in its semantic information.
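As a small illustration of this post-processing step (not taken from the patent; the tensor layout follows PyTorch's usual N × C × H × W convention rather than the W × H × m notation above, and the image size is assumed):

```python
import torch

# Assumed model output: batch of 1, m = 9 segmentation elements, 512x512 image.
logits = torch.randn(1, 9, 512, 512)

# Post-processing: each pixel takes the class with the highest score, giving a
# label map with the same width and height as the input image.
label_map = logits.argmax(dim=1)  # shape (1, 512, 512), values in 0..8
```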
In step S3, the semantic segmentation initial model is trained using a sample image training set of traffic road scenes to obtain the image semantic segmentation model. The parameters of the image semantic segmentation model are corrected according to the predicted semantic segmentation result of each training image and its pre-annotated semantic segmentation information. These training steps are executed iteratively with a plurality of training images until the training result of the image semantic segmentation model satisfies a preset convergence condition. The image semantic segmentation model is trained iteratively with different training images in the training set, and when the error computed by the cross-entropy loss function is smaller than a preset threshold or the number of iterations reaches a preset value, the training result is considered to have converged and training ends, yielding the trained image semantic segmentation model, which can be used directly to perform image semantic segmentation on an image to be processed (input image).
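A schematic training loop matching this description might look as follows; this is an illustrative sketch only, and the optimizer choice, learning rate, loss threshold and iteration cap are assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn

def train_until_converged(model, dataloader, lr=1e-3, loss_threshold=0.05, max_iters=100000):
    """Iterate over training images, correcting parameters with cross-entropy
    loss, until the loss falls below a preset threshold or the number of
    iterations reaches a preset value."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    iteration = 0
    while iteration < max_iters:
        for images, labels in dataloader:      # labels: pre-annotated segmentation maps
            logits = model(images)             # predicted semantic segmentation result
            loss = criterion(logits, labels)   # compare prediction with annotation
            optimizer.zero_grad()
            loss.backward()                    # correct model parameters
            optimizer.step()
            iteration += 1
            if loss.item() < loss_threshold or iteration >= max_iters:
                return model                   # training result has converged
    return model
```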
Optionally, the convolution layer group includes one or more convolution layers; each convolution layer has a convolution kernel size of 3 × 3, a stride of 1, and a padding value of 0 or 1. Note that the padding value comprises a padding width value and a padding height value. When the padding value is 1, both pad-w and pad-h are used, i.e., the padding width and the padding height applied when processing the original input image are both 1; for example, a feature of pixel size 2 × 2 is padded to 3 × 3.
Preferably, the convolution layer group includes a plurality of convolution layers, the number of which is greater than 1 and less than or equal to 3.
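For illustration (the channel counts and input size are assumptions, and PyTorch's symmetric padding convention is used), a 3 × 3, stride-1 convolution layer from such a group preserves the spatial size when the padding value is 1 and shrinks each side by 2 when it is 0:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # assumed input: 3-channel 64x64 image

# padding=(1, 1): pad-h = pad-w = 1, output stays 64x64 for a 3x3, stride-1 kernel.
conv_pad1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=(1, 1))
print(conv_pad1(x).shape)   # torch.Size([1, 16, 64, 64])

# padding=0: output shrinks to 62x62 (output = input - kernel + 1).
conv_pad0 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0)
print(conv_pad0(x).shape)   # torch.Size([1, 16, 62, 62])
```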
As shown in FIG. 2 and FIG. 3, the structure of the convolution module of the DeepLabV3plus network and of the bottom-layer convolution module comprises, from top to bottom: a convolutional layer, a BN layer, a ReLU layer, and a convolutional layer.
In a preferred embodiment of the present invention, the sample image training set includes a first training set and a second training set. The training images in the first training set are selected, from the Audi large-scale autonomous driving data set A2D2, from the traffic road scene images captured along the road direction by the front vehicle-mounted image capturing device and the traffic road scene images captured along the road direction by the rear vehicle-mounted image capturing device; the training images in the second training set are expressway scene images. In this way the data source of the training images is more objective and sufficient and has the universality of general traffic road scenes, so that the performance of the image semantic segmentation model can be improved on limited data; and because the training images of the second training set are expressway scene images, the expressway scenario is covered, ensuring that the trained image semantic segmentation model performs semantic segmentation, recognition and detection in the traffic environment scene it targets.
The A2D2 large-scale autonomous driving data set includes 41277 training images in total; in this embodiment, 33142 images from the front and rear perspectives (road scene images captured by the front device and by the rear device) are selected as the first training set and used to pre-train the semantic segmentation initial model.
Further, step S3 of the embodiment of the present invention includes: step S31, pre-training the semantic segmentation initial model with the first training set to obtain a semantic segmentation pre-training model; and step S32, adjusting the learning rate of model training and continuing to train the semantic segmentation pre-training model with the second training set to obtain the image semantic segmentation model.
Optionally, in step S32, the ratio of the learning rate used when training the semantic segmentation pre-training model to the learning rate used when training the semantic segmentation initial model is within [1/5, 1/2]. In the preferred embodiment of the present invention, this ratio is 1/3.
Preferably, the ratio of the number of training images in the second training set to the number of training images in the first training set is within [1/10, 1]. In a preferred embodiment of the invention, the number of training images in the second training set is equal to the number of training images in the first training set.
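Under the preferred settings just described (a learning-rate ratio of 1/3 between the two stages), the schedule of steps S31 and S32 could be sketched as follows; the base learning rate, optimizer and toy model are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Toy stand-in for the semantic segmentation initial model.
model = nn.Conv2d(3, 9, kernel_size=3, padding=1)

base_lr = 1e-3             # assumed learning rate for pre-training (step S31)
finetune_lr = base_lr / 3  # preferred 1/3 ratio between the two stages (step S32)

# Step S31: pre-train on the first training set (A2D2 front/rear road-direction images).
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
# ... run the training loop with the first training set here ...

# Step S32: lower the learning rate and continue training on the second training
# set (expressway scene images) to obtain the image semantic segmentation model.
for group in optimizer.param_groups:
    group["lr"] = finetune_lr
# ... continue the training loop with the second training set here ...
```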
Optionally, the ResnetX network in the basic network is one of a Resnet18 network, a Resnet34 network, a Resnet50 network, a Resnet101 network, and a Resnet152 network. In a preferred embodiment of the present invention, the ResnetX network in the basic network is a Resnet50 network.
When the image semantic segmentation models obtained by training the two network structures of FIG. 2 and FIG. 3 are used to perform semantic segmentation on input images of traffic road scenes, the standard semantic segmentation metric, mean Intersection over Union (MIOU), is calculated as follows:
[Per-class MIOU comparison table omitted from the source; it compares the models trained with the network structures of FIG. 2 and FIG. 3.]
as can be seen from the above table, the mlou value of the image semantic segmentation model obtained by training the network structure in fig. 3 after the network structure in fig. 2 is adjusted by the technical solution of the present invention is higher than the mlou value of the image semantic segmentation model obtained by training the network structure in fig. 2 for each class in the input image, which indicates that the semantic segmentation performance of the image semantic segmentation model obtained by training the network structure in fig. 3 is superior, and the semantic segmentation effect on the input image is better.
The invention also provides electronic equipment which can be a tablet computer, a portable computer, a notebook computer, a desktop computer and the like. The electronic equipment comprises a semantic segmentation network training device, an image semantic segmentation device, a memory, a storage controller and a processor. The memory, storage controller and processor elements are electrically connected to each other, directly or indirectly, to enable data transfer or interaction. Alternatively, the above components may be electrically connected to each other through one or more communication buses or signal lines. The semantic segmentation network training device and the image semantic segmentation device each comprise at least one software functional module which is stored in a memory in the form of software or firmware (firmware) or is solidified in an Operating System (OS) of the electronic device. The processor is used for executing an executable module stored in the memory, such as a software function module or a computer program included in the semantic segmentation network training device, and implementing or executing the methods, steps and logic block diagrams disclosed in the embodiments of the present invention to complete the image semantic segmentation model training method and obtain the image semantic segmentation model; the image semantic segmentation device comprises a software functional module or a computer program to realize semantic segmentation of the image to be processed.
The Memory may be a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), or the like. The memory is used for storing programs, and the processor executes the programs after receiving execution instructions. The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a voice processor, a video processor, and the like; it may also be a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and there may be other divisions in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A training method of an image semantic segmentation model for a traffic road scene is characterized by comprising the following steps:
step S1, constructing a semantic segmentation basic model with a basic network combined by a DeepLabV3plus network and a ResnetX network, wherein an upsampling operator of the DeepLabV3plus network is calculated by using a nearest neighbor interpolation mode;
step S2, adjusting the structure of the basic network to form a semantic segmentation initial model, the adjustment being as follows: copying the convolution module of the DeepLabV3plus network to the bottom of the DeepLabV3plus network as a bottom-layer convolution module, passing the input end of the basic network through a convolution layer group and merging it with the output end of the DeepLabV3plus network via a skip connection, taking the merged end as the input end of the bottom-layer convolution module, and taking the output of the bottom-layer convolution module as the final semantic segmentation result;
and step S3, training the semantic segmentation initial model by using a sample image training set of the traffic road scene to obtain an image semantic segmentation model.
2. The training method for image semantic segmentation models according to claim 1, wherein the convolutional layer group comprises one or more convolutional layers, the convolutional kernel size of each convolutional layer is 3 x 3, the convolutional step size is 1, and the padding value is 0 or 1.
3. The training method for the image semantic segmentation model according to claim 2, wherein the convolutional layer group comprises a plurality of convolutional layers, and the number of the convolutional layers is greater than 1 and less than or equal to 3.
4. The training method of the image semantic segmentation model according to claim 2, wherein the filling values comprise a filling width value and a filling height value.
5. The training method of the image semantic segmentation model according to claim 1, wherein the structure of the convolution module of the DeepLabV3plus network and of the bottom-layer convolution module comprises, from top to bottom: a convolutional layer, a BN layer, a ReLU layer, and a convolutional layer.
6. The training method of the image semantic segmentation model according to claim 1, wherein the sample image training set comprises a first training set and a second training set, and the training images in the first training set are selected, from the Audi large-scale autonomous driving data set A2D2, from the traffic road scene images captured along the road direction by the front vehicle-mounted image capturing device and the traffic road scene images captured along the road direction by the rear vehicle-mounted image capturing device; and the training images in the second training set are expressway scene images.
7. The training method for image semantic segmentation model according to claim 6, wherein the step S3 includes:
step S31, pre-training the semantic segmentation initial model by using the first training set to obtain a semantic segmentation pre-training model;
and step S32, adjusting the learning rate of model training, and continuing to train the semantic segmentation pre-training model by using the second training set to obtain the image semantic segmentation model.
8. The training method of the image semantic segmentation model according to claim 7, wherein in the step S32, the ratio of the learning rate used when training the semantic segmentation pre-training model to the learning rate used when training the semantic segmentation initial model is within [1/5, 1/2].
9. The training method of the image semantic segmentation model according to claim 6, wherein the ratio of the number of training images in the second training set to the number of training images in the first training set is within [1/10, 1].
10. The method for training the image semantic segmentation model according to claim 1, wherein the ResnetX network in the base network is one of a Resnet18 network, a Resnet34 network, a Resnet50 network, a Resnet101 network, and a Resnet152 network.
CN202210103540.7A 2022-01-28 2022-01-28 Image semantic segmentation model training method for traffic road scene Pending CN114419058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103540.7A CN114419058A (en) 2022-01-28 2022-01-28 Image semantic segmentation model training method for traffic road scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210103540.7A CN114419058A (en) 2022-01-28 2022-01-28 Image semantic segmentation model training method for traffic road scene

Publications (1)

Publication Number Publication Date
CN114419058A (en) 2022-04-29

Family

ID=81279173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103540.7A Pending CN114419058A (en) 2022-01-28 2022-01-28 Image semantic segmentation model training method for traffic road scene

Country Status (1)

Country Link
CN (1) CN114419058A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100491A (en) * 2022-08-25 2022-09-23 山东省凯麟环保设备股份有限公司 Abnormal robust segmentation method and system for complex automatic driving scene
CN115100491B (en) * 2022-08-25 2022-11-18 山东省凯麟环保设备股份有限公司 Abnormal robust segmentation method and system for complex automatic driving scene
US11954917B2 (en) 2022-08-25 2024-04-09 Shandong Kailin Environmental Protection Equipment Co., Ltd. Method of segmenting abnormal robust for complex autonomous driving scenes and system thereof
CN115601550A (en) * 2022-12-13 2023-01-13 深圳思谋信息科技有限公司(Cn) Model determination method, model determination device, computer equipment and computer-readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination