CN109145983A - Real-time scene image semantic segmentation method based on a lightweight network - Google Patents
Real-time scene image semantic segmentation method based on a lightweight network
- Publication number: CN109145983A
- Application number: CN201810952416.1A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — Physics
- G06 — Computing; calculating or counting
- G06F — Electric digital data processing
- G06F18/00 — Pattern recognition
- G06F18/20 — Analysing
- G06F18/24 — Classification techniques
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/21 — Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a real-time scene image semantic segmentation method based on a lightweight network, comprising the following steps: S1, training a lightweight network classification model on a scene image dataset; S2, constructing a deep convolutional neural network model based on the lightweight network classification model; S3, inputting the training data of the scene image dataset into the deep convolutional neural network, outputting predicted images, comparing them with the semantic annotation images in the dataset, and computing the cross-entropy loss as the objective function to obtain a trained image semantic segmentation model; S4, inputting a real-time scene image to be tested into the image semantic segmentation model to obtain the image semantic segmentation result. By using a modified MobileNetV2 as the base network, the invention can extract image features efficiently; by using quick link blocks during upsampling, it uses parameters more efficiently and further improves the speed of the semantic segmentation model.
Description
Technical field
The invention belongs to the technical field of image semantic segmentation, and in particular relates to a real-time scene image semantic segmentation method based on a lightweight network.
Background technique
Scene semantic segmentation is the application of image semantic segmentation to scene images. It is of crucial importance to downstream computer vision tasks, such as distinguishing pedestrians from vehicles in autonomous driving.
Semantic segmentation is an important component of many practical applications, such as machine vision, autonomous driving, and mobile computing. Accurately understanding the surrounding scene is essential to the decisions these applications make, so running time is a key factor when evaluating a semantic segmentation system in a practical setting. The development of deep convolutional neural networks has brought remarkable progress in semantic segmentation, but most related research has concentrated on improving segmentation accuracy rather than the computational efficiency of the model. The effectiveness of these networks largely depends on the design of complex deep and wide models, which involve many operations and parameters. However, many practical applications such as automated driving systems typically run on embedded devices, where computing and storage resources are relatively limited. Semantic segmentation networks demand more computing resources than some mobile or embedded systems can provide, leading to models whose accuracy is high but whose speed is far from sufficient. MobileNetV2 is a neural network architecture designed for mobile and resource-limited environments; it substantially reduces the number of operations and the memory footprint while maintaining the same accuracy.
In recent years, most state-of-the-art image semantic segmentation methods have been based on deep convolutional neural networks. A typical semantic segmentation network follows an encoder-decoder structure: the encoder is an image downsampling process responsible for extracting coarse, abstract semantic features, followed by a decoder, an image upsampling process responsible for upsampling the downsampled image features back to the original resolution of the input image. Lightweight networks have shown excellent results on image classification tasks; using a lightweight network as the base network (i.e., the encoder) to extract scene image features quickly can improve speed without sacrificing accuracy.
Summary of the invention
In view of the above deficiencies in the prior art, the real-time scene image semantic segmentation method based on a lightweight network provided by the invention solves the problem of slow image semantic segmentation in the prior art.
To achieve the above object of the invention, the technical solution adopted by the invention is as follows: a real-time scene image semantic segmentation method based on a lightweight network, comprising the following steps:
S1, training a lightweight network classification model that maps images to class labels according to a scene image dataset;
S2, constructing a deep convolutional neural network model based on the lightweight network classification model;
S3, inputting the training data of the scene image dataset into the deep convolutional neural network, outputting predicted images, comparing them with the semantic annotation images in the scene image dataset, and computing the cross-entropy loss as the objective function to obtain a trained image semantic segmentation model;
S4, inputting a real-time scene image to be tested into the image semantic segmentation model to obtain the image semantic segmentation result.
Further, in step S1:
The lightweight network classification model comprises, connected in sequence, 1 conv2d unit, 17 bottleneck units, 1 1×1 conv2d unit, 1 7×7 avgpool unit, and 1 1×1 conv2d unit;
Each bottleneck unit comprises a first INPLACE-ABN layer, a second INPLACE-ABN layer, and a projection layer.
Further, in the lightweight network classification model:
When the stride is 1, the structure of the bottleneck unit is as follows: the first INPLACE-ABN layer, the second INPLACE-ABN layer, and the projection layer are connected in series; the input is connected simultaneously to the first INPLACE-ABN layer and to the projection layer, and the projection layer serves as the output of the bottleneck unit;
When the stride is 2, the structure of the bottleneck unit is as follows: the first INPLACE-ABN layer, the second INPLACE-ABN layer, and the projection layer are connected in series; the input is connected only to the first INPLACE-ABN layer, and the projection layer serves as the output of the bottleneck unit.
Further, in step S3:
The image semantic segmentation model has an encoder-decoder network structure;
The encoder is the lightweight network classification model and is used to extract image features;
The decoder comprises sequentially connected quick link blocks and one 1×1 convolutional layer, and is used to restore the image resolution.
Further, the quick link block comprises, connected in sequence, 1 1×1 convolutional layer, 1 3×3 depthwise separable convolution unit, 1 1×1 convolutional layer, and 1 quick connection;
The depthwise separable convolution unit comprises a sequentially connected depthwise convolutional layer and a pointwise convolutional layer.
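A main motivation for the depthwise separable factorization above is its parameter saving over a standard convolution. The following sketch is illustrative only (the function names and the channel sizes are example values, not taken from the patent); it compares the two parameter counts:

```python
# Parameter-count comparison between a standard 3x3 convolution and the
# depthwise-separable factorization (depthwise 3x3 + pointwise 1x1).
# Illustrative sketch; channel sizes below are example values.

def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes space and channels jointly.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel; pointwise: 1x1 mixing.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 96, 96
std = standard_conv_params(k, c_in, c_out)   # 3*3*96*96 = 82944
sep = separable_conv_params(k, c_in, c_out)  # 3*3*96 + 96*96 = 10080
print(std, sep)
```

For these example sizes the factorization uses roughly 8× fewer parameters, which is the source of the "lightweight" behavior.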
Further, in step S3, the training process of the deep convolutional neural network is as follows:
S31, preprocessing the training data images in the scene image dataset;
S32, using the parameter values of the trained lightweight network classification model as the initial values of the deep convolutional neural network model;
S33, applying data augmentation to the training data images;
S34, using the sum of the per-pixel cross-entropy losses over the augmented training data images as the loss function, and completing the training of the deep convolutional neural network model with stochastic gradient descent and a polynomial learning-rate strategy.
Further, the learning rate lr of the polynomial learning-rate strategy in step S34 is:
lr = baselr × (1 − iter / total_iter)^power
Wherein, baselr is the initial learning rate;
iter is the current iteration number;
total_iter is the total number of iterations;
the exponent power is the polynomial power.
Further:
In step S31, the preprocessing of the training data images consists of cropping the images to a size of 224×224;
In step S33, the data augmentation comprises randomly flipping the image, randomly scaling the image by a factor between 0.5 and 2, and randomly rotating the image between −10 and 10 degrees.
The beneficial effects of the invention are as follows: the real-time scene image semantic segmentation method based on a lightweight network provided by the invention uses a modified MobileNetV2 as the base network, which can extract image features efficiently; during upsampling, the use of quick link blocks makes parameter utilization more efficient and further improves the speed of the semantic segmentation model.
Description of the drawings
Fig. 1 is a flow chart of the implementation of the real-time image semantic segmentation method based on a lightweight network in an embodiment provided by the invention.
Fig. 2 is a schematic diagram of the two bottleneck layer structures in an embodiment provided by the invention.
Fig. 3 is a flow chart of the training of the deep convolutional neural network in an embodiment provided by the invention.
Specific embodiment
Specific embodiments of the invention are described below to facilitate understanding by those skilled in the art. It should be understood that the invention is not limited to the scope of these specific embodiments; for those of ordinary skill in the art, any variation within the spirit and scope of the invention as defined and determined by the appended claims is obvious, and all innovations making use of the inventive concept fall within its protection.
As shown in Fig. 1, a real-time scene image semantic segmentation method based on a lightweight network comprises the following steps:
S1, training a lightweight network classification model that maps images to class labels according to a scene image dataset;
The scene image dataset is the Cityscapes street scene dataset, which contains 20 class annotations (including 1 background class) and covers 50 European cities, with 5000 finely annotated images in total, of which 2975 are used as the training set, 500 as the validation set, and 1525 as the test set.
The network structure of the lightweight network classification model is shown in Table 1:
Table 1: Network structure of the lightweight network classification model

| Layer | Input | Operator | t | c | n | s |
|---|---|---|---|---|---|---|
| 1 | 224²×3 | conv2d | - | 32 | 1 | 2 |
| 2 | 112²×32 | bottleneck | 1 | 16 | 1 | 1 |
| 3 | 112²×16 | bottleneck | 6 | 24 | 2 | 2 |
| 4 | 56²×24 | bottleneck | 6 | 32 | 3 | 2 |
| 5 | 28²×32 | bottleneck | 6 | 64 | 4 | 2 |
| 6 | 14²×64 | bottleneck | 6 | 96 | 3 | 1 |
| 7 | 14²×96 | bottleneck | 6 | 160 | 3 | 2 |
| 8 | 7²×160 | bottleneck | 6 | 320 | 1 | 1 |
| 9 | 7²×320 | conv2d 1×1 | - | 1280 | 1 | 1 |
| 10 | 7²×1280 | avgpool 7×7 | - | - | 1 | - |
| 11 | 1×1×1280 | conv2d 1×1 | - | k | - | - |

In the table, t denotes the expansion factor, c the number of output channels, n the number of repetitions, and s the stride.
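The spatial sizes in Table 1 follow directly from the stride column: each stride-2 stage halves the resolution of the 224×224 input, and stride-1 stages preserve it. A short illustrative sketch (the function name is ours, not part of the patent) derives the column of sizes:

```python
# Derive the spatial resolutions of Table 1 from the stride (s) column alone:
# a stride-2 layer halves the feature-map side, a stride-1 layer keeps it.

def feature_sizes(input_size, strides):
    sizes = []
    size = input_size
    for s in strides:
        size = size // s  # stride-2 halves the spatial resolution
        sizes.append(size)
    return sizes

# s values for layers 1-8 (the initial conv2d plus the 7 bottleneck stages)
strides = [2, 1, 2, 2, 2, 1, 2, 1]
print(feature_sizes(224, strides))
# Each output size equals the Input side of the next row of Table 1.
```

Running this reproduces the sequence 112, 112, 56, 28, 14, 14, 7, 7 that appears in the Input column of Table 1.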
The lightweight network classification model comprises, connected in sequence, 1 conv2d unit, 17 bottleneck units, 1 1×1 conv2d unit, 1 7×7 avgpool unit, and 1 1×1 conv2d unit;
Each bottleneck unit comprises a first INPLACE-ABN layer, a second INPLACE-ABN layer, and a projection layer. INPLACE-ABN is a recent method for efficiently reducing the memory consumption of deep neural network training; it can replace the traditional batch normalization and activation layers and brings a better semantic segmentation effect.
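The idea of fusing batch normalization and the activation into one in-place operation can be sketched as follows. This is a minimal NumPy forward-pass sketch only, with an assumed leaky-ReLU activation and function name of our choosing; INPLACE-ABN's actual memory savings come from its in-place backward pass, which is not shown:

```python
import numpy as np

def fused_bn_act(x, gamma, beta, eps=1e-5, slope=0.01):
    # Normalize over batch/spatial axes, scale-shift, then apply a leaky-ReLU,
    # writing every intermediate result back into x (no extra activation buffer).
    # Forward-pass sketch only; assumptions: leaky-ReLU activation, NCHW layout.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    np.subtract(x, mean, out=x)
    np.divide(x, np.sqrt(var + eps), out=x)
    np.multiply(x, gamma.reshape(1, -1, 1, 1), out=x)
    np.add(x, beta.reshape(1, -1, 1, 1), out=x)
    np.maximum(x, slope * x, out=x)  # leaky ReLU, in place
    return x

x = np.random.randn(2, 3, 8, 8)
y = fused_bn_act(x, gamma=np.ones(3), beta=np.zeros(3))
assert y is x  # computed in place: the input buffer is reused
```

Storing only this single fused buffer, instead of separate normalization and activation outputs, is what lets INPLACE-ABN cut training memory.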
As shown in Fig. 2, when the stride is 1, the structure of the bottleneck unit is as follows: the first INPLACE-ABN layer, the second INPLACE-ABN layer, and the projection layer are connected in series; the input is connected simultaneously to the first INPLACE-ABN layer and to the projection layer, and the projection layer serves as the output of the bottleneck unit;
When the stride is 2, the structure of the bottleneck unit is as follows: the first INPLACE-ABN layer, the second INPLACE-ABN layer, and the projection layer are connected in series; the input is connected only to the first INPLACE-ABN layer, and the projection layer serves as the output of the bottleneck unit.
S2, constructing a deep convolutional neural network model based on the lightweight network classification model;
The feature maps output by layer 3 of Table 1 (size 112²×16), layer 4 (size 56²×24), layer 5 (size 28²×32), and layer 7 (size 14²×96) are used as the first, second, third, and fourth layers of the feature extraction network (the encoder), denoted Encoder_1, Encoder_2, Encoder_3, and Encoder_4 respectively.
Encoder_4 is upsampled and, together with Encoder_3, input into a quick link block, producing Decoder_1. Decoder_1 is upsampled and, together with Encoder_2, input into a quick link block, producing Decoder_2. Decoder_2 is upsampled and, together with Encoder_1, input into a quick link block, producing Decoder_3. Finally, Encoder_4, Decoder_1, Decoder_2, and Decoder_3 are upsampled to the input image size, the four resulting feature maps are concatenated and passed through a 1×1 convolution, the resulting feature map and the semantic segmentation annotation image are used to compute the loss function, and the error is backpropagated to update the weights, yielding the semantic segmentation network model.
S3, inputting the training data of the scene image dataset into the deep convolutional neural network, outputting predicted images, comparing them with the semantic annotation images in the scene image dataset, and computing the cross-entropy loss as the objective function to obtain a trained image semantic segmentation model;
In step S3:
The image semantic segmentation model has an encoder-decoder network structure;
The encoder is the lightweight network classification model and is used to extract image features; in order to retain the spatial information of the image, the fully connected layer of the lightweight network is removed, and the remainder is used as the encoder.
The decoder comprises sequentially connected quick link blocks and one 1×1 convolutional layer, and is used to restore the image resolution. The quick link blocks upsample the feature maps and combine them with the feature maps of the encoder; finally, each feature map output by the decoder is upsampled to the original image size and the maps are concatenated, a 1×1 convolution is applied, and the resulting feature map and the semantic segmentation annotation image are used for error backpropagation, yielding the neural network model.
The quick link block comprises, connected in sequence, 1 1×1 convolutional layer, 1 3×3 depthwise separable convolution unit, 1 1×1 convolutional layer, and 1 quick connection, wherein the 3×3 depthwise separable convolution unit upsamples the feature map.
The depthwise separable convolution unit comprises a sequentially connected depthwise convolutional layer and a pointwise convolutional layer. The depthwise convolutional layer performs lightweight filtering by applying one convolution filter to each input channel; the second layer is a 1×1 convolutional layer, i.e., the pointwise convolutional layer, which builds new features by computing linear combinations of the input channels. Depthwise separable convolution decouples the spatial and channel dimensions, achieving model acceleration, and is widely used in lightweight networks.
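The decoupling described above — per-channel spatial filtering followed by 1×1 channel mixing — can be sketched operationally in NumPy. This is an illustrative implementation with assumed 'same' padding and example tensor sizes, not the patent's exact layers:

```python
import numpy as np

def depthwise_conv3x3(x, w):
    # x: (C, H, W), w: (C, 3, 3). One 3x3 filter per channel with 'same'
    # zero padding: each channel is filtered independently (spatial mixing only).
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def pointwise_conv(x, w):
    # w: (C_out, C_in). 1x1 convolution: channel mixing only, no spatial mixing.
    return np.einsum('oc,chw->ohw', w, x)

x = np.random.randn(8, 6, 6)
y = pointwise_conv(depthwise_conv3x3(x, np.random.randn(8, 3, 3)),
                   np.random.randn(16, 8))
assert y.shape == (16, 6, 6)
```

Because the spatial step never mixes channels and the channel step never mixes positions, the two factors together approximate a full convolution at a fraction of the cost.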
The cross-entropy loss function is:
loss = −[y·log ŷ + (1 − y)·log(1 − ŷ)]
In the formula, y denotes the sample label and ŷ denotes the predicted output.
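For segmentation, step S34 sums this cross entropy over every pixel. The sketch below is an assumed multi-class, per-pixel form consistent with that description (the function name and the toy 2-class example are ours, not from the patent):

```python
import numpy as np

def pixel_cross_entropy(pred, label, eps=1e-12):
    # pred: (K, H, W) softmax probabilities over K classes;
    # label: (H, W) integer class ids. Returns the sum of per-pixel
    # cross-entropy losses, as used for the objective in step S3.
    H, W = label.shape
    p = pred[label, np.arange(H)[:, None], np.arange(W)]  # prob of true class
    return -np.log(p + eps).sum()

# Toy 2-class, 2x2 example (probabilities per class per pixel)
pred = np.array([[[0.9, 0.4], [0.5, 0.2]],
                 [[0.1, 0.6], [0.5, 0.8]]])
label = np.array([[0, 1], [0, 1]])
loss = pixel_cross_entropy(pred, label)
print(loss)  # -(ln 0.9 + ln 0.6 + ln 0.5 + ln 0.8)
```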
As shown in Fig. 3, in step S3, the training process of the deep convolutional neural network is as follows:
S31, preprocessing the training data images in the scene image dataset;
The preprocessing of the training data images consists of cropping the images to a size of 224×224;
S32, using the parameter values of the trained lightweight network classification model as the initial values of the deep convolutional neural network model;
S33, applying data augmentation to the training data images;
In step S33, the data augmentation comprises randomly flipping the image, randomly scaling the image by a factor between 0.5 and 2, and randomly rotating the image between −10 and 10 degrees.
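The augmentation pipeline of step S33 can be sketched as follows. This is an illustrative sketch with assumptions of our own: horizontal flip and nearest-neighbor scaling are implemented directly, while the sampled rotation angle would normally be applied with an image library, and the function name `augment` is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    # Step S33 augmentation: random flip, random scale in [0.5, 2],
    # random rotation angle in [-10, 10] degrees (angle sampled only;
    # applying it would use an image library such as PIL).
    if rng.random() < 0.5:
        img = img[:, ::-1]                  # random horizontal flip
    factor = rng.uniform(0.5, 2.0)          # random scale factor
    h, w = img.shape[:2]
    nh, nw = max(1, int(h * factor)), max(1, int(w * factor))
    rows = np.arange(nh) * h // nh          # nearest-neighbor resize indices
    cols = np.arange(nw) * w // nw
    img = img[rows][:, cols]
    angle = rng.uniform(-10.0, 10.0)        # random rotation angle (degrees)
    return img, angle

out, angle = augment(np.zeros((224, 224, 3)))
```

For segmentation training, the same flip, scale, and rotation would also be applied to the annotation image so that labels stay aligned with pixels.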
S34, using the sum of the per-pixel cross-entropy losses over the augmented training data images as the loss function, and completing the training of the deep convolutional neural network model with stochastic gradient descent and a polynomial learning-rate strategy;
The learning rate lr of the polynomial learning-rate strategy in step S34 is:
lr = baselr × (1 − iter / total_iter)^power
Wherein, baselr is the initial learning rate, set to 0.001;
iter is the current iteration number;
total_iter is the total number of iterations;
the exponent power is the polynomial power, set to 0.9.
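The schedule above is straightforward to compute; a minimal sketch (using the embodiment's baselr = 0.001 and power = 0.9, with a function name of our choosing and an example total of 100 iterations):

```python
def poly_lr(iteration, total_iter, base_lr=0.001, power=0.9):
    # Polynomial ("poly") learning-rate schedule from step S34:
    # lr = base_lr * (1 - iter / total_iter) ** power,
    # with base_lr = 0.001 and power = 0.9 as set in the embodiment.
    return base_lr * (1 - iteration / total_iter) ** power

lrs = [poly_lr(i, 100) for i in (0, 50, 99)]
print(lrs)  # starts at 0.001 and decays smoothly toward 0
```

The schedule starts at baselr and decays almost linearly (power close to 1) to zero over the training run.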
S4, inputting the real-time scene image to be tested into the image semantic segmentation model to obtain the image semantic segmentation result.
The beneficial effects of the invention are as follows: the real-time scene image semantic segmentation method based on a lightweight network provided by the invention uses a modified MobileNetV2 as the base network, which can extract image features efficiently; during upsampling, the use of quick link blocks makes parameter utilization more efficient and further improves the speed of the semantic segmentation model.
Claims (8)
1. A real-time scene image semantic segmentation method based on a lightweight network, characterized by comprising the following steps:
S1, training a lightweight network classification model that maps images to class labels according to a scene image dataset;
S2, constructing a deep convolutional neural network model based on the lightweight network classification model;
S3, inputting the training data of the scene image dataset into the deep convolutional neural network, outputting predicted images, comparing them with the semantic annotation images in the scene image dataset, and computing the cross-entropy loss as the objective function to obtain a trained image semantic segmentation model;
S4, inputting a real-time scene image to be tested into the image semantic segmentation model to obtain the image semantic segmentation result.
2. The real-time scene image semantic segmentation method based on a lightweight network according to claim 1, characterized in that, in step S1:
the lightweight network classification model comprises, connected in sequence, 1 conv2d unit, 17 bottleneck units, 1 1×1 conv2d unit, 1 7×7 avgpool unit, and 1 1×1 conv2d unit;
each bottleneck unit comprises a first INPLACE-ABN layer, a second INPLACE-ABN layer, and a projection layer.
3. The real-time scene image semantic segmentation method based on a lightweight network according to claim 2, characterized in that, in the lightweight network classification model:
when the stride is 1, the structure of the bottleneck unit is as follows: the first INPLACE-ABN layer, the second INPLACE-ABN layer, and the projection layer are connected in series; the input is connected simultaneously to the first INPLACE-ABN layer and to the projection layer, and the projection layer serves as the output of the bottleneck unit;
when the stride is 2, the structure of the bottleneck unit is as follows: the first INPLACE-ABN layer, the second INPLACE-ABN layer, and the projection layer are connected in series; the input is connected only to the first INPLACE-ABN layer, and the projection layer serves as the output of the bottleneck unit.
4. The real-time scene image semantic segmentation method based on a lightweight network according to claim 1, characterized in that, in step S3:
the image semantic segmentation model has an encoder-decoder network structure;
the encoder is the lightweight network classification model and is used to extract image features;
the decoder comprises sequentially connected quick link blocks and one 1×1 convolutional layer, and is used to restore the image resolution.
5. The real-time image semantic segmentation method based on a lightweight network according to claim 4, characterized in that:
the quick link block comprises, connected in sequence, 1 1×1 convolutional layer, 1 3×3 depthwise separable convolution unit, 1 1×1 convolutional layer, and 1 quick connection;
the depthwise separable convolution unit comprises a sequentially connected depthwise convolutional layer and a pointwise convolutional layer.
6. The real-time scene image semantic segmentation method based on a lightweight network according to claim 4, characterized in that, in step S3, the training process of the deep convolutional neural network is as follows:
S31, preprocessing the training data images in the scene image dataset;
S32, using the parameter values of the trained lightweight network classification model as the initial values of the deep convolutional neural network model;
S33, applying data augmentation to the training data images;
S34, using the sum of the per-pixel cross-entropy losses over the augmented training data images as the loss function, and completing the training of the deep convolutional neural network model with stochastic gradient descent and a polynomial learning-rate strategy.
7. The real-time scene image semantic segmentation method based on a lightweight network according to claim 6, characterized in that the learning rate lr of the polynomial learning-rate strategy in step S34 is:
lr = baselr × (1 − iter / total_iter)^power
wherein, baselr is the initial learning rate;
iter is the current iteration number;
total_iter is the total number of iterations;
the exponent power is the polynomial power.
8. The real-time image semantic segmentation method based on a lightweight network according to claim 7, characterized in that:
in step S31, the preprocessing of the training data images consists of cropping the images to a size of 224×224;
in step S33, the data augmentation comprises randomly flipping the image, randomly scaling the image by a factor between 0.5 and 2, and randomly rotating the image between −10 and 10 degrees.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810952416.1A | 2018-08-21 | 2018-08-21 | Real-time scene image semantic segmentation method based on a lightweight network |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109145983A | 2019-01-04 |
Family ID: 64790373
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110009015A (en) * | 2019-03-25 | 2019-07-12 | 西北工业大学 | EO-1 hyperion small sample classification method based on lightweight network and semi-supervised clustering |
CN110096202A (en) * | 2019-04-23 | 2019-08-06 | 华中师范大学 | A kind of lightweight image automatic cutting system and method based on deeply study |
CN110147794A (en) * | 2019-05-21 | 2019-08-20 | 东北大学 | A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning |
CN110276766A (en) * | 2019-06-27 | 2019-09-24 | 广州久邦世纪科技有限公司 | A kind of method and device of portrait segmentation |
CN110287837A (en) * | 2019-06-17 | 2019-09-27 | 上海大学 | Sea obstacle detection method based on prior estimate network and space constraint mixed model |
CN110414428A (en) * | 2019-07-26 | 2019-11-05 | 厦门美图之家科技有限公司 | A method of generating face character information identification model |
CN110427821A (en) * | 2019-06-27 | 2019-11-08 | 高新兴科技集团股份有限公司 | A kind of method for detecting human face and system based on lightweight convolutional neural networks |
CN110490858A (en) * | 2019-08-21 | 2019-11-22 | 西安工程大学 | A kind of fabric defect Pixel-level classification method based on deep learning |
CN110531774A (en) * | 2019-09-16 | 2019-12-03 | 京东数字科技控股有限公司 | Obstacle Avoidance, device, robot and computer readable storage medium |
CN110570429A (en) * | 2019-08-30 | 2019-12-13 | 华南理工大学 | Lightweight real-time semantic segmentation method based on three-dimensional point cloud |
CN110597086A (en) * | 2019-08-19 | 2019-12-20 | 深圳元戎启行科技有限公司 | Simulation scene generation method and unmanned system test method |
CN110824481A (en) * | 2019-10-28 | 2020-02-21 | 兰州大方电子有限责任公司 | Quantitative precipitation prediction method based on radar reflectivity extrapolation |
CN111062950A (en) * | 2019-11-29 | 2020-04-24 | 南京恩博科技有限公司 | Method, storage medium and equipment for multi-class forest scene image segmentation |
CN111079649A (en) * | 2019-12-17 | 2020-04-28 | 西安电子科技大学 | Remote sensing image ground feature classification method based on lightweight semantic segmentation network |
CN111144418A (en) * | 2019-12-31 | 2020-05-12 | 北京交通大学 | Railway track area segmentation and extraction method |
CN111598095A (en) * | 2020-03-09 | 2020-08-28 | 浙江工业大学 | Deep learning-based urban road scene semantic segmentation method |
CN111696110A (en) * | 2020-06-04 | 2020-09-22 | 山东大学 | Scene segmentation method and system |
CN111814736A (en) * | 2020-07-23 | 2020-10-23 | 上海东普信息科技有限公司 | Express bill information identification method, device, equipment and storage medium |
CN111950572A (en) * | 2019-05-14 | 2020-11-17 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer-readable storage medium for training classifier |
CN112800998A (en) * | 2021-02-05 | 2021-05-14 | 南京邮电大学 | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA |
CN112819000A (en) * | 2021-02-24 | 2021-05-18 | 长春工业大学 | Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium |
CN112825121A (en) * | 2019-11-20 | 2021-05-21 | 北京眼神智能科技有限公司 | Deep convolutional neural network initialization and training method, device, medium and equipment |
CN113052189A (en) * | 2021-03-30 | 2021-06-29 | 电子科技大学 | Improved MobileNet V3 feature extraction network |
CN113361505A (en) * | 2021-08-10 | 2021-09-07 | 杭州一知智能科技有限公司 | Non-specific human sign language translation method and system based on contrast decoupling element learning |
CN113822921A (en) * | 2021-11-22 | 2021-12-21 | 四川大学 | Side film intelligent head shadow measuring method based on deep neural network |
WO2022188154A1 (en) * | 2021-03-12 | 2022-09-15 | 深圳市大疆创新科技有限公司 | Front view to top view semantic segmentation projection calibration parameter determination method and adaptive conversion method, image processing device, mobile platform, and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3171297A1 (en) * | 2015-11-18 | 2017-05-24 | CentraleSupélec | Joint boundary detection image segmentation and object recognition using deep learning |
CN107480726A (en) * | 2017-08-25 | 2017-12-15 | 电子科技大学 | A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon |
CN107704866A (en) * | 2017-06-15 | 2018-02-16 | 清华大学 | Multitask Scene Semantics based on new neural network understand model and its application |
CN107766794A (en) * | 2017-09-22 | 2018-03-06 | 天津大学 | The image, semantic dividing method that a kind of Fusion Features coefficient can learn |
US20180181864A1 (en) * | 2016-12-27 | 2018-06-28 | Texas Instruments Incorporated | Sparsified Training of Convolutional Neural Networks |
Non-Patent Citations (3)

| Title |
|---|
| Mark Sandler et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks", arXiv preprint arXiv:1801.04381v3 |
| Mennatullah Siam et al., "RTSeg: Real-Time Semantic Segmentation Comparative Study", arXiv preprint arXiv:1803.02758v1 |
| Samuel Rota Bulò et al., "In-Place Activated BatchNorm for Memory-Optimized Training of DNNs", arXiv preprint arXiv:1712.02616v2 |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110009015A (en) * | 2019-03-25 | 2019-07-12 | 西北工业大学 | EO-1 hyperion small sample classification method based on lightweight network and semi-supervised clustering |
CN110096202A (en) * | 2019-04-23 | 2019-08-06 | 华中师范大学 | A kind of lightweight image automatic cutting system and method based on deeply study |
CN110096202B (en) * | 2019-04-23 | 2020-11-20 | 华中师范大学 | Automatic lightweight image clipping system and method based on deep reinforcement learning |
CN111950572A (en) * | 2019-05-14 | 2020-11-17 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer-readable storage medium for training classifier |
CN110147794A (en) * | 2019-05-21 | 2019-08-20 | 东北大学 | A real-time outdoor scene segmentation method for unmanned vehicles based on deep learning |
CN110287837A (en) * | 2019-06-17 | 2019-09-27 | 上海大学 | Maritime obstacle detection method based on a prior-estimation network and a spatially constrained mixture model |
CN110276766A (en) * | 2019-06-27 | 2019-09-24 | 广州久邦世纪科技有限公司 | A portrait segmentation method and device |
CN110427821A (en) * | 2019-06-27 | 2019-11-08 | 高新兴科技集团股份有限公司 | A face detection method and system based on a lightweight convolutional neural network |
CN110414428A (en) * | 2019-07-26 | 2019-11-05 | 厦门美图之家科技有限公司 | A method of generating a facial attribute information recognition model |
CN110597086B (en) * | 2019-08-19 | 2023-01-13 | 深圳元戎启行科技有限公司 | Simulation scene generation method, unmanned driving system test method and device |
CN110597086A (en) * | 2019-08-19 | 2019-12-20 | 深圳元戎启行科技有限公司 | Simulation scene generation method and unmanned system test method |
CN110490858A (en) * | 2019-08-21 | 2019-11-22 | 西安工程大学 | A pixel-level fabric defect classification method based on deep learning |
CN110490858B (en) * | 2019-08-21 | 2022-12-13 | 西安工程大学 | Fabric defective pixel level classification method based on deep learning |
CN110570429A (en) * | 2019-08-30 | 2019-12-13 | 华南理工大学 | Lightweight real-time semantic segmentation method based on three-dimensional point cloud |
CN110570429B (en) * | 2019-08-30 | 2021-12-17 | 华南理工大学 | Lightweight real-time semantic segmentation method based on three-dimensional point cloud |
CN110531774A (en) * | 2019-09-16 | 2019-12-03 | 京东数字科技控股有限公司 | Obstacle avoidance method, device, robot and computer-readable storage medium |
CN110824481A (en) * | 2019-10-28 | 2020-02-21 | 兰州大方电子有限责任公司 | Quantitative precipitation prediction method based on radar reflectivity extrapolation |
CN112825121A (en) * | 2019-11-20 | 2021-05-21 | 北京眼神智能科技有限公司 | Deep convolutional neural network initialization and training method, device, medium and equipment |
CN111062950A (en) * | 2019-11-29 | 2020-04-24 | 南京恩博科技有限公司 | Method, storage medium and equipment for multi-class forest scene image segmentation |
CN111079649A (en) * | 2019-12-17 | 2020-04-28 | 西安电子科技大学 | Remote sensing image ground feature classification method based on lightweight semantic segmentation network |
CN111144418B (en) * | 2019-12-31 | 2022-12-02 | 北京交通大学 | Railway track area segmentation and extraction method |
CN111144418A (en) * | 2019-12-31 | 2020-05-12 | 北京交通大学 | Railway track area segmentation and extraction method |
CN111598095B (en) * | 2020-03-09 | 2023-04-07 | 浙江工业大学 | Urban road scene semantic segmentation method based on deep learning |
CN111598095A (en) * | 2020-03-09 | 2020-08-28 | 浙江工业大学 | Deep learning-based urban road scene semantic segmentation method |
CN111696110B (en) * | 2020-06-04 | 2022-04-01 | 山东大学 | Scene segmentation method and system |
CN111696110A (en) * | 2020-06-04 | 2020-09-22 | 山东大学 | Scene segmentation method and system |
CN111814736B (en) * | 2020-07-23 | 2023-12-29 | 上海东普信息科技有限公司 | Express waybill information identification method, device, equipment and storage medium |
CN111814736A (en) * | 2020-07-23 | 2020-10-23 | 上海东普信息科技有限公司 | Express bill information identification method, device, equipment and storage medium |
CN112800998A (en) * | 2021-02-05 | 2021-05-14 | 南京邮电大学 | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA |
CN112800998B (en) * | 2021-02-05 | 2022-07-29 | 南京邮电大学 | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA |
CN112819000A (en) * | 2021-02-24 | 2021-05-18 | 长春工业大学 | Street-scene image semantic segmentation system and method, electronic equipment and computer-readable medium |
WO2022188154A1 (en) * | 2021-03-12 | 2022-09-15 | 深圳市大疆创新科技有限公司 | Front view to top view semantic segmentation projection calibration parameter determination method and adaptive conversion method, image processing device, mobile platform, and storage medium |
CN113052189B (en) * | 2021-03-30 | 2022-04-29 | 电子科技大学 | Improved MobileNet V3 feature extraction network |
CN113052189A (en) * | 2021-03-30 | 2021-06-29 | 电子科技大学 | Improved MobileNet V3 feature extraction network |
CN113361505B (en) * | 2021-08-10 | 2021-12-07 | 杭州一知智能科技有限公司 | Signer-independent sign language translation method and system based on contrastive disentangled meta-learning |
CN113361505A (en) * | 2021-08-10 | 2021-09-07 | 杭州一知智能科技有限公司 | Signer-independent sign language translation method and system based on contrastive disentangled meta-learning |
CN113822921B (en) * | 2021-11-22 | 2022-03-04 | 四川大学 | Intelligent cephalometric measurement method for lateral cephalograms based on a deep neural network |
CN113822921A (en) * | 2021-11-22 | 2021-12-21 | 四川大学 | Intelligent cephalometric measurement method for lateral cephalograms based on a deep neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145983A (en) | A real-time scene image semantic segmentation method based on a lightweight network | |
CN109871532B (en) | Text theme extraction method and device and storage medium | |
CN109086722A (en) | Mixed license plate recognition method, device and electronic equipment | |
CN108197294A (en) | An automatic text generation method based on deep learning | |
CN109285162A (en) | An image semantic segmentation method based on local-region conditional random field models | |
CN110111334A (en) | A crack segmentation method, device, electronic equipment and storage medium | |
CN111325664B (en) | Style migration method and device, storage medium and electronic equipment | |
CN110084274A (en) | Real-time image semantic segmentation method and system, readable storage medium and terminal | |
CN111523546A (en) | Image semantic segmentation method, system and computer storage medium | |
CN105139041A (en) | Method and device for recognizing languages based on image | |
CN110210620A (en) | A channel pruning method for deep neural networks | |
CN106779055B (en) | Image feature extraction method and device | |
CN110287806A (en) | A traffic sign recognition method based on an improved SSD network | |
CN107506792A (en) | A semi-supervised salient object detection method | |
CN111143578A (en) | Method, device and processor for extracting event relation based on neural network | |
Sagar | Dmsanet: Dual multi scale attention network | |
CN103136757A (en) | SAR image segmentation method based on manifold distance two-stage clustering algorithm | |
CN112084911A (en) | Human face feature point positioning method and system based on global attention | |
CN112200310B (en) | Intelligent processor, data processing method and storage medium | |
CN116740362A (en) | Attention-based lightweight asymmetric scene semantic segmentation method and system | |
Li et al. | LPCCNet: A lightweight network for point cloud classification | |
Jia et al. | MobileNetV3 with CBAM for bamboo stick counting | |
CN114782720A (en) | Method, device, electronic device, medium, and program product for determining matching of document | |
CN115995029A (en) | Image emotion analysis method based on bidirectional connection | |
Si et al. | Image semantic segmentation based on improved DeepLab V3 model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190104 |