CN110110692A

CN110110692A - Real-time image semantic segmentation method based on a lightweight fully convolutional neural network

Info

Publication number: CN110110692A
Application number: CN201910410492.4A
Authority: CN
Inventors: 武港山; 沈佳凯
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-05-17
Filing date: 2019-05-17
Publication date: 2019-08-09

Abstract

The invention discloses a real-time image semantic segmentation method based on a lightweight fully convolutional neural network, comprising the following steps: 1) constructing a fully convolutional neural network using the design elements of lightweight neural networks: the network comprises three stages in total, namely a feature expansion stage, a feature processing stage and a comprehensive prediction stage, wherein the feature processing stage uses a multi-receptive-field feature fusion structure, a multi-size convolution fusion structure and a receptive field expansion structure; 2) training stage: training the network on a semantic segmentation data set, using the cross-entropy function as the loss function and the Adam algorithm as the parameter optimization algorithm, and applying an online hard example retraining strategy during training; 3) test stage: inputting a test image into the network to obtain the semantic segmentation result. By adjusting the network structure to suit the semantic segmentation task while keeping the model size under control, the invention obtains a high-accuracy real-time semantic segmentation method suitable for running on mobile platforms.

Description

Real-time image semantic segmentation method based on a lightweight fully convolutional neural network

Technical field

The invention belongs to the technical field of computer software and relates to image semantic segmentation, specifically to a real-time image semantic segmentation method based on a lightweight fully convolutional neural network.

Background art

Image semantic segmentation is a dense prediction and classification task that requires predicting the class label of every pixel of the input image. It is often used as a preliminary processing step for tasks such as scene recognition and automatic obstacle avoidance, and it is a popular research topic in computer vision. Since AlexNet stood out in the ImageNet competition in 2012, deep learning has been widely applied in computer vision. At present, deep learning methods also account for a large share of the semantic segmentation field. Most of them use fully convolutional neural networks and have gradually converged on a common encode-then-decode structure. In the encoding stage, deep features are extracted by convolution, and the feature map size is reduced by pooling or by strided convolution; in the decoding stage, the features are further analyzed by convolution, and the feature map size is gradually restored by upsampling operations such as deconvolution layers.

Because functions such as autonomous driving require it, semantic segmentation is also commonly deployed on mobile platforms such as unmanned aerial vehicles. On the one hand, the memory and computing power of mobile platforms are limited; on the other hand, practical tasks such as autonomous driving require semantic segmentation to run efficiently in real time. State-of-the-art deep learning methods have large models, and at run time they need memory and computing power that mobile hardware cannot provide. Some existing methods use a simplified neural network as the backbone for semantic segmentation, but they generally just reuse network structures designed for object classification, without adapting or adjusting the structure to the specific task of semantic segmentation, so their classification accuracy is unsatisfactory.

Summary of the invention

The problem to be solved by the invention is the following: under the dual requirement of hardware computing capability and real-time performance, existing semantic segmentation methods are either difficult to run on mobile platforms because of hardware limitations or, because of deficiencies in the methods themselves, achieve poor classification accuracy when run on mobile platforms.

The technical solution of the invention is a real-time image semantic segmentation method based on a lightweight fully convolutional neural network, in which the network is constructed following lightweight network design and adjusted to suit the semantic segmentation task, comprising the following steps:

1) Construct a fully convolutional neural network using the design elements of lightweight neural networks: the network comprises three stages in total, namely a feature expansion stage, a feature processing stage and a comprehensive prediction stage. The feature expansion stage pools early, quickly reducing the feature map size and increasing the number of feature channels. The feature processing stage extracts feature information, deepens the network and enlarges the convolutional receptive field to obtain rich feature representations. The comprehensive prediction stage classifies according to the features and restores the size to that of the original image. Every convolution layer in the network is followed by batch normalization and uses PReLU as the activation function. The feature processing stage uses a multi-receptive-field feature fusion structure, a multi-size convolution fusion structure and a receptive field expansion structure;

2) Training stage: train the constructed fully convolutional neural network on a semantic segmentation data set, using the cross-entropy function as the loss function and the Adam algorithm as the parameter optimization algorithm, and applying an online hard example retraining strategy during training;

3) Test stage: input a test image into the network to obtain the semantic segmentation result.

Preferably, in step 1):

The feature expansion stage is divided into two paths that extract features in parallel. One path uses two extraction modules connected in series; each extraction module applies, in order, one regular convolution layer and two depthwise separable convolution layers, where the regular convolution layer uses a 3*3 kernel with stride 2 and the depthwise separable convolution layers use a 3*3 kernel with stride 1. The other path extracts features by combining pooling and convolution. The features of the two paths are then concatenated and passed, in order, through a regular convolution layer, two depthwise separable convolution layers and a convolution layer with stride 2; the output enters the feature processing stage;

In the feature processing stage, the input features have 256 channels and a size of 1/8 of the original image. The stage is divided into two branches. The first branch uses a convolution layer with a 1*1 kernel and stride 1 to reduce the number of channels to 192, obtaining the feature map set F1, which then passes through μ fusion modules in sequence, each fusion module consisting of a multi-size convolution fusion structure and a multi-receptive-field feature fusion structure connected in series, to obtain the feature map set F2; F1 and F2 are merged by element-wise addition to obtain the feature map set F3, which is then passed through λ receptive field expansion structures in sequence to obtain the output feature maps F4. The other branch applies a regular convolution layer with a 1*1 kernel to obtain the 64-channel feature map set R1. R1 is concatenated with the output F4 of the first branch to obtain the 256-channel output features;

The comprehensive prediction stage uses a 1*1 convolution to exchange information among the 256 channels of the features output by the feature processing stage, and then uses a 3*3 convolution layer to obtain small-size classification results, where the number of output channels of this convolution layer equals the number of class labels in the data set and the size is 1/8 of the original image; finally, an upsampling layer enlarges the result to the original image size by linear interpolation.

Further, the multi-receptive-field feature fusion structure in step 1) consists of two feature processing modules A connected in series. Feature processing module A consists of a first convolution layer, a narrow-receptive-field convolution layer, a wide-receptive-field convolution layer and a feature fusion convolution layer connected in sequence; each convolution layer applies, in order, the convolution operation, batch normalization and the PReLU activation function. The first convolution layer uses a 1*1 kernel, and its number of output channels is 1/2 of its number of input channels; the resulting feature map set is denoted P1. The narrow-receptive-field and wide-receptive-field convolution layers both take P1 as input, have as many output channels as input channels, and use depthwise separable convolution; the narrow-receptive-field layer uses a 3*3 kernel, and the wide-receptive-field layer uses a 3*3 dilated kernel with dilation rate 2. The outputs of the narrow-receptive-field and wide-receptive-field layers are then concatenated, and the concatenated features are fed into the feature fusion convolution layer, which uses a 1*1 kernel and has as many output channels as input channels. Processing by the two feature processing modules A yields the feature map set P2; finally, the input of the structure is added to P2 to obtain the output feature map set.

Further, the multi-size convolution fusion structure in step 1) consists of two feature processing modules B connected in series. Feature processing module B consists of a first convolution layer, a multi-size convolution layer and a feature fusion convolution layer connected in sequence. The multi-size convolution layer receives the output of the first convolution layer and applies depthwise separable convolution along three paths, using 1*1, 3*3 and 5*5 kernels respectively; the outputs of the three paths are then concatenated and fed into the feature fusion convolution layer, which uses a 1*1 kernel. Finally, the input of the structure is added to the output of the two feature processing modules B to obtain the output feature map set.

Further, the receptive field expansion structure in step 1) consists of a first convolution layer, a receptive field expansion layer and a feature fusion convolution layer connected in sequence. The receptive field expansion layer takes the output of the first convolution layer as input and applies depthwise separable convolution along two paths: the first path applies depthwise separable convolutions with a 1*7 kernel and then a 7*1 kernel, and the second path applies the convolution process of the first path twice in succession. The feature maps obtained from the two paths are then concatenated and fed into the feature fusion convolution layer, whose output is the output feature map set.

Compared with the prior art, the invention has the following advantages:

The invention adopts lightweight network design, so that the model size is less than 1M, which greatly reduces run-time memory usage and the amount of computation.

To suit the semantic segmentation task, three structures are proposed that make better use of contextual information for classification prediction, ultimately yielding more accurate semantic segmentation results.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of the method of the invention.

Fig. 2 shows the multi-receptive-field feature fusion structure of the invention.

Fig. 3 shows the multi-size convolution fusion structure of the invention.

Fig. 4 shows the receptive field expansion structure of the invention.

Fig. 5 is a schematic diagram of the network processing structure of the feature processing stage and the comprehensive prediction stage of the invention.

Fig. 6 shows a semantic segmentation example obtained with the invention on the CamVid data set: (a) is the original image, (b) is the semantic segmentation result.

Fig. 7 shows a semantic segmentation example obtained with the invention on the Cityscapes data set: (a) is the original image, (b) is the semantic segmentation result.

Detailed description of the embodiments

The invention proposes a real-time image semantic segmentation method based on a lightweight fully convolutional neural network. It proposes three network module structures adapted to the semantic segmentation task. Trained and tested on two data sets, CamVid and Cityscapes, the model not only keeps its size under control but also obtains high-accuracy semantic segmentation results at real-time prediction speed.

The implementation steps of the invention are as follows:

1) Construct a fully convolutional neural network using the design elements of lightweight neural networks; examples of lightweight network models are MobileNetV2 and ShuffleNetV2. The network designed in the invention comprises three stages in total, namely a feature expansion stage, a feature processing stage and a comprehensive prediction stage, as shown in Fig. 1.

The feature expansion stage is divided into two paths that extract features in parallel. One path uses two extraction modules connected in series; each extraction module applies, in order, one regular convolution layer and two depthwise separable convolution layers, where the regular convolution layer uses a 3*3 kernel with stride 2 and the depthwise separable convolution layers use a 3*3 kernel with stride 1. The other path extracts features by combining pooling and convolution. The features of the two paths are then concatenated and passed, in order, through a regular convolution layer, two depthwise separable convolution layers and a convolution layer with stride 2; the resulting output enters the feature processing stage. The main function of the feature expansion stage is to pool early, quickly reducing the feature map size (to 1/8 of the original size) and increasing the number of feature channels.
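The reduction to 1/8 of the original size can be checked with the standard convolution output-size formula. The following is a minimal sketch, not the patented implementation: the padding of 1, the 3*3 kernel and stride 1 for the layers after concatenation, and the helper names `conv_out` and `expansion_stage_size` are assumptions, and the pooling path is not modeled.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int, dilation: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    effective = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective) // stride + 1


def expansion_stage_size(size: int) -> int:
    """Follow one spatial axis through the convolution path of the feature expansion stage."""
    # Two extraction modules: one stride-2 3*3 convolution followed by two
    # stride-1 3*3 depthwise separable convolutions (padding 1 is assumed).
    for _ in range(2):
        size = conv_out(size, kernel=3, stride=2, padding=1)
        for _ in range(2):
            size = conv_out(size, kernel=3, stride=1, padding=1)
    # After concatenation: a regular convolution and two depthwise separable
    # convolutions (3*3, stride 1 assumed), then one stride-2 convolution.
    for _ in range(3):
        size = conv_out(size, kernel=3, stride=1, padding=1)
    return conv_out(size, kernel=3, stride=2, padding=1)


for side in (360, 800, 1024):
    print(side, "->", expansion_stage_size(side))
```

Under these assumptions the three stride-2 convolutions each halve the size, which is where the factor of 1/8 comes from.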

In the feature processing stage, the input features have 256 channels and a size of 1/8 of the original image. The stage is divided into two branches, with the structure shown in Fig. 5. The first branch uses a convolution layer with a 1*1 kernel and stride 1 to reduce the number of channels to 192, obtaining the feature map set F1, which then passes through μ fusion modules in sequence, each fusion module consisting of a multi-size convolution fusion structure and a multi-receptive-field feature fusion structure connected in series, to obtain the feature map set F2. F1 and F2 are merged by element-wise addition to obtain the feature map set F3, which is then passed through λ receptive field expansion structures in sequence to obtain the output feature maps F4. The other branch applies a regular convolution layer with a 1*1 kernel to obtain the 64-channel feature map set R1. Finally, R1 is concatenated with the output F4 of the first branch to obtain the 256-channel output features.
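The channel bookkeeping of the two branches can be traced with a few lines of code. This is only an illustrative sketch: the function name `feature_processing_channels` is hypothetical, and it assumes that the fusion modules and the receptive field expansion structures keep the channel count unchanged, which the element-wise addition of F1 and F2 requires for the fusion modules.

```python
def feature_processing_channels(in_ch=256, main_ch=192, bypass_ch=64, mu=8, lam=2):
    """Trace channel counts through the two branches of the feature processing stage."""
    f1 = main_ch    # main branch: 1*1 convolution, stride 1, 256 -> 192 channels
    f2 = f1         # mu fusion modules (assumed to keep the channel count)
    # Element-wise addition only works when both operands have the same shape.
    assert f1 == f2
    f3 = f1         # F3 = F1 + F2
    f4 = f3         # lam receptive field expansion structures (assumed channel-preserving)
    r1 = bypass_ch  # bypass branch: 1*1 regular convolution to 64 channels
    return {
        "input": in_ch,
        "F1": f1, "F2": f2, "F3": f3, "F4": f4, "R1": r1,
        "output": f4 + r1,           # channel concatenation of F4 and R1
        "structures": 2 * mu + lam,  # each fusion module holds two structures
    }


print(feature_processing_channels())
```

The concatenation of the 192-channel main branch with the 64-channel bypass branch restores the 256 channels that the stage received as input.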

The structure of the comprehensive prediction stage is shown in Fig. 5. A 1*1 convolution exchanges information among the 256 channels of the features output by the feature processing stage, and a 3*3 convolution layer then produces small-size classification results, where the number of channels n equals the number of class labels in the data set and the size is 1/8 of the original image. Finally, an upsampling layer enlarges the result to the original image size by linear interpolation.
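The final enlargement step can be illustrated on a single score map. The sketch below uses bilinear interpolation with half-pixel centers and clamped borders; the text only says "linear interpolation", so the exact convention and the function name `upsample_bilinear` are assumptions.

```python
def upsample_bilinear(grid, factor):
    """Enlarge a 2-D grid of scores by an integer factor with bilinear interpolation."""
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(h * factor):
        # Map the output pixel center back to source coordinates and clamp.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


scores = [[0.0, 1.0], [2.0, 3.0]]      # one class channel at 1/8 resolution
full = upsample_bilinear(scores, 8)    # enlarged by a factor of 8
print(len(full), len(full[0]))
```

In a full pipeline each of the n class channels would be enlarged this way, and the per-pixel label would be the channel with the highest score.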

The multi-receptive-field feature fusion structure, the multi-size convolution fusion structure and the receptive field expansion structure used in the feature processing stage of the invention are shown in Figs. 2, 3 and 4 respectively.

The multi-receptive-field feature fusion structure consists of two feature processing modules A connected in series, as shown in Fig. 2. Feature processing module A consists of a first convolution layer, a narrow-receptive-field convolution layer, a wide-receptive-field convolution layer and a feature fusion convolution layer connected in sequence; each convolution layer applies, in order, the convolution operation, batch normalization and the PReLU activation function. The first convolution layer uses a 1*1 kernel, and its number of output channels is 1/2 of its number of input channels; the resulting feature map set is denoted P1. The narrow-receptive-field and wide-receptive-field convolution layers both take P1 as input, have as many output channels as input channels, and use depthwise separable convolution; the narrow-receptive-field layer uses a 3*3 kernel, and the wide-receptive-field layer uses a 3*3 dilated kernel with dilation rate 2. The outputs of the narrow-receptive-field and wide-receptive-field layers are then concatenated, and the concatenated features are fed into the feature fusion convolution layer, which uses a 1*1 kernel and has as many output channels as input channels. Processing by the two feature processing modules A yields the feature map set P2; finally, the input of the structure is added to P2 to obtain the output feature map set.
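Two properties of module A are easy to verify numerically: the dilated 3*3 kernel covers a wider span than the plain one, and the halve-then-concatenate design returns to the input channel count so that the final residual addition is possible. The sketch below is illustrative; the function names are hypothetical, and the "same" padding rule is an assumption, since the text does not state the padding.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Span covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def same_padding(kernel: int, dilation: int) -> int:
    """Padding that keeps the spatial size at stride 1 (odd kernels, assumed)."""
    return (effective_kernel(kernel, dilation) - 1) // 2


def module_a_channels(c_in: int) -> dict:
    """Channel counts inside one feature processing module A."""
    p1 = c_in // 2          # first 1*1 convolution halves the channels
    narrow = p1             # 3*3 depthwise separable, output = input channels
    wide = p1               # 3*3 dilated (rate 2) depthwise separable
    concat = narrow + wide  # concatenation of the two receptive-field paths
    fused = concat          # 1*1 fusion convolution, output = input channels
    return {"P1": p1, "narrow": narrow, "wide": wide, "concat": concat, "fused": fused}


print(effective_kernel(3, 1), effective_kernel(3, 2))
print(module_a_channels(192))
```

With a 192-channel input, both paths work on 96 channels and the fused output is again 192 channels, matching the input of the structure.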

The multi-size convolution fusion structure consists of two feature processing modules B connected in series, as shown in Fig. 3. Feature processing module B consists of a first convolution layer, a multi-size convolution layer and a feature fusion convolution layer connected in sequence. The multi-size convolution layer receives the output of the first convolution layer and applies depthwise separable convolution along three paths, using 1*1, 3*3 and 5*5 kernels respectively; the outputs of the three paths are then concatenated and fed into the feature fusion convolution layer, which uses a 1*1 kernel. Finally, the input of the structure is added to the output of the two feature processing modules B to obtain the output feature map set.
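The reason depthwise separable convolution suits a structure with several kernel sizes can be seen from a parameter count. The sketch below compares a regular convolution with a depthwise-plus-pointwise pair; the 96-channel width is a hypothetical value chosen for illustration, biases and normalization parameters are ignored, and the function names are not from the source.

```python
def regular_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a regular k*k convolution (no bias)."""
    return c_in * c_out * k * k


def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k*k convolution followed by a pointwise 1*1 convolution."""
    return c_in * k * k + c_in * c_out


channels = 96  # hypothetical channel count, not stated in the source
for k in (1, 3, 5):
    print(k,
          regular_conv_params(channels, channels, k),
          separable_conv_params(channels, channels, k))
```

The cost of the depthwise part grows with the kernel area only once per channel, so the 5*5 path is barely more expensive than the 3*3 path, whereas a regular 5*5 convolution would be several times larger.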

The receptive field expansion structure, shown in Fig. 4, consists of a first convolution layer, a receptive field expansion layer and a feature fusion convolution layer connected in sequence. The receptive field is defined as the region of the input image that a convolutional neural network feature can see; in other words, the feature output is influenced by the pixels within the receptive field region. The receptive field expansion layer takes the output of the first convolution layer as input and applies depthwise separable convolution along two paths: the first path applies depthwise separable convolutions with a 1*7 kernel and then a 7*1 kernel, and the second path applies the convolution process of the first path twice in succession. The feature maps obtained from the two paths are then concatenated and fed into the feature fusion convolution layer, whose output is the output feature map set.
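The effect of the two paths on the receptive field can be computed directly. For stride-1, undilated layers stacked in sequence, each layer adds (kernel size - 1) to the receptive field along each axis. The sketch below applies that rule; the stride-1 assumption and the function name `stacked_receptive_field` are not stated in the source.

```python
def stacked_receptive_field(kernels):
    """Receptive field (height, width) of stacked stride-1, undilated convolutions.

    kernels: list of (kernel_height, kernel_width) pairs in application order.
    """
    rf_h = rf_w = 1
    for kh, kw in kernels:
        rf_h += kh - 1
        rf_w += kw - 1
    return rf_h, rf_w


path1 = [(1, 7), (7, 1)]  # first path: 1*7 then 7*1
path2 = path1 * 2         # second path: the same process applied twice

print(stacked_receptive_field(path1), stacked_receptive_field(path2))

# Depthwise weights per channel: the factorized pair versus a full 7*7 kernel.
print(1 * 7 + 7 * 1, 7 * 7)
```

Under these assumptions the first path sees the same region as a 7*7 kernel while using far fewer depthwise weights per channel, and the second path widens the region further.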

2) Train the network separately on the CamVid and Cityscapes data sets. During training, the image data is preprocessed and augmented, and fixed-size images are then fed into the network. The network uses the cross-entropy function as the loss function and the Adam algorithm as the parameter optimization algorithm, and applies an online hard example retraining strategy during training.
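The source does not spell out how hard examples are selected. One common variant, often called online hard example mining, averages the loss over only the pixels with the largest losses in each batch. The sketch below shows that variant; the `keep_ratio` and `min_kept` parameters and the function name are assumptions, not values from the source.

```python
def ohem_mean_loss(pixel_losses, keep_ratio=0.25, min_kept=1):
    """Average the loss over the hardest pixels only (assumed top-k variant)."""
    if not pixel_losses:
        return 0.0
    # Keep at least min_kept pixels, otherwise a fixed fraction of all pixels.
    k = max(min_kept, int(len(pixel_losses) * keep_ratio))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / len(hardest)


losses = [0.1, 2.0, 0.3, 1.0]
print(ohem_mean_loss(losses, keep_ratio=0.5))
```

Because easy pixels contribute nothing to the averaged loss, the gradient is concentrated on the pixels the network currently classifies worst.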

Image data is preprocessed by zero-mean normalization: the pixel mean and standard deviation of each channel are computed over the training set, and each channel is normalized as shown in formula (1), where x is a pixel value in the channel, x' is the normalized value, μ is the channel mean and σ is the standard deviation of that channel.

$$x' = \frac{x - \mu}{\sigma} \tag{1}$$
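Per-channel normalization of this kind can be sketched with the standard library alone. The data layout (each image as a flat list of pixel tuples) and the function names `channel_stats` and `normalize` are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def channel_stats(images):
    """Per-channel mean and standard deviation over a training set.

    images: list of images, each a list of pixels, each pixel a tuple of channel values.
    """
    n_channels = len(images[0][0])
    columns = [[px[c] for img in images for px in img] for c in range(n_channels)]
    return [fmean(col) for col in columns], [pstdev(col) for col in columns]


def normalize(image, mean, std):
    """Subtract the channel mean and divide by the channel standard deviation."""
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    return [tuple((v - m) / (s or 1.0) for v, m, s in zip(px, mean, std)) for px in image]


train = [[(0, 10), (2, 30)]]  # one tiny two-channel image
mean, std = channel_stats(train)
print(mean, std, normalize(train[0], mean, std))
```

The statistics are computed once over the training set and then reused unchanged for test images.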

Data augmentation expands the data set by applying transformations such as affine transformations to the existing data; more diverse data lets the network learn more data characteristics and generalize better. The augmentation randomly scales each image by a factor in [0.5, 2] and randomly mirrors the data horizontally or vertically. The result is then cropped; where it is too small, the original image is padded with zeros and the label map with the ignore label, so that fixed-size samples are obtained and fed into the network for training.
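The mirroring and pad-then-crop steps can be sketched on small nested lists. The ignore-label value of 255, the single-channel image layout and the function names are assumptions; the random rescaling by a factor drawn from [0.5, 2] is omitted for brevity.

```python
import random

IGNORE_LABEL = 255  # assumed value; the source only says "ignore label"


def random_flip(image, label, rng):
    """Mirror image and label map together, horizontally and/or vertically."""
    if rng.random() < 0.5:  # horizontal mirror
        image = [row[::-1] for row in image]
        label = [row[::-1] for row in label]
    if rng.random() < 0.5:  # vertical mirror
        image, label = image[::-1], label[::-1]
    return image, label


def pad_and_crop(image, label, size, rng):
    """Pad to at least size*size (zeros / ignore label), then take a random crop."""
    h, w = len(image), len(image[0])
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    image = [row + [0] * pad_w for row in image] + [[0] * (w + pad_w) for _ in range(pad_h)]
    label = ([row + [IGNORE_LABEL] * pad_w for row in label]
             + [[IGNORE_LABEL] * (w + pad_w) for _ in range(pad_h)])
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)

    def crop(grid):
        return [row[left:left + size] for row in grid[top:top + size]]

    return crop(image), crop(label)


rng = random.Random(0)
img, lab = random_flip([[1, 2, 3], [4, 5, 6]], [[0, 1, 0], [1, 1, 0]], rng)
img, lab = pad_and_crop(img, lab, 4, rng)
print(img)
print(lab)
```

Padded label pixels carry the ignore label so that they can be excluded from the loss, while the image and label map always receive the same geometric transformation.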

In addition, because the number of samples of each class in the data set is unbalanced, the network is trained with a weighted cross-entropy as the loss function. The proportion of each class in the training set is counted, and the class weights used in the cross-entropy computation are set according to these proportions, to cope with the unbalanced sample distribution.
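A class-weighted cross-entropy can be sketched as follows. The source does not give the formula that maps class proportions to weights, so the inverse-frequency scheme below, the ignore-label value of 255 and the function names are all assumptions.

```python
import math


def class_weights(pixel_counts):
    """Inverse-frequency class weights rescaled to mean 1 (assumed scheme)."""
    total = sum(pixel_counts)
    inverse = [total / max(count, 1) for count in pixel_counts]
    mean = sum(inverse) / len(inverse)
    return [v / mean for v in inverse]


def weighted_cross_entropy(logits, targets, weights, ignore_label=255):
    """Weighted mean of per-pixel cross-entropy; ignored pixels are skipped.

    logits: per-pixel list of class scores; targets: per-pixel class index.
    """
    numerator = denominator = 0.0
    for scores, target in zip(logits, targets):
        if target == ignore_label:
            continue
        peak = max(scores)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
        numerator += weights[target] * (log_sum - scores[target])
        denominator += weights[target]
    return numerator / denominator if denominator else 0.0


w = class_weights([900, 100])
print(w)
print(weighted_cross_entropy([[2.0, 0.0], [0.0, 1.0]], [0, 1], w))
```

With this scheme a mistake on a rare class costs more than a mistake on a frequent one, which counteracts the dominance of large classes such as road or sky.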

The following experimental settings are used. The network structure hyperparameters are μ=8 and λ=2. The Adam parameter optimization algorithm is used with an initial learning rate of 1e-3, beta=(0.9, 0.999) and a weight decay coefficient of 4e-4. The training image size is 800*800 when training on Cityscapes and 360*360 when training on CamVid.
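For reference, a single Adam update with these hyperparameters can be written out for one scalar parameter. This is a textbook sketch, not the authors' training code: applying weight decay as an L2 term added to the gradient, the epsilon value of 1e-8 and the function name `adam_step` are assumptions.

```python
import math


def adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
              weight_decay=4e-4, eps=1e-8):
    """One Adam update for a scalar parameter; state holds t, m and v."""
    grad = grad + weight_decay * param  # L2-style weight decay (assumed coupling)
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad * grad
    m_hat = state["m"] / (1 - betas[0] ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - betas[1] ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0}
weight = 1.0
for _ in range(3):
    weight = adam_step(weight, grad=0.5, state=state)
print(weight)
```

On the first step the bias-corrected moments make the update size close to the learning rate regardless of the gradient magnitude, which is why the initial learning rate sets the early step size.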

Convolution layer parameters are initialized with He (Kaiming) initialization, so that at the start of training the variance of each layer's input and output stays consistent, which constrains the gradients and speeds up convergence.
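He initialization draws weights with a standard deviation derived from the fan-in of the layer. The sketch below uses the normal-distribution variant with the ReLU gain sqrt(2); the source does not say whether the normal or uniform variant was used or whether the PReLU slope was taken into account, so these choices and the function names are assumptions.

```python
import math
import random


def he_std(c_in: int, kernel_h: int, kernel_w: int, groups: int = 1) -> float:
    """Standard deviation of He-normal weights: sqrt(2 / fan_in)."""
    fan_in = (c_in // groups) * kernel_h * kernel_w
    return math.sqrt(2.0 / fan_in)


def he_normal_weights(c_out, c_in, kernel_h, kernel_w, groups=1, seed=0):
    """Flat list of He-normal weights for one convolution layer."""
    rng = random.Random(seed)
    std = he_std(c_in, kernel_h, kernel_w, groups)
    count = c_out * (c_in // groups) * kernel_h * kernel_w
    return [rng.gauss(0.0, std) for _ in range(count)]


print(he_std(256, 1, 1))              # 1*1 convolution over 256 channels
print(he_std(96, 3, 3, groups=96))    # 3*3 depthwise convolution (one input channel per filter)
print(len(he_normal_weights(64, 64, 3, 3)))
```

For a depthwise convolution the fan-in counts only the kernel area, because each filter sees a single input channel.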

3) Test stage: the test image is zero-mean normalized and then input into the network to obtain the semantic segmentation result.

The trained network was performance-tested to illustrate the effect of the invention.

On the CamVid data set, the model has about 800,000 parameters (0.8M in total) and occupies about 292M of memory at run time. With an input size of 480*360, the prediction time for a single frame is about 10 ms, giving a frame rate of 100 FPS, which satisfies the real-time requirement. In terms of prediction accuracy, the mean intersection over union (mIoU) of the invention reaches 65.67%. With a comparable number of parameters, the invention reduces the prediction time by nearly a factor of ten compared with the efficient neural network ENet and improves the prediction accuracy by 14.4%. Compared with the semantic segmentation network BiSeNet, the prediction accuracy of the invention is about 3% lower, but the model size is 1/20 of that of BiSeNet, and the prediction speed is also higher than that of BiSeNet. The visualized test results of the invention are shown in Fig. 6.
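The mIoU metric quoted here is conventionally computed from a confusion matrix accumulated over all test pixels. The sketch below follows that convention; the function name `mean_iou` and the choice to skip classes that never occur are assumptions about details the source does not specify.

```python
def mean_iou(confusion):
    """Mean intersection over union from a square confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                         # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp    # predicted c, true other
        union = tp + fp + fn
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


print(mean_iou([[3, 1], [1, 3]]))
```

Each class contributes equally to the mean regardless of how many pixels it covers, so small classes affect the score as much as large ones.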

On the Cityscapes data set, the model has about 800,000 parameters (0.82M in total) and occupies about 317M of memory at run time. With an input size of 2048*1024, the prediction time for a single frame is about 20 ms, giving a frame rate of 50 FPS. In terms of prediction accuracy, the mIoU of the invention reaches 65.68%. Among models of the same order of magnitude, the mIoU of the invention is the best whether it is evaluated by class or by category. Compared with BiSeNet, the prediction accuracy of the invention remains comparable even though the model is about 20 times smaller. The visualized test results of the invention are shown in Fig. 7.

Claims

1. A real-time image semantic segmentation method based on a lightweight fully convolutional network, characterized by comprising the following steps:

1) construct a fully convolutional neural network using the design elements of lightweight neural networks: the network comprises three stages in total, namely a feature expansion stage, a feature processing stage and a comprehensive prediction stage; the feature expansion stage pools early, quickly reducing the feature map size and increasing the number of feature channels; the feature processing stage extracts feature information, deepens the network and enlarges the convolutional receptive field to obtain rich feature representations; the comprehensive prediction stage performs classification prediction according to the features and restores the feature map size to the original image size; every convolution layer in the network is followed by batch normalization and uses PReLU as the activation function; the feature processing stage uses a multi-receptive-field feature fusion structure, a multi-size convolution fusion structure and a receptive field expansion structure;

2) training stage: train the constructed fully convolutional neural network on a semantic segmentation data set, using the cross-entropy function as the loss function and the Adam algorithm as the parameter optimization algorithm, and applying an online hard example retraining strategy during training;

2. The real-time image semantic segmentation method based on a lightweight fully convolutional network according to claim 1, characterized in that, in step 1),

the feature expansion stage is divided into two paths that extract features in parallel; one path uses two extraction modules connected in series, each extraction module applying, in order, one regular convolution layer and two depthwise separable convolution layers, where the regular convolution layer uses a 3*3 kernel with stride 2 and the depthwise separable convolution layers use a 3*3 kernel with stride 1; the other path extracts features by combining pooling and convolution; the features of the two paths are then concatenated and passed, in order, through a regular convolution layer, two depthwise separable convolution layers and a convolution layer with stride 2, and the output enters the feature processing stage;

in the feature processing stage, the input features have 256 channels and a size of 1/8 of the original image, and the stage is divided into two branches: the first branch uses a convolution layer with a 1*1 kernel and stride 1 to reduce the number of channels to 192, obtaining the feature map set F1, which then passes through μ fusion modules in sequence, each fusion module consisting of a multi-size convolution fusion structure and a multi-receptive-field feature fusion structure connected in series, to obtain the feature map set F2; F1 and F2 are merged by element-wise addition to obtain the feature map set F3, which is then passed through λ receptive field expansion structures in sequence to obtain the output feature maps F4; the other branch applies a regular convolution layer with a 1*1 kernel to obtain the 64-channel feature map set R1, and R1 is concatenated with the output F4 of the first branch to obtain the 256-channel output features;

the comprehensive prediction stage uses a 1*1 convolution to exchange information among the 256 channels of the features output by the feature processing stage, and then uses a 3*3 convolution layer to obtain small-size classification results, where the number of output channels of this convolution layer equals the number of class labels in the data set and the size is 1/8 of the original image; finally, an upsampling layer enlarges the result to the original image size by linear interpolation.

3. The real-time image semantic segmentation method based on a lightweight fully convolutional network according to claim 1, characterized in that the multi-receptive-field feature fusion structure in step 1) consists of two feature processing modules A connected in series; feature processing module A consists of a first convolution layer, a narrow-receptive-field convolution layer, a wide-receptive-field convolution layer and a feature fusion convolution layer connected in sequence, each convolution layer applying, in order, the convolution operation, batch normalization and the PReLU activation function; the first convolution layer uses a 1*1 kernel, and its number of output channels is 1/2 of its number of input channels, the resulting feature map set being denoted P1; the narrow-receptive-field and wide-receptive-field convolution layers both take P1 as input, have as many output channels as input channels, and use depthwise separable convolution, the narrow-receptive-field layer using a 3*3 kernel and the wide-receptive-field layer using a 3*3 dilated kernel with dilation rate 2; the outputs of the narrow-receptive-field and wide-receptive-field layers are then concatenated, and the concatenated features are fed into the feature fusion convolution layer, which uses a 1*1 kernel and has as many output channels as input channels; processing by the two feature processing modules A yields the feature map set P2, and finally the input of the structure is added to P2 to obtain the output feature map set.

4. The real-time image semantic segmentation method based on a lightweight fully convolutional network according to claim 1, characterized in that the multi-size convolution fusion structure in step 1) consists of two feature processing modules B connected in series; feature processing module B consists of a first convolution layer, a multi-size convolution layer and a feature fusion convolution layer connected in sequence, wherein the multi-size convolution layer receives the output of the first convolution layer and applies depthwise separable convolution along three paths, using 1*1, 3*3 and 5*5 kernels respectively; the outputs of the three paths are then concatenated and fed into the feature fusion convolution layer, which uses a 1*1 kernel; finally, the input of the structure is added to the output of the two feature processing modules B to obtain the output feature map set.

5. The real-time image semantic segmentation method based on a lightweight fully convolutional network according to claim 1, characterized in that the receptive field expansion structure in step 1) consists of a first convolution layer, a receptive field expansion layer and a feature fusion convolution layer connected in sequence; the receptive field expansion layer takes the output of the first convolution layer as input and applies depthwise separable convolution along two paths, the first path applying depthwise separable convolutions with a 1*7 kernel and then a 7*1 kernel, and the second path applying the convolution process of the first path twice in succession; the feature maps obtained from the two paths are then concatenated and fed into the feature fusion convolution layer, whose output is the output feature map set.