CN110210485A - Image semantic segmentation method based on attention-mechanism-guided feature fusion - Google Patents
Image semantic segmentation method based on attention-mechanism-guided feature fusion
- Publication number
- CN110210485A (application CN201910391452.XA)
- Authority
- CN
- China
- Prior art keywords
- semantic
- layer
- feature
- fusion
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The present invention discloses an image semantic segmentation method based on attention-mechanism-guided feature fusion, comprising the following steps: (10) encoder base network construction: an improved ResNet-101 is used to generate a series of features ranging from high-resolution, low-semantic to low-resolution, high-semantic; (20) decoder feature-fusion module construction: a pyramid-style module built on three convolution operations extracts high-level semantics with strong consistency constraints, which are then fused layer by layer, with weighting, into the low-level stage features to obtain a preliminary segmentation heat map; (30) auxiliary loss function construction: additional auxiliary supervision is applied to each fusion output of the decoding stage and superimposed on the main loss after the heat map is upsampled, strengthening the supervised training of the model and yielding the semantic segmentation map. The image semantic segmentation method based on attention-mechanism-guided feature fusion of the invention achieves high accuracy and clear boundary contours.
Description
Technical field
The invention belongs to the field of still-image recognition technology, and in particular concerns an image semantic segmentation method based on attention-mechanism-guided feature fusion that achieves high accuracy and clear boundary contours.
Background technique
Semantic segmentation, i.e. pixel-level image understanding, is one of the important cornerstones of computer vision and has a wide range of application scenarios. Through fine-grained partitioning it gives machines the ability to distinguish the different regions of a visual scene at the pixel level. Semantic segmentation groups together the pixel regions in an image that belong to the same object, thereby extending its field of application.
Semantic segmentation combines two problems, object classification and object localization, solving both with a single pixel-level prediction. How to balance the mutual constraints between high-level abstract object classification and accurate low-level object localization is the key problem current semantic segmentation methods face. Semantic segmentation methods fall roughly into two classes. The first generates the semantics of each object in an image from manually extracted features; it generally requires careful feature engineering, after which the features are fed to a classifier that performs pixel-level classification. The second is based on deep learning: an end-to-end system combines feature extraction and classification to assign a semantic label directly to each pixel.
Most traditional methods rely on manually extracted features combined with a machine-learning classifier, such as the Boost method of Shotton et al., the random forests of Johnson et al., and the support vector machines of Soatto et al. These methods made substantive progress by integrating rich contextual information with structured prediction techniques. However, limited by the representational power of hand-crafted features, the performance of image semantic segmentation systems based on traditional machine learning gradually saturated; they could not break through the bottleneck, and considerable room for improvement in segmentation accuracy remained.
In recent years the deep learning revolution has transformed related fields, and many computer vision problems, including semantic segmentation, are now addressed with deep architectures. The fully convolutional network method, built on deep convolutional neural networks, replaces fully connected layers with convolutional layers and applies the resulting fully convolutional network to semantic segmentation, producing dense pixel-wise label outputs and achieving higher segmentation precision. Zhao et al. proposed the pyramid scene parsing network, which uses a pyramid pooling module to aggregate context from regions of different sizes and exploit global contextual information, effectively producing high-quality segmentation results from a global prior. Li et al. first classify the easy regions in a shallow stage and let the deeper stages focus on a few difficult regions, achieving adaptive learning targeted at hard samples and ultimately improving segmentation performance. Lin et al. proposed a generic multi-path refinement network that explicitly exploits all the information available during downsampling to enable high-resolution pixel-level prediction through long-range residual connections.
However, the prior art still suffers from two main problems with respect to segmentation quality:
1. In image semantic segmentation based on deep fully convolutional networks, the repeated combination of convolution, max pooling and downsampling during feature extraction progressively reduces feature resolution and loses contextual information. Segmentation results therefore exhibit semantic inconsistencies such as misrecognition of local regions of objects with complex appearance and of small objects among multi-scale objects;
2. Much of the success of convolutional networks is attributed to their built-in invariance to local image transformations, which strengthens the network's ability to learn hierarchical abstractions; this is exactly what high-level visual tasks such as object classification require. But semantic segmentation must also resolve spatial details such as object boundary contours while solving the classification problem, and a task of simple pixel classification often yields segmentation results in which object boundary contours are blurred.
Summary of the invention
The purpose of the present invention is to provide an image semantic segmentation method based on attention-mechanism-guided feature fusion that achieves high accuracy and clear boundary contours.
The technical solution realizing the purpose of the invention is as follows:
An image semantic segmentation method based on attention-mechanism-guided feature fusion, comprising the following steps:
(10) Encoder base network construction: an improved ResNet-101 is used to generate a series of features ranging from high-resolution, low-semantic to low-resolution, high-semantic;
(20) Decoder feature-fusion module construction: a pyramid-style module built on three convolution operations extracts high-level semantics with strong consistency constraints, which are then fused layer by layer, with weighting, into the low-level stage features to obtain a preliminary segmentation heat map;
(30) Auxiliary loss function construction: additional auxiliary supervision is applied to each fusion output of the decoding stage and superimposed on the main loss after the heat map is upsampled, strengthening the supervised training of the model and yielding the semantic segmentation map.
Compared with the prior art, the present invention has the following advantages:
1. High accuracy: the method fuses features of three different scales through an end high-level semantic information extraction module resembling a pyramid structure, and additionally introduces a global pooling branch connected to the output feature for subsequent processing. The contextual information is multiplied with the original feature after a simple convolution operation, so strongly consistent semantic features can be captured without introducing much extra computation, reducing the probability of misidentifying local regions of objects;
2. Clear boundary contours: exploiting the fact that, between adjacent features, the high-level feature carries more semantic information while the low-level feature carries more spatial detail, the invention first concatenates the two hierarchy features to generate a channel attention vector, which is used as a weight to select the most discriminative information in the low-level feature. The strong semantic consistency constraint of the high-level feature guides and refines its fusion with the low-level feature, capturing rich context and ultimately refining object segmentation boundaries. Hierarchy features are thus fused more effectively to restore object edge details in the segmentation map, reducing boundary blurring.
Detailed description of the invention
Fig. 1 is the main flow chart of the image semantic segmentation method based on attention-mechanism-guided feature fusion of the present invention.
Fig. 2 is the flow chart of the encoder base network construction step in Fig. 1.
Fig. 3 is the flow chart of the decoder feature-fusion module construction step in Fig. 1.
Fig. 4 is an example of the end high-level semantic information extraction module.
Fig. 5 is an example of the attention-mechanism-guided feature fusion module.
Specific embodiment
As shown in Fig. 1, the image semantic segmentation method based on attention-mechanism-guided feature fusion of the present invention comprises the following steps:
(10) Encoder base network construction: an improved ResNet-101 is used to generate a series of features ranging from high-resolution, low-semantic to low-resolution, high-semantic;
As shown in Fig. 2, the encoder base network construction step (10) comprises:
(11) Redeploying the number of building blocks: the numbers of building blocks in stages res-2 to res-5 are redeployed, adjusting the original ResNet-101 block counts {3, 4, 23, 3} for res-2 to res-5 to {8, 8, 9, 8};
The purpose of the convolutional-network encoder is to generate a series of features ranging from high-resolution, low-semantic to low-resolution, high-semantic. The base network is usually an existing convolutional neural network model such as LeNet, AlexNet, VGG, GoogLeNet or ResNet. ResNet-101 uses a large number of residual structures, which solve the vanishing-gradient problem that comes with deeper networks; each residual structure also provides a new path for forward and backward propagation, giving the network very strong expressive power. The present invention uses ResNet-101 as the encoder base network for semantic segmentation.
In the base network, features are extracted from the tail of each encoder stage; for ResNet-101 these are the four stages res-2, res-3, res-4 and res-5, containing {3, 4, 23, 3} building blocks respectively, each building block consisting of three convolutional layers. The first two encoding stages of ResNet-101 thus contain only a few building blocks; such a shallow stack of convolutional layers cannot extract deep semantic features, so the semantics of the low-level features are of poor quality. From res-4 onward, after many deep convolutions, the output features possess strong semantics, and the semantic quality gap between the two adjacent features extracted by res-3 and res-4 is very large. To improve the semantic quality of the low-level features and bring them closer to the supervision signal, a direct approach is to redeploy the numbers of building blocks in stages res-2 to res-5, balancing the number of convolutional layers per stage and reducing the semantic gap between the features output by res-3 and res-4. In the redeployment, the original ResNet-101 block counts {3, 4, 23, 3} for res-2 to res-5 are adjusted to {8, 8, 9, 8}.
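The redeployment above can be checked with a small sketch (the code and names below are illustrative, not from the patent): it contrasts the stock ResNet-101 per-stage block counts with the redeployed counts, each bottleneck building block holding three convolutional layers as the description notes.

```python
# Illustrative sketch: per-stage bottleneck-block counts of stock ResNet-101
# versus the redeployed counts {8, 8, 9, 8}. Stage names res-2..res-5 follow
# the patent description; each building block holds three conv layers.

ORIGINAL = {"res-2": 3, "res-3": 4, "res-4": 23, "res-5": 3}
REDEPLOYED = {"res-2": 8, "res-3": 8, "res-4": 9, "res-5": 8}

def conv_layers_per_stage(blocks):
    """Three convolutional layers per bottleneck building block."""
    return {stage: n * 3 for stage, n in blocks.items()}

def total_blocks(blocks):
    return sum(blocks.values())

print(conv_layers_per_stage(ORIGINAL))    # res-4 alone holds 23 * 3 = 69 layers
print(conv_layers_per_stage(REDEPLOYED))  # depth spread nearly evenly
print(total_blocks(ORIGINAL), total_blocks(REDEPLOYED))  # both 33: same budget
```

Both configurations keep 33 blocks in total, so the redeployment redistributes depth toward the early stages rather than adding capacity; this is consistent with the later observation that redeployment does not by itself strengthen the network's classification capability.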
(12) Enlarging the receptive field: the conventional convolutions of the res-5 stage in the ResNet-101 base network are replaced by atrous (dilated) convolutions with dilation rate 2.
The output resolution of semantic segmentation should match the input image. Although semantic segmentation methods based on fully convolutional networks accept input images of arbitrary resolution, the repeated convolution and pooling operations reduce feature resolution even as they enlarge the receptive field. The shrunken feature maps can be restored to the original image size by upsampling, but this process inevitably loses information that cannot be recovered, and the feature maps recovered by upsampling lose sensitivity to image detail. Moreover, frequent upsampling operations require additional memory and time. The present invention overcomes this problem with the atrous convolution method, which originated in wavelet-transform analysis in signal processing.
The original filter is upsampled by a factor of 2 by inserting zeros between filter values. Although the size of the effective filter increases, the inserted zeros (the "holes") need not be computed, so the number of filter parameters and the amount of computation per position remain unchanged. By changing the dilation-rate parameter r, the size of the receptive field can be adapted, efficiently controlling the feature resolution in a convolutional network without learning additional parameters.
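The zero-insertion view of atrous convolution described above can be sketched directly (an illustrative helper, not the patent's implementation): dilating a k x k filter with rate r inserts r - 1 zeros between adjacent taps, giving an effective kernel of size (k - 1) * r + 1 while the number of non-zero parameters stays fixed.

```python
# Illustrative sketch: build the zero-inserted ("atrous") version of a filter.
# The effective kernel grows to (k - 1) * r + 1 per side, but the count of
# non-zero taps (and hence the per-position work) is unchanged.

def dilate(kernel, r):
    k = len(kernel)
    size = (k - 1) * r + 1
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * r][j * r] = kernel[i][j]
    return out

kernel3 = [[1.0] * 3 for _ in range(3)]  # a 3 x 3 filter
eff = dilate(kernel3, 2)                 # rate-2 atrous version
# effective size 5 x 5, but still only 9 non-zero taps
```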
In a convolutional neural network, after 3 consecutive standard convolutions with 3 × 3 kernels, the receptive field sizes are 3 × 3, 5 × 5 and 7 × 7 respectively. If consecutive convolutions use a constant kernel size of (2d+1) × (2d+1), the receptive field size at layer n is:
f_n = 2dn + 1 (1)
i.e. under standard convolution the receptive field grows linearly. For the atrous convolutions shown in Fig. 2 with 3 × 3 kernels and dilation rates 1, 2 and 4, the receptive fields are 3 × 3, 7 × 7 and 15 × 15 respectively. Suppose consecutive atrous convolutions likewise use a constant kernel size of (2d+1) × (2d+1), with dilation rate r_n at layer n; then the receptive field size is:
f_n = f_{n-1} + 2d·r_n (2)
where n ≥ 2 and f_1 = 2d·r_1 + 1. By recursion:
f_n = 2d·(r_1 + r_2 + … + r_n) + 1 (3)
Letting the dilation rate be r_n = 2^(n-1), the receptive field size becomes:
f_n = 2d·(2^n − 1) + 1 (4)
Thus, by choosing appropriate dilation rates, atrous convolution makes the receptive field grow exponentially. In the base network, the res-5 stage begins using atrous convolution with dilation rate 2; since res5a and res5c in that stage use 1 × 1 kernels, in practice only res5b is rapidly enlarging the receptive field to extract dense features.
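Formulas (1)-(4) can be verified numerically; the small calculator below (illustrative code, not part of the patent) reproduces the linear growth of standard convolution and the exponential growth under dilation rates r_n = 2^(n-1).

```python
# Numeric check of formulas (1)-(4): receptive-field growth for stacks of
# (2d+1) x (2d+1) convolutions, standard versus atrous.

def rf_standard(d, n):
    """Formula (1): f_n = 2dn + 1 after n standard convolutions."""
    return 2 * d * n + 1

def rf_atrous(d, rates):
    """Recursion (2) with f_1 = 2d*r_1 + 1; equivalent to formula (3)."""
    f = 2 * d * rates[0] + 1
    for r in rates[1:]:
        f = f + 2 * d * r
    return f

d = 1  # 3 x 3 kernels
print([rf_standard(d, n) for n in (1, 2, 3)])  # [3, 5, 7] -- linear growth
print([rf_atrous(d, [2 ** i for i in range(n)]) for n in (1, 2, 3)])  # [3, 7, 15]
# Closed form (4) with r_n = 2^(n-1): f_n = 2d(2^n - 1) + 1
print([2 * d * (2 ** n - 1) + 1 for n in (1, 2, 3)])  # [3, 7, 15]
```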
(20) Decoder feature-fusion module construction: a pyramid-style module built on three convolution operations extracts high-level semantics with strong consistency constraints, which are then fused layer by layer, with weighting, into the low-level stage features to obtain a preliminary segmentation heat map;
A decoder-based structure is used to restore image resolution, refining the final prediction by fusing the features of each level along the way. The decoder architecture mainly considers how to recover the spatial information lost through repeated pooling and downsampling. The present invention designs an end module in the decoder architecture mainly to extract the high-level semantic information with the strongest consistency constraint, and fuses it with low-level features under the guidance of the attention mechanism to refine the output.
As shown in Fig. 3, the decoder feature-fusion module construction step (20) comprises:
(21) Extracting end high-level semantic information: a pyramid-like construction module built on three convolution operations, using 3 × 3, 5 × 5 and 7 × 7 convolutions respectively, obtains high-level semantics with the strongest intra-class semantic consistency by fusing context of different scales;
Most previous models execute atrous spatial pyramid pooling, or an atrous spatial pyramid module over a series of scales, at the end of the base network. In current semantic segmentation systems a pyramid structure can extract feature information at different scales and enlarge the receptive field at the pixel level, but such a structure lacks a global context prior, cannot select suitable elements by channel, and may lose important pixel-level information. For example, overly frequent atrous convolution causes loss of local information, and gridding harms the local consistency of feature maps. The pyramid pooling module proposed in PSPNet is even more prone to losing pixel locations in its pooling operations at different scales.
The present invention uses the high-level semantic information extraction module shown in Fig. 4 to extract, from the end of the base network, high-level features with strong intra-class semantic consistency.
By realizing a pyramid-like structure, the module fuses the feature information of three different scales. To extract useful context from the different scales more effectively, 3 × 3, 5 × 5 and 7 × 7 convolutions are used in the module; since the high-level features have low resolution, this does not bring too great a computational burden. By fusing feature information of different scales step by step, the module combines the contextual features of adjacent scales more precisely. The output feature from res-5, after a 1 × 1 convolution, is multiplied channel-wise with the fused feature. The module also introduces an additional global pooling branch connected to the output feature, which further improves semantic segmentation performance in subsequent processing.
Benefiting from the pyramid-like structure, the end high-level semantic information extraction module can fuse contextual information of different scales while generating powerful semantic information for the high-level features. Unlike the pyramid pooling module, which concatenates features of different scales before a channel-reducing convolutional layer, the end high-level semantic information extraction module multiplies the contextual information with the original feature after a simple 1 × 1 convolution operation, introducing little extra computation.
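A minimal numeric sketch of this multiply-based end module follows. It is an assumption-laden toy, not the patent's exact design: box mean filters of width 3, 5 and 7 stand in for the learned convolutions, a per-channel scale stands in for the 1 × 1 convolution, and the progressive fusion order is assumed.

```python
import numpy as np

# Toy sketch of the end module: multi-scale context (3/5/7 box filters as
# stand-ins for learned convs) is fused progressively, multiplied channel-wise
# with a scaled copy of the res-5 feature, and a global-average-pool branch
# is returned alongside. Names, widths and fusion order are assumptions.

rng = np.random.default_rng(0)

def box_filter(x, k):
    """Same-padded k x k mean filter applied per channel of a (C, H, W) map."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def end_module(x, scale):
    fused = np.zeros_like(x)
    for k in (3, 5, 7):                        # pyramid of three scales
        fused = box_filter(fused + x, k)       # progressive fusion
    gated = (scale[:, None, None] * x) * fused # channel-wise multiply
    ctx = x.mean(axis=(1, 2))                  # global pooling branch
    return gated, ctx

x = rng.standard_normal((4, 8, 8))
gated, ctx = end_module(x, np.ones(4))
# gated keeps the input's (C, H, W) shape; ctx holds one value per channel
```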
(22) Fusing contextual features: channel attention vectors are computed by fusing the features of adjacent stages layer by layer; used as weights, they select the strongly discriminative feature information of the low-level stage, which is then blended with the adjacent high-stage feature to obtain the segmentation heat map.
In the base network, ResNet-101 comprises five stages, each generating features of a corresponding scale; the different stages possess different recognition capabilities, leading to varying degrees of consistency. In the low-level stages the network encodes fine spatial information, but the small receptive field and the lack of spatial-context guidance mean the features contain only weak semantic consistency. In the high-level stages, the large receptive field yields powerful intra-class semantic consistency, but the predicted spatial accuracy is coarse. In short, the low-level stages produce more accurate spatial predictions while the high-level stages provide more accurate semantic predictions. The respective advantages of the two can therefore be combined, using the semantic consistency of the high stage to guide its fusion with the low stage and obtain the optimal prediction. The present invention uses the attention mechanism shown in Fig. 5 to guide feature fusion.
This design fuses the features of adjacent stages to compute a channel attention vector that serves as a weight. The high-level feature provides powerful consistency guidance, while the feature provided by the low-level stage carries information of differing discriminative power. The channel attention vector is used for weighting, selecting the strongly discriminative feature information. In a semantic segmentation framework, a convolution operation outputs the final score map, giving the probability that each pixel belongs to each class. The score y_k in the final score map is obtained by summing over all channels of the feature map:
y_k = Σ_{i∈D} ω_i·x_i (5)
where x represents the feature output by the network, ω denotes the convolution kernel, and D is the set of pixel positions.
p_k = exp(y_k) / Σ_{j=1}^{N} exp(y_j) (6)
In formula (6), p is the prediction probability and N is the number of channels. Under formulas (5) and (6), the final predicted label is the class with the highest probability value. Suppose the prediction result for some class is ŷ while the true label is y; then a parameter α is introduced to change the highest-probability prediction from ŷ to y, as shown in formula (7):
y = α·ŷ, α = Sigmoid(ω, x) (7)
where y is the new prediction output, and α = Sigmoid(ω, x) is the Sigmoid output in Fig. 4.
From the above analysis, the deeper implication of the attention mechanism can be seen. Formula (5) implicitly assumes that the weights of the different channels are equal. But, as noted, the features of different stages possess different degrees of discriminative power, leading to predictions of different granularity. To obtain predictions with fine object boundaries, features with discriminative power should be extracted as much as possible while weakly discriminative features are suppressed. The α value in formula (7) is therefore applied to the feature map x, realizing the feature selection of the attention mechanism. With this module, the output can be refined stage by stage to produce the optimal prediction.
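The attention-guided fusion can be sketched numerically as follows (shapes, the concatenate-then-pool layout, and the final sum are assumptions for illustration, not the patent's exact wiring): adjacent-stage features yield a channel attention vector through a sigmoid, which reweights the low-level feature before it is combined with the high-level one.

```python
import numpy as np

# Toy sketch of attention-guided fusion in the spirit of formula (7):
# alpha = sigmoid(W . GAP(concat(low, high))) reweights the low-level feature
# channel by channel. Weight shapes and the additive merge are assumptions.

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_fuse(low, high, w):
    """low, high: (C, H, W) features of adjacent stages; w: (C, 2C) weights."""
    cat = np.concatenate([low, high], axis=0)  # (2C, H, W)
    gap = cat.mean(axis=(1, 2))                # global average pool -> (2C,)
    alpha = sigmoid(w @ gap)                   # channel attention vector (C,)
    return alpha[:, None, None] * low + high, alpha

C, H, W = 8, 16, 16
low = rng.standard_normal((C, H, W))
high = rng.standard_normal((C, H, W))
fused, alpha = attention_fuse(low, high, rng.standard_normal((C, 2 * C)))
# alpha lies in (0, 1) per channel; fused keeps the (C, H, W) shape
```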
(30) Auxiliary loss function construction: additional auxiliary supervision is applied to each fusion output of the decoding stage and superimposed on the main loss after the heat map is upsampled, strengthening the supervised training of the model and yielding the semantic segmentation map.
The present invention improves the loss function of common semantic segmentation methods by adopting a layer-by-layer label supervision strategy: additional auxiliary supervision is applied directly to the feature output by each fusion of the decoding stage, promoting the learning ability of each branch layer in the network model. To generate semantic outputs in the auxiliary branches, each fused feature is forced to learn more semantics before entering the next step as the high-level stage, in the expectation of being more helpful to subsequent fusion. Note that, as with the redeployment of building blocks in the encoder stage, layer-by-layer label supervision by itself cannot improve the classification capability of the convolutional network; only within the semantic segmentation task does this measure force the network to raise the semantic quality of the low-level stage features, thereby helping the output of the decoding stage.
During network training, auxiliary softmax losses against equal-resolution label maps are added at the tails of the feature-fusion modules corresponding to res-2, res-3 and res-4. The total classification loss of the model is the sum of the supervision of the final output and the supervision of the three auxiliary branches.
Given the 3 branches and the final output, there are T = 4 supervision signals in total, and the number of feature channels of each supervised output equals the number of classes N in the training set. The feature F_t after upsampling at the end of the t-th branch has spatial resolution W_t × H_t, and its value at coordinate position (w, h, n) is F_t^{w,h,n}. Weighted softmax cross-entropy losses are attached to the feature maps of the final output and of each branch, with respective weights λ_t, where λ_0 = 1 is the loss weight of the final output and the rest are auxiliary supervision losses. F_t is fed into the softmax function to compute the probability that each pixel in the image belongs to each class; the softmax function layer is given by:
p_t^{w,h,n} = exp(F_t^{w,h,n}) / Σ_{n'=1}^{N} exp(F_t^{w,h,n'}) (8)
The prediction p_t^{w,h,n} is mapped onto the true label P_t^{w,h,n}, and the loss function finally used for training is shown in formula (9):
L = Σ_{t=0}^{T-1} λ_t · ( −(1 / (W_t·H_t)) · Σ_{w,h} Σ_{n=1}^{N} P_t^{w,h,n} · log p_t^{w,h,n} ) (9)
The layer-by-layer label supervision strategy makes gradient optimization smoother and the model easier to train. Each supervised branch possesses strong learning ability and can acquire rich semantic features at its level. Through fusion, the precision of the final segmentation map does not depend on any individual branch.
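The weighted-sum structure of the total loss can be illustrated with a single-pixel toy example (the logits, labels and auxiliary weights below are made-up values, and λ_0 = 1 follows the description; only the structure, a weighted sum of per-branch softmax cross-entropies, is taken from the text).

```python
import math

# Toy illustration of the total loss structure: each of the T = 4 supervised
# outputs contributes a softmax cross-entropy term (formula (8) feeding
# formula (9)), and the total is their lambda-weighted sum.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Per-pixel loss: -log p[label]."""
    return -math.log(softmax(logits)[label])

def total_loss(branch_logits, labels, lambdas):
    """Weighted sum over the main output (t = 0) and auxiliary branches."""
    return sum(l * cross_entropy(z, y)
               for l, z, y in zip(lambdas, branch_logits, labels))

# final output plus three auxiliary branches, one pixel, N = 3 classes
logits = [[2.0, 0.5, -1.0], [1.0, 1.0, 0.0], [0.2, 2.2, -0.3], [0.0, 0.0, 3.0]]
labels = [0, 0, 1, 2]
lambdas = [1.0, 0.4, 0.4, 0.4]  # lambda_0 = 1; auxiliary weights are assumed
print(total_loss(logits, labels, lambdas))
```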
A specific embodiment is given below to verify that the method of the present invention improves the accuracy of image semantic segmentation.
The proposed correction modules were tested on two semantic segmentation datasets, PASCAL VOC 2012 and Cityscapes. The base network is ResNet-101 pre-trained on ImageNet. The experimental hardware platform is a Core i7 processor at 3.6 GHz with 48 GB of memory and an NVIDIA GTX 1080 GPU; the code runs on the TensorFlow deep learning framework.
1. Ablation experiments
This section decomposes the proposed method step by step to verify the effectiveness of each added module. In the following experiments, the resulting data are evaluated and compared on the PASCAL VOC 2012 validation set. First, the original ResNet-101 serves as the base network, and its output is directly upsampled at the end, as shown in Table 1.
Table 1: Effect of augmenting the dataset with random scaling and flipping
Next, the base network is extended to an FCN-based encoder-decoder feature-fusion architecture, with a fusion strategy of cropping after upsampling followed by simple channel-wise summation. To examine the effectiveness of this feature fusion, a series of feature subsets is selected to list the effect of fusing the features of each stage, and the effects before and after redeploying the number of building blocks per stage are compared, as shown in Table 2.
From the 2nd column of Table 2 it is clear that fusing more hierarchy features does gradually improve the output quality of the segmentation system; however, as ever lower-level features are fused, overall performance quickly saturates. The watershed is the res-4 stage of ResNet-101, which contains 23 building blocks totalling 69 convolutional layers, leaving a huge semantic gap between it and the low-level features output by the res-2 and res-3 stages. Because of this gap, fusing the low-level feature output by the res-3 stage improves overall performance almost not at all, and continuing the fusion further yields no significant effect either.
Table 2: Effect of feature fusion before and after redeploying the numbers of building blocks
It follows that fusion between widely differing features is essentially ineffective. The 3rd column of the table shows the feature-fusion effect after the numbers of building blocks of the four stages were redeployed. Initially, the segmentation quality after upsampling the res-5 output feature is slightly worse than before redeployment, but the difference is almost negligible, confirming that redeploying the building blocks does not strengthen the classification capability of the convolutional network itself. Unlike in the 2nd column, performance rises steadily as ever lower-level features are fused; although the pace of improvement is not uniform, it does not saturate as rapidly as in the 2nd column. The redeployment mechanism changes the block counts of the original ResNet-101 stages res-2, res-3, res-4 and res-5 from {3, 4, 23, 3} to {8, 8, 9, 8}, so that the gap between the features output by the stages shrinks and feature fusion works better, ultimately exceeding the pre-redeployment performance by 0.52 percentage points.
Table 3 shows the effectiveness of each component of the entire model.
Table 3: Ablation comparison on the PASCAL VOC 2012 validation set
The semantic information extracted at the top of the network carries strong semantic consistency. Fusing it stage by stage down to the low-level stages under this strong semantic constraint yields finer image semantic features and improves model performance by 1.1%. The attention mechanism is the most important improvement of the whole model. Unlike fusion methods that simply sum features channel-wise, the channel attention vector it generates selects the most discriminative information in the low-level features and thereby sharpens object boundaries well; on top of the preceding configuration it raises model performance by a further 2.06%, the largest contribution of any component module. The final layer-by-layer label supervision refines the fused features at every level, pushing each fused feature closer to the supervision signal, and lifts overall performance by 0.43%. Besides generating the high-level semantic information, the end module has a branch that outputs a global-pooling feature; using this feature to further constrain the fused output of the res-2 low-level features strengthens the semantic consistency of the model across all pixels of a target when processing an image. The global-pooling branch improves model performance by 0.96% and is of real value.
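The channel-attention-guided fusion step can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the patented implementation: the attention vector here is taken as the sigmoid of the global average pool of the high-level feature, nearest-neighbour upsampling stands in for the real upsampling, and the module's learned convolutions are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_fuse(low, high):
    """Fuse a high-resolution low-level feature map with a low-resolution
    high-level one, weighting the low-level channels by a channel-attention
    vector derived from the high-level feature.

    low  : (C, H, W) low-level feature (high resolution)
    high : (C, h, w) high-level feature (low resolution); h, w divide H, W
    """
    C, H, W = low.shape
    # Nearest-neighbour upsampling of the high-level feature to (C, H, W).
    up = high.repeat(H // high.shape[1], axis=1).repeat(W // high.shape[2], axis=2)
    # Channel attention vector from the global average pool of the high level.
    attn = sigmoid(high.mean(axis=(1, 2)))          # shape (C,)
    # Select the discriminative low-level channels, then blend with the high level.
    return attn[:, None, None] * low + up

low = np.random.rand(4, 8, 8)
high = np.random.rand(4, 4, 4)
fused = attention_fuse(low, high)
print(fused.shape)  # (4, 8, 8)
```

The key difference from plain channel-wise summation is the per-channel weight `attn`, which suppresses low-level channels the high-level semantics deem uninformative.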
2. Qualitative analysis
Table 4 visualizes the image semantic segmentation results of several comparison methods.
Table 4. Visualization of segmentation results on selected images
In the third and fifth columns of the table, the FCN method misidentifies local regions of objects. The original image in the third column shows three cows, two of which are comparatively small in scale. The FCN baseline exhibits the problem of misidentifying local regions of the target object: the front legs of the larger cow are visually close to the ground and slightly complex in appearance, and although FCN segments them to some extent, it misclassifies the hooves as horse. For the two smaller cows, many pixel regions are misclassified, and the misclassified regions are again labelled as horse; presumably the cows in the training set are generally large in scale, and the model cannot handle smaller, similar targets well. The proposed method performs almost ideally here, coping well with FCN's loss of image detail, which causes the misclassification of local pixels of ground objects. The fifth-column image shows a white horse beside a white railing; visually the horse's back and legs are occluded by the railing. Because the colours are close, FCN fails to identify the parts of the horse above and below the railing, and the occluded regions come out blurry. The present invention is nearly perfect here: apart from judgment errors on a few pixels, there are no problems.
In the first, second and fourth columns, the FCN method produces ambiguous object boundaries. The first-column input image is a sheep whose light-toned parts, together with the earth background, are close to peak-brightness white in the nearly black-and-white scene. In the FCN result, part of the light-toned earth background is misidentified as part of the animal's body, and the misidentified regions are scattered. The segmentation map obtained by the present invention on the basis of the attention mechanism eliminates these scattered misidentifications well; the boundaries are very clear and the constraint on the segmentation is excellent. The cabinet with the monitor in the second column and the racehorse in the fourth column are similar cases.
3. Quantitative evaluation
The present invention ran experiments with several methods on the augmented PASCAL VOC 2012 and Cityscapes datasets and quantitatively compared the results. Test results are shown in Table 5.
Table 5. Per-class accuracy of the attention-mechanism model on the PASCAL VOC 2012 test set
Compared with DeepLab, the present invention achieves higher per-class accuracy in about half of the categories, in some of them by a wide margin, and its final overall accuracy is slightly higher. Compared with the state-of-the-art LRR method, the present invention has higher accuracy in most categories: for bicycle, boat, bottle, chair, potted plant, sofa, TV and similar classes it is more than 3% higher than LRR, in some cases even 15% to 20% higher. These are precisely the categories that are hard to segment and easy to confuse. Guided by high-level semantics, the proposed method meticulously fuses features from multiple low levels, so it has an advantage in feature extraction when handling categories rich in semantic detail, such as bicycle, chair and potted plant: the segmented targets show strong semantic consistency, and problems such as local-region misidentification rarely occur. For categories whose targets have similar appearance, such as cow, sheep and dog, it can likewise distinguish the complex semantic classes.
Finally, the method of the invention is also assessed on the Cityscapes dataset. During training, every image is cropped to 800 × 800; observation shows that for high-resolution images, large crops are very useful. The performance of the model on the test set is given in Table 6. As on PASCAL VOC 2012, the present invention achieves the best segmentation results on most object classes and outperforms the other methods in the final overall score.
Table 6. Per-class accuracy of the attention-mechanism model on the Cityscapes test set
The present invention uses a convolutional encoder to embed semantic information of different levels into feature maps, then uses a decoder to integrate and refine each feature map and produce the final segmentation result.
The encoder is a pre-trained convolutional model that extracts image features. Its topmost features are highly semantic but, lacking resolution, are insufficient for reconstructing the fine details of the segmentation map, while the features at the bottom of the encoder carry high-resolution detail but lack strong semantic information. The encoder redeploys the number of building blocks in each stage to even out the semantic differences between stage features, and uses dilated convolution with a dilation rate of 2 in the res5b block. A high-level semantic information extraction module at the end of the network generates a strong semantic-consistency constraint; in the decoding stage the attention mechanism fuses, layer by layer and top-down, the low-resolution high-level features with the high-resolution low-level features, using the strong semantic consistency of the high-level features to guide the fusion and thereby produce a high-resolution semantic result.
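The effect of the rate-2 dilated convolution in res5b on the kernel's spatial extent follows the standard relation k_eff = k + (k − 1)(d − 1). A small helper (illustrative only, not code from the patent) makes the numbers concrete:

```python
def effective_kernel(k, dilation):
    """Spatial extent of a k x k convolution kernel with the given dilation
    rate: k + (k - 1) * (dilation - 1)."""
    return k + (k - 1) * (dilation - 1)

# An ordinary 3x3 convolution covers a 3x3 window ...
print(effective_kernel(3, 1))  # 3
# ... while the rate-2 dilated 3x3 convolution used in res5b covers 5x5,
# enlarging the receptive field without adding parameters or further
# reducing the feature-map resolution.
print(effective_kernel(3, 2))  # 5
```

This is why dilation is used in the last encoder stage: the deepest features keep a larger context while their spatial resolution is preserved for the decoder.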
Claims (3)
1. An image semantic segmentation method based on attention-mechanism-guided feature fusion, characterized by comprising the following steps:
(10) Encoder base-network construction: use an improved ResNet-101 to generate a series of features ranging from high-resolution, weakly semantic to low-resolution, strongly semantic;
(20) Decoder feature-fusion module construction: use a pyramid-structure module based on a three-layer convolution operation to extract high-level semantics carrying a consistency constraint, then perform layer-by-layer weighted fusion with the low-level stage features to obtain a preliminary segmentation heat map;
(30) Auxiliary loss function construction: attach additional auxiliary supervision to each fusion output of the decoding stage, add it to the main supervision loss computed on the upsampled heat map, and strengthen the stepwise training of the model to obtain the semantic segmentation map.
2. The image semantic segmentation method according to claim 1, characterized in that
the encoder base-network construction step (10) comprises:
(11) Redeploying the building-block counts: redeploy the number of building blocks in each of stages res-2 to res-5, adjusting the original ResNet-101 block counts {3, 4, 23, 3} for res-2 to res-5 to {8, 8, 9, 8};
(12) Enlarging the receptive field: replace the conventional convolution of the res-5 stage of the ResNet-101 base network with dilated convolution with a dilation rate of 2.
3. The image semantic segmentation method according to claim 1, characterized in that
the decoder feature-fusion module construction step (20) comprises:
(21) End high-level semantic information extraction: use a pyramid-like construction module based on a three-layer convolution operation, applying 3 × 3, 5 × 5 and 7 × 7 convolutions within the module, and fuse context at different scales to obtain high-level semantics with the strongest intra-class semantic consistency;
(22) Context feature integration: fuse the features of adjacent stages layer by layer, compute a channel attention vector, use it as weights to select the strongly discriminative feature information of the low-level stage, and blend this with the adjacent higher-stage feature to obtain the preliminary segmentation heat map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910391452.XA CN110210485A (en) | 2019-05-13 | 2019-05-13 | The image, semantic dividing method of Fusion Features is instructed based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910391452.XA CN110210485A (en) | 2019-05-13 | 2019-05-13 | The image, semantic dividing method of Fusion Features is instructed based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110210485A true CN110210485A (en) | 2019-09-06 |
Family
ID=67785851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910391452.XA Pending CN110210485A (en) | 2019-05-13 | 2019-05-13 | The image, semantic dividing method of Fusion Features is instructed based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110210485A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190752A (en) * | 2018-07-27 | 2019-01-11 | 国家新闻出版广电总局广播科学研究院 | The image, semantic dividing method of global characteristics and local feature based on deep learning |
CN109284670A (en) * | 2018-08-01 | 2019-01-29 | 清华大学 | A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism |
CN109461157A (en) * | 2018-10-19 | 2019-03-12 | 苏州大学 | Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field |
CN109635694A (en) * | 2018-12-03 | 2019-04-16 | 广东工业大学 | A kind of pedestrian detection method, device, equipment and computer readable storage medium |
Non-Patent Citations (4)
Title |
---|
CHANGQIAN YU ET AL: "Learning a Discriminative Feature Network for Semantic Segmentation", arXiv:1804.09337v1 * |
HANCHAO LI ET AL: "Pyramid Attention Network for Semantic Segmentation", arXiv:1805.10180v3 * |
HENGSHUANG ZHAO ET AL: "Pyramid Scene Parsing Network", arXiv:1612.01105v2 * |
NING QINGQUN: "Research on Fast and Robust Image Semantic Segmentation Algorithms", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Cited By (56)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675405A (en) * | 2019-09-12 | 2020-01-10 | 电子科技大学 | Attention mechanism-based one-shot image segmentation method |
CN110675405B (en) * | 2019-09-12 | 2022-06-03 | 电子科技大学 | Attention mechanism-based one-shot image segmentation method |
CN110689061A (en) * | 2019-09-19 | 2020-01-14 | 深动科技(北京)有限公司 | Image processing method, device and system based on alignment feature pyramid network |
CN110689061B (en) * | 2019-09-19 | 2023-04-28 | 小米汽车科技有限公司 | Image processing method, device and system based on alignment feature pyramid network |
CN110705457A (en) * | 2019-09-29 | 2020-01-17 | 核工业北京地质研究院 | Remote sensing image building change detection method |
CN110705457B (en) * | 2019-09-29 | 2024-01-19 | 核工业北京地质研究院 | Remote sensing image building change detection method |
CN111104962B (en) * | 2019-11-05 | 2023-04-18 | 北京航空航天大学青岛研究院 | Semantic segmentation method and device for image, electronic equipment and readable storage medium |
CN111104962A (en) * | 2019-11-05 | 2020-05-05 | 北京航空航天大学青岛研究院 | Semantic segmentation method and device for image, electronic equipment and readable storage medium |
CN111158068A (en) * | 2019-12-31 | 2020-05-15 | 哈尔滨工业大学(深圳) | Short-term prediction method and system based on simple convolutional recurrent neural network |
CN111222580A (en) * | 2020-01-13 | 2020-06-02 | 西南科技大学 | High-precision crack detection method |
CN111292330A (en) * | 2020-02-07 | 2020-06-16 | 北京工业大学 | Image semantic segmentation method and device based on coder and decoder |
CN111340046A (en) * | 2020-02-18 | 2020-06-26 | 上海理工大学 | Visual saliency detection method based on feature pyramid network and channel attention |
US11361534B2 (en) | 2020-02-24 | 2022-06-14 | Dalian University Of Technology | Method for glass detection in real scenes |
WO2021169049A1 (en) * | 2020-02-24 | 2021-09-02 | 大连理工大学 | Method for glass detection in real scene |
CN111508263A (en) * | 2020-04-03 | 2020-08-07 | 西安电子科技大学 | Intelligent guiding robot for parking lot and intelligent guiding method |
CN111488884A (en) * | 2020-04-28 | 2020-08-04 | 东南大学 | Real-time semantic segmentation method with low calculation amount and high feature fusion |
CN111626300A (en) * | 2020-05-07 | 2020-09-04 | 南京邮电大学 | Image semantic segmentation model and modeling method based on context perception |
CN111626300B (en) * | 2020-05-07 | 2022-08-26 | 南京邮电大学 | Image segmentation method and modeling method of image semantic segmentation model based on context perception |
CN111598174A (en) * | 2020-05-19 | 2020-08-28 | 中国科学院空天信息创新研究院 | Training method of image ground feature element classification model, image analysis method and system |
CN111767922A (en) * | 2020-05-22 | 2020-10-13 | 上海大学 | Image semantic segmentation method and network based on convolutional neural network |
CN111767922B (en) * | 2020-05-22 | 2023-06-13 | 上海大学 | Image semantic segmentation method and network based on convolutional neural network |
CN111626196A (en) * | 2020-05-27 | 2020-09-04 | 成都颜禾曦科技有限公司 | Typical bovine animal body structure intelligent analysis method based on knowledge graph |
CN111680695A (en) * | 2020-06-08 | 2020-09-18 | 河南工业大学 | Semantic segmentation method based on reverse attention model |
CN111832453B (en) * | 2020-06-30 | 2023-10-27 | 杭州电子科技大学 | Unmanned scene real-time semantic segmentation method based on two-way deep neural network |
CN111832453A (en) * | 2020-06-30 | 2020-10-27 | 杭州电子科技大学 | Unmanned scene real-time semantic segmentation method based on double-path deep neural network |
CN111932553A (en) * | 2020-07-27 | 2020-11-13 | 北京航空航天大学 | Remote sensing image semantic segmentation method based on area description self-attention mechanism |
CN111915627A (en) * | 2020-08-20 | 2020-11-10 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Semantic segmentation method, network, device and computer storage medium |
CN111915627B (en) * | 2020-08-20 | 2021-04-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Semantic segmentation method, network, device and computer storage medium |
CN112052783B (en) * | 2020-09-02 | 2024-04-09 | 中南大学 | High-resolution image weak supervision building extraction method combining pixel semantic association and boundary attention |
CN112101363A (en) * | 2020-09-02 | 2020-12-18 | 河海大学 | Full convolution semantic segmentation system and method based on cavity residual error and attention mechanism |
CN112052783A (en) * | 2020-09-02 | 2020-12-08 | 中南大学 | High-resolution image weak supervision building extraction method combining pixel semantic association and boundary attention |
CN111898709A (en) * | 2020-09-30 | 2020-11-06 | 中国人民解放军国防科技大学 | Image classification method and device |
CN112183448A (en) * | 2020-10-15 | 2021-01-05 | 中国农业大学 | Hulled soybean image segmentation method based on three-level classification and multi-scale FCN |
CN112183448B (en) * | 2020-10-15 | 2023-05-12 | 中国农业大学 | Method for dividing pod-removed soybean image based on three-level classification and multi-scale FCN |
CN112215235B (en) * | 2020-10-16 | 2024-04-26 | 深圳华付技术股份有限公司 | Scene text detection method aiming at large character spacing and local shielding |
CN112215235A (en) * | 2020-10-16 | 2021-01-12 | 深圳市华付信息技术有限公司 | Scene text detection method aiming at large character spacing and local shielding |
CN112241762A (en) * | 2020-10-19 | 2021-01-19 | 吉林大学 | Fine-grained identification method for pest and disease damage image classification |
CN113436127A (en) * | 2021-03-25 | 2021-09-24 | 上海志御软件信息有限公司 | Method and device for constructing automatic liver segmentation model based on deep learning, computer equipment and storage medium |
CN113255675A (en) * | 2021-04-13 | 2021-08-13 | 西安邮电大学 | Image semantic segmentation network structure and method based on expanded convolution and residual path |
CN112906829A (en) * | 2021-04-13 | 2021-06-04 | 成都四方伟业软件股份有限公司 | Digital recognition model construction method and device based on Mnist data set |
CN113255675B (en) * | 2021-04-13 | 2023-10-10 | 西安邮电大学 | Image semantic segmentation network structure and method based on expanded convolution and residual path |
CN113111848B (en) * | 2021-04-29 | 2024-07-02 | 东南大学 | Human body image analysis method based on multi-scale features |
CN113111848A (en) * | 2021-04-29 | 2021-07-13 | 东南大学 | Human body image analysis method based on multi-scale features |
CN113393521A (en) * | 2021-05-19 | 2021-09-14 | 中国科学院声学研究所南海研究站 | High-precision flame positioning method and system based on double-semantic attention mechanism |
CN113393521B (en) * | 2021-05-19 | 2023-05-05 | 中国科学院声学研究所南海研究站 | High-precision flame positioning method and system based on dual semantic attention mechanism |
CN113744279A (en) * | 2021-06-09 | 2021-12-03 | 东北大学 | Image segmentation method based on FAF-Net network |
CN113744279B (en) * | 2021-06-09 | 2023-11-14 | 东北大学 | Image segmentation method based on FAF-Net network |
CN113657388A (en) * | 2021-07-09 | 2021-11-16 | 北京科技大学 | Image semantic segmentation method fusing image super-resolution reconstruction |
CN113657388B (en) * | 2021-07-09 | 2023-10-31 | 北京科技大学 | Image semantic segmentation method for super-resolution reconstruction of fused image |
CN113837965A (en) * | 2021-09-26 | 2021-12-24 | 北京百度网讯科技有限公司 | Image definition recognition method and device, electronic equipment and storage medium |
CN114626666A (en) * | 2021-12-11 | 2022-06-14 | 国网湖北省电力有限公司经济技术研究院 | Engineering field progress identification system based on full-time-space monitoring |
CN114332723B (en) * | 2021-12-31 | 2024-03-22 | 北京工业大学 | Video behavior detection method based on semantic guidance |
CN114332723A (en) * | 2021-12-31 | 2022-04-12 | 北京工业大学 | Video behavior detection method based on semantic guidance |
CN116091363A (en) * | 2023-04-03 | 2023-05-09 | 南京信息工程大学 | Handwriting Chinese character image restoration method and system |
CN117392392A (en) * | 2023-12-13 | 2024-01-12 | 河南科技学院 | Rubber cutting line identification and generation method |
CN117392392B (en) * | 2023-12-13 | 2024-02-13 | 河南科技学院 | Rubber cutting line identification and generation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110210485A (en) | The image, semantic dividing method of Fusion Features is instructed based on attention mechanism | |
CN111047551B (en) | Remote sensing image change detection method and system based on U-net improved algorithm | |
CN110110751A (en) | A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism | |
CN108509978A (en) | The multi-class targets detection method and model of multi-stage characteristics fusion based on CNN | |
CN108399380A (en) | A kind of video actions detection method based on Three dimensional convolution and Faster RCNN | |
CN110287960A (en) | The detection recognition method of curve text in natural scene image | |
CN110689599B (en) | 3D visual saliency prediction method based on non-local enhancement generation countermeasure network | |
CN109905624A (en) | A kind of video frame interpolation method, device and equipment | |
CN107808132A (en) | A kind of scene image classification method for merging topic model | |
CN109829537B (en) | Deep learning GAN network children's garment based style transfer method and equipment | |
CN113240691A (en) | Medical image segmentation method based on U-shaped network | |
CN110490082A (en) | A kind of road scene semantic segmentation method of effective integration neural network characteristics | |
CN106778768A (en) | Image scene classification method based on multi-feature fusion | |
CN114724222B (en) | AI digital human emotion analysis method based on multiple modes | |
CN110363770A (en) | A kind of training method and device of the infrared semantic segmentation model of margin guide formula | |
CN112270366B (en) | Micro target detection method based on self-adaptive multi-feature fusion | |
CN112215847A (en) | Method for automatically segmenting overlapped chromosomes based on counterstudy multi-scale features | |
CN112528961A (en) | Video analysis method based on Jetson Nano | |
CN109447897A (en) | A kind of real scene image composition method and system | |
CN113379707A (en) | RGB-D significance detection method based on dynamic filtering decoupling convolution network | |
CN114529940A (en) | Human body image generation method based on posture guidance | |
CN109766918A (en) | Conspicuousness object detecting method based on the fusion of multi-level contextual information | |
CN114332094A (en) | Semantic segmentation method and device based on lightweight multi-scale information fusion network | |
CN113255678A (en) | Road crack automatic identification method based on semantic segmentation | |
Al-Amaren et al. | RHN: A residual holistic neural network for edge detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190906 |