CN110472493A - Scene segmentation method and system based on consensus features - Google Patents

Scene segmentation method and system based on consensus features

Info

Publication number
CN110472493A
CN110472493A
Authority
CN
China
Prior art keywords
consistency
feature
characteristic spectrum
transformation
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910604601.6A
Other languages
Chinese (zh)
Other versions
CN110472493B (en)
Inventor
唐胜
伍天意
李锦涛
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201910604601.6A priority Critical patent/CN110472493B/en
Publication of CN110472493A publication Critical patent/CN110472493A/en
Application granted granted Critical
Publication of CN110472493B publication Critical patent/CN110472493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/35 - Categorising the entire scene, e.g. birthday party or wedding scene

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present invention proposes a scene segmentation method and system based on consensus features (Consensus Features). The method applies an instance consensus transform and a category consensus transform to the features learned by a feature extractor, and inputs the transformed features into a scene segmentation sub-network to obtain the scene segmentation result of the original image. The invention proposes an instance consensus transform unit to learn instance-level consensus features. On the other hand, since a scene image contains multiple instances of the same class, the invention uses a category consensus transform unit to learn category-level consensus features. The two units greatly improve the performance of existing fully convolutional scene segmentation models.

Description

Scene segmentation method and system based on consensus features
Technical field
The present invention relates to the fields of machine learning and computer vision, and in particular to a scene segmentation method and system based on Kronecker convolution and a tree-structured feature aggregation module.
Background technique
Semantic segmentation has become an important component of scene understanding and plays a vital role in many application fields, such as automatic driving, autonomous navigation and virtual reality. The goal of semantic segmentation is to classify every pixel in an image. Deep convolutional neural networks have made remarkable progress in the field of semantic segmentation, and most current popular semantic segmentation works are based on image classification networks. However, using an image classification network as the feature extractor of a semantic segmentation model has certain limitations. Image classification networks tend to learn an image-level representation of the entire input sample. Previous work has shown that such image-level representations are often dominated by the most discriminative regions of foreground or salient objects, such as the head of a horse or the face of a dog. The goal of semantic segmentation, however, is to classify every pixel in an image, so pixel-level representations are required. Using an image classification network for semantic segmentation therefore directly leads to two defects, as shown in Fig. 2(d): (1) the intra-class features of all spatial positions of a dominant object are inconsistent, causing inconsistent segmentation results within the same instance; (2) the features of less distinguishable regions (such as secondary objects or stuff regions) are easily confused, causing inconsistent segmentation results among similar instances.
To solve the above problems, it is desirable to learn pixel-level consensus features. Consensus features are inspired by related work on neighbourhood consensus, which is used in the field of object matching to find reliable dense correspondences between a pair of images. In the present invention, the goal is to learn consensus features, meaning that all features within one instance, or within instances of the same class, are indistinguishable. As shown in Fig. 2(a): (1) the features of different regions within the same instance (such as B and C) should maintain instance-level consistency; (2) the features of different regions of different instances of the same category (for example, A and C) should maintain category-level consistency. To learn consensus features, the present invention proposes two consensus transform units: an Instance Consensus Transform unit (ICT unit) and a Category Consensus Transform unit (CCT unit). The instance consensus transform unit is expected to learn instance-level consensus features. In particular, a lightweight Local Network (LN) is introduced to use the surrounding contextual information to generate instance-level transform parameters for each pixel, and these parameters are then used to aggregate the features within the same instance. On the other hand, since a scene image contains multiple instances of the same class, the present invention uses a category consensus transform unit to pursue category-level consensus features. Specifically, a lightweight Global Network (GN) is introduced to generate the category-level consensus transform parameters. Unlike the LN, the GN needs to model the interactions between any position and all other positions.
The two proposed units are learned in a data-driven manner and require no additional supervision during training. The two units are used to update the features of all positions: for each position, both units adaptively enhance the information of relevant positions (foreground regions) and suppress irrelevant information (background regions). The consensus features are therefore indistinguishable within foreground regions and invariant to changes in the background. Compared with the feature map learned by the baseline model in Fig. 2(c), the features learned by the proposed method are more cohesive at the instance level and the category level, as shown in Fig. 2(e). Meanwhile, the inconsistent segmentation results in Fig. 2(d) are corrected by the proposed method, as shown in Fig. 2(f). Based on the proposed instance consensus transform unit and category consensus transform unit, a semantic segmentation framework called the Consensus Feature Network (CFNet) is proposed to learn pixel-level consensus features and obtain consistent segmentation results. Accuracies exceeding the current best methods on four well-known semantic segmentation datasets, including Cityscapes and PASCAL-Context, demonstrate the effectiveness of the proposed method.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a scene segmentation method based on consensus features, comprising:
Step 1: using a residual network as the feature extractor, extracting a local feature map from the original image, and applying an instance consensus transform to this local feature map to obtain instance-level consensus features;
Step 2: applying a category consensus transform to the instance-level consensus features to obtain category-level consensus features;
Step 3: taking the category-level consensus features as input and outputting the scene segmentation result of the original image through a scene segmentation sub-network.
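A minimal, illustrative sketch of this three-step pipeline is given below. The class name CFNetSketch, the choice of a ResNet-50 backbone and the 1 × 1 classifier head are our assumptions, not the patent's reference implementation; the ICT and CCT modules (sketched after the descriptions of the two transform units below) are passed in as sub-modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class CFNetSketch(nn.Module):
    def __init__(self, ict: nn.Module, cct: nn.Module, num_classes: int, feat_channels: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50()                     # residual feature extractor (step 1)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])   # keep conv stages, drop avgpool/fc
        self.ict = ict                                               # instance consensus transform (step 1)
        self.cct = cct                                               # category consensus transform (step 2)
        self.head = nn.Conv2d(feat_channels, num_classes, 1)         # scene segmentation sub-network (step 3)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.stem(image)          # local feature map extracted from the original image
        x = self.ict(x)               # instance-level consensus features
        x = self.cct(x)               # category-level consensus features
        logits = self.head(x)         # per-pixel class scores
        # resize back to the input resolution to form the segmentation result
        return F.interpolate(logits, size=image.shape[-2:], mode="bilinear", align_corners=False)
```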
In the scene segmentation method based on consensus features, the instance consensus transform comprises:
For a local feature map X ∈ R^(C×H×W), where C is the number of channels and H × W is the spatial size, first reducing the dimensionality of the local feature map with a 1 × 1 convolution to obtain a feature map P ∈ R^(C1×H×W), where C1 is the number of channels of this feature map; after obtaining the feature map P, generating the instance consensus transform parameters θ ∈ R^(r²×H×W), where r denotes the size of the local region centered at the current spatial position, and the size of the parameters θ changes with the local region size r;
Reshaping θ to obtain θ′ ∈ R^(N×r²), where N = H × W; applying an unfold (expansion) operation to the feature map P to extract the features of the r × r sliding-window block around each position and reshaping the result to obtain a feature map P′ ∈ R^(N×C1×r²); computing a new feature map Q = f(θ′, P′), where the function f(x, y) denotes the element-wise multiplication of tensors x and y followed by summation over the last dimension;
Reshaping the feature map Q to obtain the instance-level consensus features.
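As an illustration, a minimal PyTorch sketch of this instance consensus transform follows. The class name ICTUnit, the default window size r = 7 and the 'same' padding of the r × r convolution are our assumptions; only the operations named in the text (1 × 1 reduction, the two-layer local network, softmax normalization, the unfold operation and the windowed weighted sum) are taken from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICTUnit(nn.Module):
    """Instance Consensus Transform (ICT): a minimal sketch following the text.
    The r x r convolution of the local network is assumed to use 'same' padding
    so that theta keeps the spatial size H x W (r is assumed odd)."""

    def __init__(self, in_channels: int, mid_channels: int, r: int = 7):
        super().__init__()
        self.r = r
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1)          # 1x1 dimensionality reduction -> P
        self.local_net = nn.Sequential(                                 # lightweight local network (LN)
            nn.Conv2d(mid_channels, mid_channels, r, padding=r // 2),   # r x r: captures surrounding context
            nn.Conv2d(mid_channels, r * r, 1),                          # 1x1: emits r^2 parameters per position
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        p = self.reduce(x)                                      # P in R^{B x C1 x H x W}
        theta = F.softmax(self.local_net(p), dim=1)             # normalized parameters, B x r^2 x H x W
        # unfold extracts the r x r sliding-window block around every position of P
        p_unf = F.unfold(p, kernel_size=self.r, padding=self.r // 2)    # B x (C1*r^2) x N, N = H*W
        p_unf = p_unf.view(b, -1, self.r * self.r, h * w)               # B x C1 x r^2 x N
        theta = theta.view(b, 1, self.r * self.r, h * w)                # B x 1  x r^2 x N
        q = (theta * p_unf).sum(dim=2)                          # element-wise product, summed over the window
        return q.view(b, -1, h, w)                              # instance-level consensus feature map
```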
The scene segmentation method based on consensus features further comprises: constructing a category consensus transform unit and using the category consensus transform unit to apply the category consensus transform to the instance-level consensus features:
The category consensus transform unit comprises two bidirectional long short-term memory (BiLSTM) networks and one convolutional layer, which instantiate a global network;
The first BiLSTM scans the feature map of the instance-level consensus features in a bottom-up and top-down manner; the BiLSTM is specifically:
the scanning rule is the standard LSTM recurrence, and H_t is the output state at time step t;
Concatenating the hidden states of the two scan directions to obtain a mixed feature map H1, and using the second BiLSTM to scan this feature map H1 in the horizontal direction; concatenating the forward and backward states to obtain a mixed feature map H2; inputting this feature map H2 to the convolutional layer to obtain the category-level consensus transform parameters; normalizing these parameters with an activation function to obtain the transform parameters φ;
A new feature map F can then be generated as F = g(φ, E),
where the function g(x, y) is the tensor multiplication between tensors x and y and E is the feature map; the new feature map F is reshaped to obtain the feature map of the category-level consensus features.
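A minimal sketch of this category consensus transform is given below, under two stated assumptions: the BiLSTMs keep one hidden state per pixel (each column and then each row is treated as a sequence, a ReNet-style reading of the bottom-up/top-down and horizontal scans), and the spatial size is fixed at construction time, because the 1 × 1 convolution has to output N = H × W transform parameters per position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCTUnit(nn.Module):
    """Category Consensus Transform (CCT): a minimal sketch under the stated assumptions."""

    def __init__(self, channels: int, spatial_size: tuple, hidden: int = 64):
        super().__init__()
        self.h, self.w = spatial_size
        n = self.h * self.w
        self.vertical = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)      # bottom-up / top-down scan
        self.horizontal = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)  # horizontal scan
        self.to_phi = nn.Conv2d(2 * hidden, n, 1)       # 1x1 conv -> category-level transform parameters

    def forward(self, e: torch.Tensor) -> torch.Tensor:  # e: B x C x H x W
        b, c, h, w = e.shape
        assert (h, w) == (self.h, self.w), "spatial size must match the one given at construction"
        # vertical scan: each column is a length-H sequence; concatenated states form H1
        seq = e.permute(0, 3, 2, 1).reshape(b * w, h, c)
        h1, _ = self.vertical(seq)
        h1 = h1.reshape(b, w, h, -1).permute(0, 2, 1, 3)          # B x H x W x 2*hidden
        # horizontal scan over H1: each row is a length-W sequence; concatenated states form H2
        seq = h1.reshape(b * h, w, -1)
        h2, _ = self.horizontal(seq)
        h2 = h2.reshape(b, h, w, -1).permute(0, 3, 1, 2)          # B x 2*hidden x H x W
        phi = F.softmax(self.to_phi(h2), dim=1)                    # B x N x H x W, normalized over positions
        # F_ij = sum over all positions (p, q) of phi_ij(p, q) * E(:, p, q)
        e_flat = e.reshape(b, c, h * w)                            # B x C x N
        phi_flat = phi.reshape(b, h * w, h * w)                    # B x N(source) x N(output)
        f = torch.bmm(e_flat, phi_flat)                            # B x C x N
        return f.reshape(b, c, h, w)                               # category-level consensus feature map
```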
In the scene segmentation method based on consensus features, the activation function is the Softmax function.
In the scene segmentation method based on consensus features, the convolutional layer uses a 1 × 1 convolution kernel.
The invention also provides a scene segmentation system based on consensus features, comprising:
Module 1: using a residual network as the feature extractor, extracting a local feature map from the original image, and applying an instance consensus transform to this local feature map to obtain instance-level consensus features;
Module 2: applying a category consensus transform to the instance-level consensus features to obtain category-level consensus features;
Module 3: taking the category-level consensus features as input and outputting the scene segmentation result of the original image through a scene segmentation sub-network.
In the scene segmentation system based on consensus features, the instance consensus transform comprises:
For a local feature map X ∈ R^(C×H×W), where C is the number of channels and H × W is the spatial size, first reducing the dimensionality of the local feature map with a 1 × 1 convolution to obtain a feature map P ∈ R^(C1×H×W), where C1 is the number of channels of this feature map; after obtaining the feature map P, generating the instance consensus transform parameters θ ∈ R^(r²×H×W), where r denotes the size of the local region centered at the current spatial position, and the size of the parameters θ changes with the local region size r;
Reshaping θ to obtain θ′ ∈ R^(N×r²), where N = H × W; applying an unfold (expansion) operation to the feature map P to extract the features of the r × r sliding-window block around each position and reshaping the result to obtain a feature map P′ ∈ R^(N×C1×r²); computing a new feature map Q = f(θ′, P′), where the function f(x, y) denotes the element-wise multiplication of tensors x and y followed by summation over the last dimension;
Reshaping the feature map Q to obtain the instance-level consensus features.
The scene segmentation system based on consensus features further comprises: constructing a category consensus transform unit and using the category consensus transform unit to apply the category consensus transform to the instance-level consensus features:
The category consensus transform unit comprises two bidirectional long short-term memory (BiLSTM) networks and one convolutional layer, which instantiate a global network;
The first BiLSTM scans the feature map of the instance-level consensus features in a bottom-up and top-down manner; the BiLSTM is specifically:
the scanning rule is the standard LSTM recurrence, and H_t is the output state at time step t;
Concatenating the hidden states of the two scan directions to obtain a mixed feature map H1, and using the second BiLSTM to scan this feature map H1 in the horizontal direction; concatenating the forward and backward states to obtain a mixed feature map H2; inputting this feature map H2 to the convolutional layer to obtain the category-level consensus transform parameters; normalizing these parameters with an activation function to obtain the transform parameters φ;
A new feature map F can then be generated as F = g(φ, E),
where the function g(x, y) is the tensor multiplication between tensors x and y and E is the feature map; the new feature map F is reshaped to obtain the feature map of the category-level consensus features.
In the scene segmentation system based on consensus features, the activation function is the Softmax function.
In the scene segmentation system based on consensus features, the convolutional layer uses a 1 × 1 convolution kernel.
It can be seen from the above scheme that the present invention has the following advantages:
The scene segmentation method and system based on consensus features (Consensus Features) apply an instance consensus transform and a category consensus transform to the features learned by the feature extractor, and the transformed features are input to a scene segmentation sub-network to obtain the scene segmentation result of the original image. The invention proposes an instance consensus transform unit to learn instance-level consensus features. On the other hand, since a scene image contains multiple instances of the same category, the invention uses a category consensus transform unit to learn category-level consensus features. The two units greatly improve the performance of existing fully convolutional scene segmentation models.
Detailed description of the invention
Fig. 1 is a structural diagram of the present invention;
Fig. 2 is a schematic diagram of prior-art semantic segmentation results obtained with an image classification network.
Specific embodiment
To make the above features and effects of the present invention clearer and easier to understand, specific embodiments are described in detail below in conjunction with the accompanying drawings.
Step 1: an instance consensus transform unit is used to learn instance-level consensus features from the deep layers of the basic feature-extraction sub-network. The instance consensus transform unit uses the surrounding contextual information to generate transform parameters for each spatial position. As shown in Fig. 1(b), for a local feature map X ∈ R^(C×H×W), where C is the number of channels and H × W is the spatial size, the instance consensus transform unit first uses a 1 × 1 convolution to reduce the dimensionality and save computation, obtaining a feature map P ∈ R^(C1×H×W), where C1 is the number of channels, usually smaller than C. After obtaining the feature map P, the instance consensus transform unit employs a lightweight local network to generate the instance consensus transform parameters θ ∈ R^(r²×H×W), where r denotes the size of the local region centered at the current spatial position; the size of the parameters θ changes with the local region size r. The local network is expected to generate the transform parameters from the contextual information around each point. The local network is instantiated with two convolutional layers whose filter sizes are r × r and 1 × 1, respectively: the first (r × r) convolutional layer captures the surrounding contextual information, the resulting feature map is fed to the second (1 × 1) convolutional layer, and the output of the latter is the instance consensus transform parameters. A softmax activation function is then used to obtain the normalized transform parameters θ, which are reshaped to θ′ ∈ R^(N×r²), where N = H × W. Meanwhile, an unfold (expansion) operation is applied to the feature map P to extract the features of the sliding-window blocks, which are reshaped to a feature map P′ ∈ R^(N×C1×r²). Defining a function f(x, y) as the element-wise multiplication of tensors x and y followed by summation over the last dimension, the new feature map Q ∈ R^(N×C1) can be computed as:
Q = f(θ′, P′).
Next, Q is reshaped to obtain the instance-level consensus feature map Y ∈ R^(C1×H×W). In fact, the feature vector Y_ij at any position (i, j) in Y is the weighted sum of the neighbouring features P_hw on the feature map P with the corresponding instance-level consensus transform parameters θ_ij(h, w), where N(i, j) denotes the r × r rectangular region centered at (i, j). Therefore, the transform at each position (i, j) can be formalized as:
Y_ij = Σ_{(h,w) ∈ N(i,j)} θ_ij(h, w) · P_hw,
where i ∈ [1, H] and j ∈ [1, W].
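For concreteness, the following toy snippet (our own illustration, not part of the patent) shows what the unfold/expansion operation does on a small map: each of the N = H × W positions receives the flattened features of its r × r neighbourhood.

```python
import torch
import torch.nn.functional as F

p = torch.arange(16.0).view(1, 1, 4, 4)           # a 4 x 4 single-channel map
blocks = F.unfold(p, kernel_size=3, padding=1)    # every position -> its flattened 3 x 3 neighbourhood
print(blocks.shape)                               # torch.Size([1, 9, 16]); columns are the N = 16 positions
print(blocks[0, :, 5])                            # 3 x 3 neighbourhood around position (1, 1)
```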
The scene segmentation method further includes:
Step 2: the learned instance-level consensus features are further subjected to a category-level consensus transform. Step 2 includes:
A category consensus transform unit is used to learn category-level consensus features. The structure of the category consensus transform unit is shown in Fig. 1(c): a global network is used to generate the category-level consensus transform parameters φ ∈ R^(N×H×W), where N = H × W. Ideally, this global network should be able to "see" the entire input feature map. A natural solution would be to use a fully connected layer, a global convolution, or a stack of large-kernel convolutions, but these approaches are not very effective because they introduce a large number of parameters and a large memory overhead.
Instead, the category consensus transform unit introduces a recurrent neural network to model region-level dependencies: the global network (GN) is instantiated with two bidirectional long short-term memory networks (BiLSTM) and one 1 × 1 convolutional layer. The first BiLSTM scans the feature map in a bottom-up and top-down manner, as shown in Fig. 2, taking the features of one row as the input at each time step and updating its hidden state. A classic LSTM unit comprises an input gate i_t, a forget gate f_t, an output gate o_t, an output state H_t, and a memory cell state C_t. The scanning rule can be formalized as the standard LSTM recurrence:
i_t = σ(W_i x_t + U_i H_{t-1} + b_i),
f_t = σ(W_f x_t + U_f H_{t-1} + b_f),
o_t = σ(W_o x_t + U_o H_{t-1} + b_o),
C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(W_c x_t + U_c H_{t-1} + b_c),
H_t = o_t ⊙ tanh(C_t),
where x_t is the input at time step t, σ is the sigmoid function and ⊙ denotes element-wise multiplication. Accordingly, the bidirectional LSTM applies this recurrence in both scan directions, producing a forward state H_t^fw and a backward state H_t^bw at each step.
After the bidirectional scan, the hidden states H^fw and H^bw of the two directions are concatenated to obtain a mixed feature map H1. In a similar way, the second BiLSTM scans the feature map H1 in the horizontal direction, taking a column-wise slice of the feature map as the input at each time step and updating its hidden state. The forward and backward states are then concatenated into a mixed feature map H2, which serves as a representation of the global interaction between each spatial position and all other positions: each response on H2 is the joint activation of that position and the whole image. This global interaction information is then used as the input of the 1 × 1 convolutional layer, which outputs the category-level consensus transform parameters. Finally, a Softmax activation function is used to obtain the normalized transform parameters φ. Defining the function g(x, y) as the tensor multiplication between tensors x and y, the new feature map F can be generated as follows:
F = g(φ, E),
where E represents a feature map; specifically, as shown in Fig. 1, X is obtained for the instance-level consensus features by the Res4 network formed by serially connecting multiple residual units, and E is obtained from X after convolution processing, expansion (unfold) processing and shape transformation. Next, F is reshaped to obtain the feature map of the category-level consensus features. In fact, the feature vector F_ij at any position (i, j) is calculated as follows:
F_ij = Σ_{h=1..H} Σ_{w=1..W} φ_ij(h, w) · E_hw,
where i ∈ [1, H], j ∈ [1, W], φ_ij(h, w) denotes the entry of the flattened parameter vector φ_ij corresponding to position (h, w), and E_hw is the feature at position (h, w) on the feature map E.
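The following illustrative usage ties the ICTUnit and CCTUnit sketches above together; the tensor sizes are assumptions chosen only to make the example run, not the patent's training configuration.

```python
import torch

x = torch.randn(1, 2048, 32, 32)                         # stand-in for the residual backbone's output
ict = ICTUnit(in_channels=2048, mid_channels=256, r=7)   # instance consensus transform
cct = CCTUnit(channels=256, spatial_size=(32, 32))       # category consensus transform
feat = cct(ict(x))                                       # category-level consensus features
print(feat.shape)                                        # torch.Size([1, 256, 32, 32])
```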
Experiments:
Dataset introduction. The Cityscapes dataset contains street scenes from 50 different cities. It is divided into three subsets: a training set of 2,975 images, a validation set of 500 images, and a test set of 1,525 images. The dataset provides high-quality pixel-level annotations for 19 classes. Performance is measured as the mean intersection-over-union (mIoU) over all classes. The PASCAL-Context dataset contains 4,998 training images and 5,105 validation images, and provides detailed semantic annotations for entire scenes. The proposed model is evaluated on the 59 most common classes plus 1 background class.
Effectiveness verification:
Table 1: validation of the proposed instance consensus transform unit (ICT) and category consensus transform unit (CCT) on the Cityscapes validation set.
As can be seen from Table 1, compared with the baseline model, using the instance consensus transform unit achieves 78.8 mean IoU, a 3.8% accuracy improvement over the baseline. Similarly, using the category consensus transform unit brings a 2.6% performance gain over the baseline. When the instance consensus transform unit and the category consensus transform unit are integrated, the segmentation accuracy is further improved to 79.9%, a 5.0% gain over the baseline (79.9 vs. 74.9). These experiments show that integrating the two units brings a large improvement in segmentation accuracy.
In Table 2, TFA_S denotes the TFA configuration with the smaller factors (r1, r2) = {(6, 3), (10, 7), (20, 15)}.
Comparison with other methods:
In this part, the proposed method is compared with other state-of-the-art methods.
Experimental results on Cityscapes:
Table 2: comparison with other state-of-the-art methods on the Cityscapes test set.
Experimental results on PASCAL-Context:
Table 3: comparison with other methods on the PASCAL-Context dataset.
From Tables 2 and 3 it can be seen that the designed system achieves excellent performance on both authoritative semantic segmentation datasets, which further demonstrates the effectiveness of the invention.
The following is a system embodiment corresponding to the above method embodiment; this embodiment can be implemented in cooperation with the above embodiment. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the above embodiment.
The invention also provides a scene segmentation system based on consensus features, comprising:
Module 1: using a residual network as the feature extractor, extracting a local feature map from the original image, and applying an instance consensus transform to this local feature map to obtain instance-level consensus features;
Module 2: applying a category consensus transform to the instance-level consensus features to obtain category-level consensus features;
Module 3: taking the category-level consensus features as input and outputting the scene segmentation result of the original image through a scene segmentation sub-network.
In the scene segmentation system based on consensus features, the instance consensus transform comprises:
For a local feature map X ∈ R^(C×H×W), where C is the number of channels and H × W is the spatial size, first reducing the dimensionality of the local feature map with a 1 × 1 convolution to obtain a feature map P ∈ R^(C1×H×W), where C1 is the number of channels of this feature map; after obtaining the feature map P, generating the instance consensus transform parameters θ ∈ R^(r²×H×W), where r denotes the size of the local region centered at the current spatial position, and the size of the parameters θ changes with the local region size r;
Reshaping θ to obtain θ′ ∈ R^(N×r²), where N = H × W; applying an unfold (expansion) operation to the feature map P to extract the features of the r × r sliding-window block around each position and reshaping the result to obtain a feature map P′ ∈ R^(N×C1×r²); computing a new feature map Q = f(θ′, P′), where the function f(x, y) denotes the element-wise multiplication of tensors x and y followed by summation over the last dimension;
Reshaping the feature map Q to obtain the instance-level consensus features.
The scene segmentation system based on consensus features further comprises: constructing a category consensus transform unit and using the category consensus transform unit to apply the category consensus transform to the instance-level consensus features:
The category consensus transform unit comprises two bidirectional long short-term memory (BiLSTM) networks and one convolutional layer, which instantiate a global network;
The first BiLSTM scans the feature map of the instance-level consensus features in a bottom-up and top-down manner; the BiLSTM is specifically:
the scanning rule is the standard LSTM recurrence, and H_t is the output state at time step t;
Concatenating the hidden states of the two scan directions to obtain a mixed feature map H1, and using the second BiLSTM to scan this feature map H1 in the horizontal direction; concatenating the forward and backward states to obtain a mixed feature map H2; inputting this feature map H2 to the convolutional layer to obtain the category-level consensus transform parameters; normalizing these parameters with an activation function to obtain the transform parameters φ;
A new feature map F can then be generated as F = g(φ, E),
where the function g(x, y) is the tensor multiplication between tensors x and y and E is the feature map; the new feature map F is reshaped to obtain the feature map of the category-level consensus features.

Claims (10)

1. A scene segmentation method based on consensus features, characterized by comprising:
Step 1: using a residual network as the feature extractor, extracting a local feature map from the original image, and applying an instance consensus transform to the local feature map to obtain instance-level consensus features;
Step 2: applying a category consensus transform to the instance-level consensus features to obtain category-level consensus features;
Step 3: taking the category-level consensus features as input and outputting the scene segmentation result of the original image through a scene segmentation sub-network.
2. The scene segmentation method based on consensus features according to claim 1, wherein the instance consensus transform comprises:
For a local feature map X ∈ R^(C×H×W), where C is the number of channels and H × W is the spatial size, first reducing the dimensionality of the local feature map with a 1 × 1 convolution to obtain a feature map P ∈ R^(C1×H×W), where C1 is the number of channels of this feature map; after obtaining the feature map P, generating the instance consensus transform parameters θ ∈ R^(r²×H×W), where r denotes the size of the local region centered at the current spatial position, and the size of the parameters θ changes with the local region size r;
Reshaping θ to obtain θ′ ∈ R^(N×r²), where N = H × W; applying an unfold (expansion) operation to the feature map P to extract the features of the r × r sliding-window block around each position and reshaping the result to obtain a feature map P′ ∈ R^(N×C1×r²); computing a new feature map Q = f(θ′, P′), where the function f(x, y) denotes the element-wise multiplication of tensors x and y followed by summation over the last dimension;
Reshaping the feature map Q to obtain the instance-level consensus features.
3. The scene segmentation method based on consensus features according to claim 2, further comprising: constructing a category consensus transform unit and using the category consensus transform unit to apply the category consensus transform to the instance-level consensus features:
The category consensus transform unit comprises two bidirectional long short-term memory (BiLSTM) networks and one convolutional layer, which instantiate a global network;
The first BiLSTM scans the feature map of the instance-level consensus features in a bottom-up and top-down manner; the BiLSTM is specifically:
the scanning rule is the standard LSTM recurrence, and H_t is the output state at time step t;
Concatenating the hidden states of the two scan directions to obtain a mixed feature map H1, and using the second BiLSTM to scan this feature map H1 in the horizontal direction; concatenating the forward and backward states to obtain a mixed feature map H2; inputting this feature map H2 to the convolutional layer to obtain the category-level consensus transform parameters; normalizing these parameters with an activation function to obtain the transform parameters φ;
A new feature map F can then be generated as F = g(φ, E),
where the function g(x, y) is the tensor multiplication between tensors x and y and E is the feature map; the new feature map F is reshaped to obtain the feature map of the category-level consensus features.
4. The scene segmentation method based on consensus features according to claim 3, wherein the activation function is the Softmax function.
5. The scene segmentation method based on consensus features according to claim 3, wherein the convolutional layer uses a 1 × 1 convolution kernel.
6. A scene segmentation system based on consensus features, characterized by comprising:
Module 1: using a residual network as the feature extractor, extracting a local feature map from the original image, and applying an instance consensus transform to the local feature map to obtain instance-level consensus features;
Module 2: applying a category consensus transform to the instance-level consensus features to obtain category-level consensus features;
Module 3: taking the category-level consensus features as input and outputting the scene segmentation result of the original image through a scene segmentation sub-network.
7. The scene segmentation system based on consensus features according to claim 6, wherein the instance consensus transform comprises:
For a local feature map X ∈ R^(C×H×W), where C is the number of channels and H × W is the spatial size, first reducing the dimensionality of the local feature map with a 1 × 1 convolution to obtain a feature map P ∈ R^(C1×H×W), where C1 is the number of channels of this feature map; after obtaining the feature map P, generating the instance consensus transform parameters θ ∈ R^(r²×H×W), where r denotes the size of the local region centered at the current spatial position, and the size of the parameters θ changes with the local region size r;
Reshaping θ to obtain θ′ ∈ R^(N×r²), where N = H × W; applying an unfold (expansion) operation to the feature map P to extract the features of the r × r sliding-window block around each position and reshaping the result to obtain a feature map P′ ∈ R^(N×C1×r²); computing a new feature map Q = f(θ′, P′), where the function f(x, y) denotes the element-wise multiplication of tensors x and y followed by summation over the last dimension;
Reshaping the feature map Q to obtain the instance-level consensus features.
8. The scene segmentation system based on consensus features according to claim 7, further comprising: constructing a category consensus transform unit and using the category consensus transform unit to apply the category consensus transform to the instance-level consensus features:
The category consensus transform unit comprises two bidirectional long short-term memory (BiLSTM) networks and one convolutional layer, which instantiate a global network;
The first BiLSTM scans the feature map of the instance-level consensus features in a bottom-up and top-down manner; the BiLSTM is specifically:
the scanning rule is the standard LSTM recurrence, and H_t is the output state at time step t;
Concatenating the hidden states of the two scan directions to obtain a mixed feature map H1, and using the second BiLSTM to scan this feature map H1 in the horizontal direction; concatenating the forward and backward states to obtain a mixed feature map H2; inputting this feature map H2 to the convolutional layer to obtain the category-level consensus transform parameters; normalizing these parameters with an activation function to obtain the transform parameters φ;
A new feature map F can then be generated as F = g(φ, E),
where the function g(x, y) is the tensor multiplication between tensors x and y and E is the feature map; the new feature map F is reshaped to obtain the feature map of the category-level consensus features.
9. The scene segmentation system based on consensus features according to claim 8, wherein the activation function is the Softmax function.
10. The scene segmentation system based on consensus features according to claim 8, wherein the convolutional layer uses a 1 × 1 convolution kernel.
CN201910604601.6A 2019-07-05 2019-07-05 Scene segmentation method and system based on consistency characteristics Active CN110472493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910604601.6A CN110472493B (en) 2019-07-05 2019-07-05 Scene segmentation method and system based on consistency characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910604601.6A CN110472493B (en) 2019-07-05 2019-07-05 Scene segmentation method and system based on consistency characteristics

Publications (2)

Publication Number Publication Date
CN110472493A true CN110472493A (en) 2019-11-19
CN110472493B CN110472493B (en) 2022-01-21

Family

ID=68506772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910604601.6A Active CN110472493B (en) 2019-07-05 2019-07-05 Scene segmentation method and system based on consistency characteristics

Country Status (1)

Country Link
CN (1) CN110472493B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170046579A1 (en) * 2015-08-12 2017-02-16 Chiman KWAN Method and system for ugv guidance and targeting
CN107316313A (en) * 2016-04-15 2017-11-03 株式会社理光 Scene Segmentation and equipment
CN108537157A (en) * 2018-03-30 2018-09-14 特斯联(北京)科技有限公司 A kind of video scene judgment method and device based on artificial intelligence classification realization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170046579A1 (en) * 2015-08-12 2017-02-16 Chiman KWAN Method and system for ugv guidance and targeting
CN107316313A (en) * 2016-04-15 2017-11-03 株式会社理光 Scene Segmentation and equipment
CN108537157A (en) * 2018-03-30 2018-09-14 特斯联(北京)科技有限公司 A kind of video scene judgment method and device based on artificial intelligence classification realization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
THANH MINH NGUYEN ET AL.: "A Consensus Model for Motion Segmentation in Dynamic Scenes", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》 *
翁健 (WENG Jian): "基于全卷积神经网络的全向场景分割研究与算法实现" [Research and Algorithm Implementation of Omnidirectional Scene Segmentation Based on Fully Convolutional Neural Networks], 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *

Also Published As

Publication number Publication date
CN110472493B (en) 2022-01-21

Similar Documents

Publication Publication Date Title
Li et al. Semantic-aware grad-gan for virtual-to-real urban scene adaption
CN110363201B (en) Weak supervision semantic segmentation method and system based on collaborative learning
CN110111236B (en) Multi-target sketch image generation method based on progressive confrontation generation network
CN107844795B (en) Convolutional neural networks feature extracting method based on principal component analysis
CN106981080A (en) Night unmanned vehicle scene depth method of estimation based on infrared image and radar data
CN110188635A (en) A kind of plant pest recognition methods based on attention mechanism and multi-level convolution feature
CN107993238A (en) A kind of head-and-shoulder area image partition method and device based on attention model
CN111738908A (en) Scene conversion method and system for generating countermeasure network by combining instance segmentation and circulation
CN111667005B (en) Human interactive system adopting RGBD visual sensing
CN107944459A (en) A kind of RGB D object identification methods
CN110689000A (en) Vehicle license plate identification method based on vehicle license plate sample in complex environment
CN112884758B (en) Defect insulator sample generation method and system based on style migration method
CN113724354B (en) Gray image coloring method based on reference picture color style
CN109657538A (en) Scene Segmentation and system based on contextual information guidance
CN110659702A (en) Calligraphy copybook evaluation system and method based on generative confrontation network model
CN113449878B (en) Data distributed incremental learning method, system, equipment and storage medium
DE102019112595A1 (en) GUIDED HALLUCATION FOR MISSING PICTURE CONTENT USING A NEURONAL NETWORK
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
CN108810319A (en) Image processing apparatus and image processing method
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN110472493A (en) Scene Segmentation and system based on consistency feature
DE102018128592A1 (en) Generating an image using a map representing different classes of pixels
CN112464924A (en) Method and device for constructing training set
Sanders Neural networks, AI, phone-based VR, machine learning, computer vision and the CUNAT automated translation app–not your father’s archaeological toolkit
Orhei Urban landmark detection using computer vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant