CN109461157A - Image semantic segmentation method based on multi-level feature fusion and Gaussian conditional random field - Google Patents

Image semantic segmentation method based on multi-level feature fusion and Gaussian conditional random field

Info

Publication number
CN109461157A
Authority
CN
China
Prior art keywords
image
random field
convolution
layer
conditional random field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811218436.2A
Other languages
Chinese (zh)
Other versions
CN109461157B (en)
Inventor
周鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201811218436.2A priority Critical patent/CN109461157B/en
Publication of CN109461157A publication Critical patent/CN109461157A/en
Application granted granted Critical
Publication of CN109461157B publication Critical patent/CN109461157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20016 - Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

The invention discloses an image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field, comprising: 1) constructing an image pyramid; 2) keeping the feature-map resolution unchanged using dilated convolution; 3) a layer-by-layer multi-level feature fusion and refinement architecture; 4) upsampling with bilinear interpolation; 5) defining the loss function; 6) optimizing the output with a Gaussian conditional random field. By building an image pyramid, the invention realizes a fully convolutional architecture in which multi-level features are fused layer by layer. A top-down refinement framework replaces the previously popular parallel pooling module, so that features of different scales are fused progressively as they are obtained. This guarantees that the features of adjacent pyramid levels are fused first and captures contextual information to the greatest extent. The front-end output is further optimized with a Gaussian conditional random field, which captures more spatial detail and makes the object boundaries in the segmentation maps more accurate, so that the output of the overall architecture achieves an optimal semantic segmentation result.

Description

Image semantic segmentation method based on multi-level feature fusion and Gaussian conditional random field
Technical field
The present invention relates to an image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field.
Background technique
Early image segmentation only divided the content of an image roughly into several regions. As research progressed, such coarse segmentation could no longer meet the needs of various applications, and semantic segmentation was proposed. The semantics of an image refer to semantic information such as the category of the objects or entities contained in the image or an image region, and segmentation guided by these semantics is called semantic segmentation. Image semantic segmentation can separate the foreground from the background in a single frame and identify the category of each foreground object, which is equivalent to assigning a semantic label to every pixel. Semantic segmentation is therefore a major upgrade of image segmentation in both accuracy and granularity.
The goal of image semantic segmentation is to classify each pixel of an image into one of a set of predefined class labels, so that regions can not only be delineated but also annotated with their content. In practice, more and more vision applications need to infer relevant knowledge or semantics from images, i.e., to go from the concrete to the abstract. These applications include autonomous driving, human-computer interaction, computational photography, image search engines and augmented reality, and the practical demand behind them is for accurate and efficient segmentation techniques.
The current state-of-the-art image semantic segmentation models combine deep learning with probabilistic graphical models. The model front end is based on a deep convolutional neural network originally designed for image classification, with the final fully connected layers replaced by convolutional layers; such a network is therefore called a fully convolutional network and can perform semantic segmentation on images of arbitrary size. The model back end uses a fully connected conditional random field to optimize the coarse semantic features extracted by the convolutional network and to strengthen the model's ability to capture spatial detail. Broadly speaking, the fully convolutional network is more of a technique whose performance keeps improving with the underlying base network. Deep learning plus probabilistic graphical models is a general trend: deep learning performs feature extraction, while probabilistic graphical models can explain the essential connections between things well from a mathematical point of view.
Specifically, image semantic segmentation methods based on fully convolutional networks use an off-the-shelf convolutional neural network as one of their modules to generate hierarchical features and convert existing well-known classification models into fully convolutional ones: the fully connected layers are replaced by convolutional layers, and the output is a spatial map rather than classification scores. These maps are upsampled by small-stride convolution (deconvolution) to produce dense pixel-level labels. The pioneering work of Long et al. is regarded as a milestone, because it showed how convolutional neural networks can be trained end to end for semantic segmentation and how to efficiently learn dense pixel-level label predictions from inputs of arbitrary size. Vijay et al. proposed SegNet, which introduces more skip connections and reuses max-pooling indices instead of the encoder features of the fully convolutional network, so SegNet is more memory-efficient than the fully convolutional network. Fisher et al. proposed dilated convolution as the convolutional layer for pixel-level prediction, which lets the receptive field grow exponentially without reducing the spatial dimensions. Zhao et al. argued that the global scene provides category-distribution information for image semantic segmentation and proposed a pyramid pooling module that obtains this information with pooling layers of larger kernels.
Methods based on deep convolutional neural networks seem inherently unable to achieve both good classification performance and good localization accuracy. The strong invariance of convolutional networks to spatial transformations of the image allows them to accurately predict the presence and rough location of objects, but the segmentation results are often poorly coordinated: some small regions may be labeled incorrectly and inconsistently with their surrounding pixels, so object boundaries cannot be rendered accurately. Likewise, the cross-entropy loss is not the optimal loss function for semantic segmentation, because the final loss of an image is the sum of the per-pixel losses and cross-entropy cannot guarantee continuity between pixels. The common way to refine the output of a segmentation architecture and strengthen its ability to capture fine-grained information is to introduce a conditional random field as a back-end processing module, i.e., to apply a fully connected conditional random field model as an independent post-processing step that refines the segmentation result. Although fully connected models are usually inefficient, this model can be approximated with probabilistic inference and can therefore be made relatively efficient. The conditional random field exploits the similarity between pixels in the original image.
Modern architectures for image semantic segmentation mainly suffer from two problems:
1. Front-end feature extraction loses contextual information. In current semantic segmentation based on fully convolutional networks, the output image has the same resolution as the input. However, successive pooling and striding operations reduce the resolution of the resulting feature maps, which become smaller and smaller, so the shrunken feature maps must be upsampled back to the original size by deconvolution to restore the spatial resolution. This process inevitably loses information that cannot be recovered, and frequent deconvolution also costs additional memory and time. Image semantic segmentation therefore needs to integrate information at multiple scales and to balance local and global information. On the one hand, fine-grained, i.e., local, information is crucial for the correctness of pixel-level labeling; on the other hand, the global contextual information of the whole image is also important for resolving local ambiguities. Standard convolutional network architectures are not good at handling this trade-off between local and global information: pooling layers give the network a certain degree of spatial invariance and maintain computational efficiency, but they lose global contextual information, while convolutional networks without pooling layers are also limited, since the receptive field of their neurons can only grow linearly with the number of layers.
2. Back-end model optimization is hard to converge. The back end introduces a conditional random field to optimize the output of the segmentation architecture and strengthen its ability to capture detail. However, both the common Gibbs conditional random field and the Markov random field have complicated formulations that are not straightforward to solve, which often leads to slow convergence or even failure to converge during model training.
Summary of the invention
The technical problem to be solved by the present invention is to provide an image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field.
The technical scheme of the present invention is an image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field, comprising the following steps:
1) Construct an image pyramid
A four-level image pyramid is built for a single frame. The levels are numbered from bottom to top, with the i-th level denoted G_i. To generate level i+1 of the image pyramid, a Gaussian kernel k_Gaussian is convolved with G_i and every even-numbered row and column is then removed; the Gaussian kernel k_Gaussian is expressed as formula (1):
This process is then iterated on the input original image G_0 to generate the whole image pyramid;
2) Keep the feature-map resolution unchanged using dilated convolution
Viewing an image as a 2D signal from the filtering point of view, a filter w applied to an input feature map x by dilated convolution produces the output y at position i:
where the dilation rate r corresponds to the stride with which the input signal is sampled; this is equivalent to convolving the input x with an upsampled filter obtained by inserting r-1 zeros between two consecutive filter taps along each spatial dimension, and standard convolution is the special case with dilation rate r=1;
3) Layer-by-layer multi-level feature fusion and refinement architecture
An image pyramid containing images of different resolutions is generated from the original image. Starting from the lowest-resolution image at the top level, several fully convolutional operations produce a feature map containing local information, which is then upsampled to the same resolution as the adjacent level and stacked with the initial feature map of that level to take part in its subsequent fully convolutional operations, i.e., the fusion yields new local features. Refinement proceeds level by level from top to bottom in this way, and at the last level a 1 × 1 convolution and fine-tuning are applied to the feature map to obtain the final segmentation map;
4) Upsample with bilinear interpolation
The low-resolution feature maps containing local features are upsampled by bilinear interpolation so that they can be fused into the higher-resolution feature maps containing relatively global features;
5) Define the loss function
A minimized cross-entropy loss is used to compute the sum of the cross-entropy terms at every spatial position of the convolutional network output map, or in other words the distance between the predicted probability distribution of each pixel and its true probability distribution:
where L is the cross-entropy loss for mislabeled classes and R is a regularization term; the function L is usually decomposed into the sum of the losses of individual pixels:
The network model is then fine-tuned on the segmentation task of the PASCAL VOC 2012 dataset by running stochastic gradient descent on the cross-entropy loss;
6) Optimize the output with a Gaussian conditional random field
The back end is optimized with a Gaussian conditional random field whose energy function E(x) is:
When A + λI is a symmetric positive definite matrix, minimizing E(x) is equivalent to solving the equation:
(A + λI)x = B    (12)
Further, in the present invention, in step 2), dilated convolution allows the size of the receptive field to be adapted by changing the dilation rate r; the size of the receptive field can be expressed as:
F = (k_size + 1) × r_rate − 1    (3)
where k_size is the actual size of the convolution kernel and r_rate is the dilation rate.
Further, in the present invention, in step 4), bilinear interpolation computes the target pixel value by interpolating from four points of the original image. Given the values of the function f at the four points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2), the value of the unknown function f at the point P(x, y) can be calculated, specifically comprising:
In the first step, linear interpolation in the x direction gives:
where R1 = (x, y1) and R2 = (x, y2);
In the second step, linear interpolation in the y direction through R1 and R2 computed in the first step yields the value at P:
where P = (x, y).
Compared with the prior art, the present invention has the following advantages:
1) The invention proposes a layer-by-layer multi-level feature fusion and refinement strategy. An image pyramid performs multi-scale pixel sampling on the original image to generate several images of different resolutions. The lowest-resolution image at the top level is fed into the network front end for coarse extraction of semantic features, the result is upsampled to match the slightly higher-resolution image of the next level, and it then enters the front end of the next level together with the image of the new resolution. This layer-by-layer fusion, in which the preceding level learns hierarchical abstractions and the following level captures high-precision information, effectively improves the model's ability to capture contextual information.
2) In the present invention, the back end uses the Gaussian conditional random field, a member of the family of fully connected conditional random fields that clearly enjoys advantages such as a global minimum of its quadratic energy; this effectively alleviates problems such as slow or even failed convergence during model training.
Description of the drawings
The invention will be further described with reference to the accompanying drawings and embodiments:
Fig. 1 is the basic framework diagram of the method of the present invention;
Fig. 2 is a structural schematic diagram of the image pyramid in the present invention;
Fig. 3(a) is a schematic diagram of 3 × 3 dilated convolution with dilation rate 1 in the present invention;
Fig. 3(b) is a schematic diagram of 3 × 3 dilated convolution with dilation rate 2 in the present invention;
Fig. 3(c) is a schematic diagram of 3 × 3 dilated convolution with dilation rate 4 in the present invention;
Fig. 4 is a schematic diagram of the layer-by-layer multi-level feature fusion architecture in the present invention;
Fig. 5 is a schematic diagram of bilinear interpolation in the present invention;
Fig. 6 is a schematic diagram of the fully connected conditional random field in the present invention;
Fig. 7(a) is a schematic diagram of the iterative process of the regularization penalty term in the present invention;
Fig. 7(b) is a schematic diagram of the regression loss of the segmentation model in the present invention;
Fig. 7(c) is a schematic diagram of the sum of the regularization loss and the regression loss in the present invention;
Fig. 8 is a comparison of scores and speeds on PASCAL VOC 2012 in the present invention;
Fig. 9 is a visual comparison on PASCAL VOC 2012 data in the present invention.
Specific embodiment
Embodiment:
A specific embodiment of the image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field of the present invention is described with reference to the drawings. The basic framework of the method is shown in Fig. 1. By building an image pyramid, the method realizes a fully convolutional architecture in which multi-level features are fused layer by layer, and is intended to replace the previously popular parallel pooling module with a top-down refinement framework: features of different scales are fused as they are obtained, which guarantees that the features of adjacent pyramid levels are fused first and captures contextual information to the greatest extent. In addition, the front-end output is further optimized with a Gaussian conditional random field, which captures more spatial detail, so that the object boundaries in the segmentation maps are more accurate and the output of the overall architecture achieves an optimal semantic segmentation result.
The present invention computes the initial feature map of every level of the image pyramid, concatenates it with the upsampled feature map of the level above, and feeds the result into the remaining fully convolutional operations of that level; the final semantic segmentation map is obtained by the last fully convolutional operation at the bottom of the pyramid. Specifically, the method is divided into the following steps: the front end builds the image pyramid, keeps the feature-map resolution unchanged with dilated convolution, fuses and refines multi-level features layer by layer, upsamples with bilinear interpolation, and defines the loss function; the back end optimizes the output with a Gaussian conditional random field.
1. The front-end fully convolutional network extracts coarse semantic features
1) Construct an image pyramid
The SIFT algorithm obtains feature information at different scales by building an image pyramid, and this method likewise builds a four-level image pyramid for a single frame. Imagine the pyramid as a stack of layers: the higher the layer, the smaller its size.
With reference to Fig. 2, the levels are numbered from bottom to top and the i-th level is denoted G_i, so the resolution of level i+1 (denoted G_{i+1}) is lower than that of level i (denoted G_i). To generate level i+1 of the image pyramid, we convolve G_i with a Gaussian kernel k_Gaussian and then remove every even-numbered row and column; the Gaussian kernel k_Gaussian is expressed as formula (1):
It is easy to see that the generated image is one quarter the size of its predecessor. Iterating this process on the input original image G_0 finally generates the whole image pyramid.
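For illustration, the pyramid construction described above can be sketched in a few lines of Python; this is a minimal example that assumes the standard 5 × 5 Gaussian smoothing kernel used by OpenCV's pyrDown (formula (1) itself is not reproduced on this page) and the four levels used by the method.

```python
import cv2
import numpy as np

def build_image_pyramid(image: np.ndarray, levels: int = 4) -> list:
    """Build a Gaussian image pyramid G_0 ... G_{levels-1}.

    Each level is produced by smoothing the previous one with a Gaussian
    kernel and dropping every even row and column, so every level is a
    quarter of the size (half the width and height) of its predecessor.
    """
    pyramid = [image]                        # G_0 is the original image
    for _ in range(levels - 1):
        # cv2.pyrDown blurs with a 5x5 Gaussian kernel and downsamples by 2
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid

if __name__ == "__main__":
    img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
    for i, level in enumerate(build_image_pyramid(img)):
        print(f"G_{i}: {level.shape}")       # 480x640, 240x320, 120x160, 60x80
```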
2) Keep the feature-map resolution unchanged using dilated convolution
Dilated convolution, a method originally used for wavelet transform analysis in signal processing, extracts dense features by removing the downsampling operations of the last few layers of the fully convolutional network and the corresponding upsampling operations. In this way the resolution of the feature maps in a convolutional neural network can be controlled effectively without learning additional parameters. This avoids the problem in standard convolutional networks that the feature maps become smaller and smaller, so that the shrunken feature maps must be upsampled back to the original size by deconvolution to restore the spatial resolution, which inevitably loses information, while frequent deconvolution also requires extra memory and time.
Viewing an image as a 2D signal from the filtering point of view, a filter w applied to an input feature map x by dilated convolution produces the output y at position i:
where the dilation rate r corresponds to the stride with which the input signal is sampled; this is equivalent to convolving the input x with an upsampled filter obtained by inserting r-1 zeros between two consecutive filter taps along each spatial dimension, hence the name dilated (atrous) convolution. Standard convolution is the special case with dilation rate r=1.
Dilated convolution allows us to adapt the size of the receptive field by changing the dilation rate r, as shown in Fig. 3(a), Fig. 3(b) and Fig. 3(c). Fig. 3(a) shows a standard 3 × 3 kernel, whose receptive field is the same as that of ordinary convolution. Fig. 3(b) shows 3 × 3 dilated convolution with rate = 2: the actual kernel size is still 3 × 3, but the receptive field grows to 7 × 7. Fig. 3(c) shows 3 × 3 dilated convolution with r = 4: the actual kernel size is again 3 × 3, while the receptive field grows to 15 × 15. The size of the receptive field can be expressed as:
F = (k_size + 1) × r_rate − 1    (3)
where k_size is the actual size of the convolution kernel and r_rate is the dilation rate. It can be seen that applying k_size × k_size kernels with different dilation rates to the same input yields features with different receptive fields.
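To make the dilated-convolution output and the receptive-field formula (3) concrete, the sketch below implements a 1-D dilated convolution in NumPy; the 1-D form and the example kernel are illustrative assumptions, while the method itself applies the 2-D analogue inside the fully convolutional network.

```python
import numpy as np

def dilated_conv1d(x: np.ndarray, w: np.ndarray, rate: int) -> np.ndarray:
    """y[i] = sum_k x[i + rate*k] * w[k], 1-D dilated convolution with 'valid' padding."""
    k = len(w)
    span = rate * (k - 1) + 1                 # footprint of the dilated kernel on the input
    out_len = len(x) - span + 1
    y = np.empty(out_len)
    for i in range(out_len):
        y[i] = sum(x[i + rate * j] * w[j] for j in range(k))
    return y

def receptive_field(k_size: int, r_rate: int) -> int:
    """Receptive-field size as in formula (3): F = (k_size + 1) * r_rate - 1."""
    return (k_size + 1) * r_rate - 1

if __name__ == "__main__":
    x = np.arange(20, dtype=float)
    w = np.array([1.0, 2.0, 1.0])
    print(dilated_conv1d(x, w, rate=2))       # samples x with stride 2 under the kernel
    for r in (1, 2, 4):
        print(f"3x3 kernel, rate {r}: receptive field {receptive_field(3, r)}")  # 3, 7, 15
```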
3) Layer-by-layer multi-level feature fusion and refinement architecture
The currently most prominent parallel pooling idea fuses local and global information at all levels well, but for the final aggregation it simply upsamples everything uniformly to the same resolution to form the final feature map. Our method argues that fusing feature maps of neighbouring sizes in advance, as relatively local and relatively global information, can better recover the information lost due to the reduction of resolution, and therefore proposes a new feature fusion strategy.
With reference to Fig. 4, an image pyramid containing images of different resolutions is generated from the original image. Starting from the lowest-resolution image at the top level, several fully convolutional operations produce a feature map containing local information, which is then upsampled to the same resolution as the adjacent level and stacked with the initial feature map of that level to take part in its subsequent fully convolutional operations, i.e., the fusion yields new local features, and refinement proceeds level by level from top to bottom. This stepwise procedure obtains semantic information that is as strong as possible while preserving good detail, and incorporates the context of different regions more efficiently to obtain global contextual information. At the last level, a 1 × 1 convolution and fine-tuning are applied to the feature map to obtain the final segmentation map, as sketched in the code below.
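The following TensorFlow 1.x sketch of the top-down fusion loop is given for orientation only; the number of convolutions per level, the channel widths and the dilation rates are placeholder assumptions, not the patent's exact configuration.

```python
import tensorflow as tf  # TensorFlow 1.x style, matching the reported environment

def conv_block(x, filters, rate=1, name=None):
    """A fully convolutional block; dilated 3x3 convolutions keep the spatial size."""
    return tf.layers.conv2d(x, filters, 3, padding="same",
                            dilation_rate=rate, activation=tf.nn.relu, name=name)

def fuse_pyramid(pyramid_levels, num_classes=21):
    """Top-down fusion: pyramid_levels[0] is the lowest-resolution (top) image,
    pyramid_levels[-1] the original resolution; returns per-pixel class logits."""
    fused = None
    for i, level in enumerate(pyramid_levels):
        feat = conv_block(level, 64, rate=2, name=f"init_conv_{i}")   # initial features of this level
        if fused is not None:
            # upsample the coarser result to this level's resolution and concatenate
            size = tf.shape(feat)[1:3]
            upsampled = tf.image.resize_bilinear(fused, size)
            feat = tf.concat([feat, upsampled], axis=-1)
        # remaining fully convolutional operations of this level
        fused = conv_block(feat, 64, rate=2, name=f"refine_conv_{i}")
    # final 1x1 convolution produces the segmentation logits
    return tf.layers.conv2d(fused, num_classes, 1, name="logits")
```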
4) Upsample with bilinear interpolation
Our method restores a feature map to the same size as the adjacent level using bilinear interpolation and fuses features on this basis, which keeps the network architecture simple and clear and reduces the computational complexity of training.
Bilinear interpolation is a fairly common upsampling method in current semantic segmentation. Its advantages are that it requires no learning, runs fast and is easy to implement; only fixed parameter values need to be set.
Specifically, with reference to Fig. 5, suppose the value of an unknown function f at the point P(x, y) is to be computed, and the known conditions are the values of f at the four points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2).
In the first step, linear interpolation in the x direction gives:
where R1 = (x, y1) and R2 = (x, y2);
In the second step, linear interpolation in the y direction through R1 and R2 computed in the first step yields the value at P:
where P = (x, y).
In short, bilinear interpolation computes the target pixel value by interpolating from four points of the original image. Our network model upsamples the low-resolution feature maps containing local features by bilinear interpolation so that they can be fused into the higher-resolution feature maps containing relatively global features. Using bilinear interpolation instead of complicated deconvolution operations for upsampling reduces the complexity of the model.
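The two interpolation steps can be written out directly as below; this is a self-contained sketch of standard bilinear interpolation between the four known points (the patent's numbered formulas are not reproduced here), not the framework routine actually used to upsample feature maps.

```python
def bilinear_interpolate(x, y, x1, y1, x2, y2, f11, f12, f21, f22):
    """Standard bilinear interpolation of f at P(x, y).

    f11 = f(Q11) at (x1, y1), f12 = f(Q12) at (x1, y2),
    f21 = f(Q21) at (x2, y1), f22 = f(Q22) at (x2, y2).
    """
    # Step 1: linear interpolation along x at y1 and y2
    f_r1 = (x2 - x) / (x2 - x1) * f11 + (x - x1) / (x2 - x1) * f21   # R1 = (x, y1)
    f_r2 = (x2 - x) / (x2 - x1) * f12 + (x - x1) / (x2 - x1) * f22   # R2 = (x, y2)
    # Step 2: linear interpolation along y between R1 and R2
    return (y2 - y) / (y2 - y1) * f_r1 + (y - y1) / (y2 - y1) * f_r2

if __name__ == "__main__":
    # interpolate at the centre of a unit square with corner values 0, 1, 2, 3
    print(bilinear_interpolate(0.5, 0.5, 0, 0, 1, 1, 0.0, 1.0, 2.0, 3.0))  # 1.5
```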
5) Define the loss function
Like most semantic segmentation models, our model uses a minimized cross-entropy loss, which computes the sum of the cross-entropy terms at every spatial position of the convolutional network output map, or in other words the distance between the predicted probability distribution of each pixel and its true probability distribution:
where L is the cross-entropy loss for mislabeled classes and R is a regularization term; the function L is usually decomposed into the sum of the losses of individual pixels:
The loss function defined above gives every pixel of the image equal weight by default, so the learning algorithm is biased towards regions that occupy a large share of the image and ignores regions with a small share. We fine-tune the network model on the segmentation task of the PASCAL VOC 2012 dataset by running stochastic gradient descent on the cross-entropy loss.
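A minimal TensorFlow 1.x sketch of this training objective is given below, assuming per-pixel softmax cross-entropy plus an L2 regularization term for R and momentum SGD as in the experiments; the weight-decay value and variable handling are illustrative assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x style

def segmentation_loss(logits, labels, weight_decay=1e-4):
    """Per-pixel cross-entropy averaged over spatial positions plus L2 regularization.

    logits: [batch, H, W, num_classes] network output map
    labels: [batch, H, W] integer class labels
    """
    pixel_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)                  # one loss term per pixel
    cross_entropy = tf.reduce_mean(pixel_losses)       # L: sum/mean over pixels
    reg_terms = [tf.nn.l2_loss(v) for v in tf.trainable_variables()
                 if "bias" not in v.name]
    reg = weight_decay * tf.add_n(reg_terms) if reg_terms else 0.0   # R: regularizer
    return cross_entropy + reg

# training with momentum SGD, matching the reported batch size 14 and momentum 0.9:
# train_op = tf.train.MomentumOptimizer(learning_rate, momentum=0.9).minimize(total_loss)
```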
2. The back-end conditional random field optimizes the output
With reference to Fig. 6, a fully connected conditional random field model, with pairwise terms between every pair of pixels, is used as an independent post-processing step that refines the segmentation result. The model treats each pixel as a node of the field, and no matter how far apart two pixels are, their pairwise relationship is taken into account; this model is therefore also called a dense or fully connected graph. With this model, both short-range and long-range pixel dependencies are evaluated, so the system can take into account the detailed information needed in the segmentation process.
Each pixel i has a class label x_i and a corresponding observation y_i. Each pixel thus acts as a node, the relationships between pairs of pixels act as edges, and together they constitute a conditional random field. We then infer the class label x_i of pixel i from the observed variable y_i.
The energy function E(x) of the fully connected conditional random field is:
where the unary potential Σ_i Ψ_u(x_i) comes from the output of the front-end fully convolutional network, and the pairwise potential is as follows:
The pairwise potential describes the relationship between pixels: it encourages similar pixels to take the same label and pixels that differ greatly to take different labels, where this notion of distance depends on the colour values and the actual relative distance. Such a conditional random field therefore makes the segmentation follow object boundaries as much as possible. What distinguishes the fully connected conditional random field is that the pairwise potential relates each pixel to all other pixels, hence the name fully connected.
6) Optimize the output with a Gaussian conditional random field
The discrete fully connected conditional random field has a complicated formulation and is not straightforward to solve, which leads to slow or even no convergence during model training. We therefore use a Gaussian conditional random field for the back-end optimization; its energy function differs from the one above:
When A + λI is a symmetric positive definite matrix, minimizing E(x) is equivalent to solving the equation:
(A + λI)x = B    (12)
It can be seen that the quadratic energy function of the Gaussian conditional random field has a well-defined global minimum, and solving a linear system is much simpler than optimizing a complicated energy function.
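As a sketch of this back-end step, the example below minimizes a quadratic energy of the form x^T(A + λI)x/2 − B^T x by solving the linear system (A + λI)x = B; how A and B are built from the unary and pairwise terms is not shown, and the matrices used here are synthetic.

```python
import numpy as np

def gaussian_crf_inference(A, B, lam=1.0):
    """Minimize the quadratic Gaussian-CRF energy by solving (A + lam*I) x = B.

    A: [n, n] symmetric matrix collecting the pairwise terms,
    B: [n] vector collecting the unary terms (front-end network output),
    lam: shift that makes A + lam*I symmetric positive definite.
    """
    n = A.shape[0]
    return np.linalg.solve(A + lam * np.eye(n), B)

if __name__ == "__main__":
    np.random.seed(0)
    n = 50
    S = np.random.randn(n, n)
    A = S @ S.T                     # symmetric positive semi-definite pairwise matrix
    B = np.random.randn(n)          # synthetic unary scores
    x = gaussian_crf_inference(A, B, lam=1.0)
    print(np.allclose((A + np.eye(n)) @ x, B))   # True: x solves the linear system
```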
The dataset used in our experiments is PASCAL VOC 2012, currently the most common dataset in the field of image semantic segmentation, on which the advantage of the image semantic segmentation effect of the invention can be verified. The dataset contains 20 foreground object classes and 1 background class, with 1464 images for training, 1449 images for validation and 1456 images for testing. The additional annotations provided by the dataset of Hariharan et al. increase the number of training images to 10582. We measure the precision of the method by the mean pixel intersection over union of these 21 classes:
where k is the number of foreground object classes and p_ij is the number of pixels that belong to class i but are classified into class j.
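For reference, the mean intersection-over-union over the k + 1 = 21 classes can be computed from a confusion matrix as in the standard formulation sketched below (consistent with the definition of p_ij above, but not code from the patent).

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """Mean intersection-over-union over num_classes = k + 1 classes.

    pred, gt: integer label arrays of the same shape.
    confusion[i, j] = p_ij = number of pixels of true class i predicted as class j.
    """
    mask = (gt >= 0) & (gt < num_classes)
    confusion = np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)                       # p_ii
    union = confusion.sum(1) + confusion.sum(0) - intersection
    iou = intersection / np.maximum(union, 1)
    return iou.mean()

if __name__ == "__main__":
    gt = np.random.randint(0, 21, (4, 64, 64))
    pred = gt.copy()
    pred[:, ::7, ::7] = 0                                    # perturb some pixels
    print(f"mIoU: {mean_iou(pred, gt):.4f}")
```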
Experimental hardware environment: Ubuntu 16.04, Core i7 processor at 3.6 GHz, 48 GB of RAM, and an NVIDIA GTX 1080 GPU. The code runs on Python 3.6.5 + TensorFlow 1.8.0.
Our experiments are based on TensorFlow. Stochastic gradient descent is run with a mini-batch size of 14 and momentum 0.9 for 600000 iterations, and the learning rate follows the iteration schedule:
where power = 0.9.
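The schedule formula itself is not reproduced on this page; a common choice consistent with the power = 0.9 setting is the "poly" policy, sketched here purely as an assumption.

```python
def poly_learning_rate(base_lr, iteration, max_iterations=600000, power=0.9):
    """'Poly' learning-rate decay: lr = base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# example: decay from 0.001 over the 600000 training iterations
for it in (0, 300000, 599999):
    print(it, poly_learning_rate(1e-3, it))
```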
As shown in Fig. 7(a), Fig. 7(b) and Fig. 7(c): Fig. 7(a) is the iterative process of the regularization penalty term added to prevent overfitting; Fig. 7(b) is the regression loss of the segmentation model, which gradually converges to a better region after 600000 iterations; Fig. 7(c) is the sum of the regularization loss and the regression loss, i.e., the optimization process of the whole objective function. It can be seen from the figures that the optimization of the objective function does not proceed smoothly: a clear global convergence trend only appears gradually after a fairly large number of iterations.
The algorithm of the invention is compared with FCN (2015, Fully Convolutional Networks for Semantic Segmentation), DeepLab (2017, DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs) and PSPNet (2017, Pyramid Scene Parsing Networks); the results are presented in Fig. 9. It can be seen from the figure that our method does segment the target objects well, with comparatively more accurate pixel localization, and has a clear advantage over FCN and DeepLab. Because our method is an improvement on the parallel pooling multi-scale fusion idea of PSPNet, its segmentation results are closer to those of PSPNet, yet it is still found to be better than PSPNet in several subtle places and has a certain advantage overall.
As shown in Fig. 8, the FCN network structure is the simplest and has no deconvolution layers, so its processing speed is the fastest; DeepLab adds conditional random field back-end optimization on top of FCN, so its processing is much slower. Our method has a slight advantage in precision over PSPNet, and because the Gaussian conditional random field performs a second back-end processing pass, its processing speed is slower than that of PSPNet; however, the quadratic energy function of the Gaussian conditional random field has a concise form and is convenient to compute, so it is faster than DeepLab.
Table 2: Scores on the PASCAL VOC 2012 dataset
We finally run the evaluation on the PASCAL VOC 2012 test set. As shown in the table, in order to analyse the segmentation of different objects, we list the segmentation precision of every object class in the PASCAL VOC 2012 dataset. It can be seen that the best method is not optimal for every object, and some objects that are partially occluded or disconnected are particularly difficult to segment. In terms of the overall mIoU, however, our method performs as predicted, and the layer-by-layer multi-level feature fusion scheme has an advantage over many other methods to a certain extent.
The above embodiments only illustrate the technical concept and features of the present invention; their purpose is to allow those familiar with the art to understand the content of the invention and implement it accordingly, and they are not intended to limit the scope of protection of the invention. All modifications made according to the spirit and essence of the main technical scheme of the invention shall fall within the scope of protection of the present invention.

Claims (3)

1. An image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field, characterized by comprising the following steps:
1) Construct an image pyramid
A four-level image pyramid is built for a single frame. The levels are numbered from bottom to top, with the i-th level denoted G_i. To generate level i+1 of the image pyramid, a Gaussian kernel k_Gaussian is convolved with G_i and every even-numbered row and column is then removed; the Gaussian kernel k_Gaussian is expressed as formula (1):
This process is then iterated on the input original image G_0 to generate the whole image pyramid;
2) Keep the feature-map resolution unchanged using dilated convolution
Viewing an image as a 2D signal from the filtering point of view, a filter w applied to an input feature map x by dilated convolution produces the output y at position i:
where the dilation rate r corresponds to the stride with which the input signal is sampled; this is equivalent to convolving the input x with an upsampled filter obtained by inserting r-1 zeros between two consecutive filter taps along each spatial dimension, and standard convolution is the special case with dilation rate r=1;
3) Layer-by-layer multi-level feature fusion and refinement architecture
An image pyramid containing images of different resolutions is generated from the original image. Starting from the lowest-resolution image at the top level, several fully convolutional operations produce a feature map containing local information, which is then upsampled to the same resolution as the adjacent level and stacked with the initial feature map of that level to take part in its subsequent fully convolutional operations, i.e., the fusion yields new local features. Refinement proceeds level by level from top to bottom in this way, and at the last level a 1 × 1 convolution and fine-tuning are applied to the feature map to obtain the final segmentation map;
4) Upsample with bilinear interpolation
The low-resolution feature maps containing local features are upsampled by bilinear interpolation so that they can be fused into the higher-resolution feature maps containing relatively global features;
5) Define the loss function
A minimized cross-entropy loss is used to compute the sum of the cross-entropy terms at every spatial position of the convolutional network output map, or in other words the distance between the predicted probability distribution of each pixel and its true probability distribution:
where L is the cross-entropy loss for mislabeled classes and R is a regularization term; the function L is usually decomposed into the sum of the losses of individual pixels:
The network model is then fine-tuned on the segmentation task of the PASCAL VOC 2012 dataset by running stochastic gradient descent on the cross-entropy loss;
6) Optimize the output with a Gaussian conditional random field
The back end is optimized with a Gaussian conditional random field whose energy function E(x) is:
When A + λI is a symmetric positive definite matrix, minimizing E(x) is equivalent to solving the equation:
(A + λI)x = B    (12).
2. The image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field according to claim 1, characterized in that: in step 2), dilated convolution allows the size of the receptive field to be adapted by changing the dilation rate r, and the size of the receptive field can be expressed as:
F = (k_size + 1) × r_rate − 1    (3)
where k_size is the actual size of the convolution kernel and r_rate is the dilation rate.
3. The image semantic segmentation method based on multi-level feature fusion and a Gaussian conditional random field according to claim 1, characterized in that: in step 4), bilinear interpolation computes the target pixel value by interpolating from four points of the original image. Given the values of the function f at the four points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2), the value of the unknown function f at the point P(x, y) can be calculated, specifically comprising:
In the first step, linear interpolation in the x direction gives:
where R1 = (x, y1) and R2 = (x, y2);
In the second step, linear interpolation in the y direction through R1 and R2 computed in the first step yields the value at P:
where P = (x, y).
CN201811218436.2A 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field Active CN109461157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811218436.2A CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811218436.2A CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Publications (2)

Publication Number Publication Date
CN109461157A true CN109461157A (en) 2019-03-12
CN109461157B CN109461157B (en) 2021-07-09

Family

ID=65607897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811218436.2A Active CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Country Status (1)

Country Link
CN (1) CN109461157B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637045A (en) * 2013-11-14 2015-05-20 重庆理工大学 Image pixel labeling method based on super pixel level features
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VEMULAPALLI R et al.: "Gaussian Conditional Random Field Network for Semantic Segmentation", 2016 IEEE Conference on Computer Vision and Pattern Recognition *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113597613A (en) * 2019-03-22 2021-11-02 辉达公司 Shape fusion for image analysis
CN110070022A (en) * 2019-04-16 2019-07-30 西北工业大学 A kind of natural scene material identification method based on image
CN110047047A (en) * 2019-04-17 2019-07-23 广东工业大学 Method, apparatus, equipment and the storage medium of three-dimensional appearance image information interpretation
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110246084B (en) * 2019-05-16 2023-03-31 五邑大学 Super-resolution image reconstruction method, system and device thereof, and storage medium
CN110246084A (en) * 2019-05-16 2019-09-17 五邑大学 A kind of super-resolution image reconstruction method and its system, device, storage medium
US11244196B2 (en) 2019-06-05 2022-02-08 Boe Technology Group Co., Ltd. Method of semantically segmenting input image, apparatus for semantically segmenting input image, method of pre-training apparatus for semantically segmenting input image, training apparatus for pre-training apparatus for semantically segmenting input image, and computer-program product
CN110188765A (en) * 2019-06-05 2019-08-30 京东方科技集团股份有限公司 Image, semantic parted pattern generation method, device, equipment and storage medium
CN110188765B (en) * 2019-06-05 2021-04-06 京东方科技集团股份有限公司 Image semantic segmentation model generation method, device, equipment and storage medium
CN110263732A (en) * 2019-06-24 2019-09-20 京东方科技集团股份有限公司 Multiscale target detection method and device
CN110348447A (en) * 2019-06-27 2019-10-18 电子科技大学 A kind of multiple-model integration object detection method with rich space information
CN110348447B (en) * 2019-06-27 2022-04-19 电子科技大学 Multi-model integrated target detection method with abundant spatial information
CN110490842A (en) * 2019-07-22 2019-11-22 同济大学 A kind of steel strip surface defect detection method based on deep learning
CN110705344B (en) * 2019-08-21 2023-03-28 中山大学 Crowd counting model based on deep learning and implementation method thereof
CN110705344A (en) * 2019-08-21 2020-01-17 中山大学 Crowd counting model based on deep learning and implementation method thereof
CN110633715A (en) * 2019-09-27 2019-12-31 深圳市商汤科技有限公司 Image processing method, network training method and device and electronic equipment
CN110738647A (en) * 2019-10-12 2020-01-31 成都考拉悠然科技有限公司 Mouse detection method integrating multi-receptive-field feature mapping and Gaussian probability model
CN110837811A (en) * 2019-11-12 2020-02-25 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
CN110969166A (en) * 2019-12-04 2020-04-07 国网智能科技股份有限公司 Small target identification method and system in inspection scene
WO2021115061A1 (en) * 2019-12-11 2021-06-17 中国科学院深圳先进技术研究院 Image segmentation method and apparatus, and server
CN111274995A (en) * 2020-02-13 2020-06-12 腾讯科技(深圳)有限公司 Video classification method, device, equipment and computer readable storage medium
CN111274995B (en) * 2020-02-13 2023-07-14 腾讯科技(深圳)有限公司 Video classification method, apparatus, device and computer readable storage medium
CN111738902A (en) * 2020-03-12 2020-10-02 超威半导体(上海)有限公司 Large convolution kernel real-time approximate fitting method based on bilinear filtering image hierarchy
CN111539458B (en) * 2020-04-02 2024-02-27 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN111539458A (en) * 2020-04-02 2020-08-14 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN111523546B (en) * 2020-04-16 2023-06-16 湖南大学 Image semantic segmentation method, system and computer storage medium
CN111523546A (en) * 2020-04-16 2020-08-11 湖南大学 Image semantic segmentation method, system and computer storage medium
CN112381020A (en) * 2020-11-20 2021-02-19 深圳市银星智能科技股份有限公司 Video scene identification method and system and electronic equipment
CN113159038A (en) * 2020-12-30 2021-07-23 太原理工大学 Coal rock segmentation method based on multi-mode fusion
CN112837320B (en) * 2021-01-29 2023-10-27 华中科技大学 Remote sensing image semantic segmentation method based on parallel hole convolution
CN112837320A (en) * 2021-01-29 2021-05-25 武汉善睐科技有限公司 Remote sensing image semantic segmentation method based on parallel hole convolution
CN113033570A (en) * 2021-03-29 2021-06-25 同济大学 Image semantic segmentation method for improving fusion of void volume and multilevel characteristic information
CN113033570B (en) * 2021-03-29 2022-11-11 同济大学 Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN112948952A (en) * 2021-04-08 2021-06-11 郑州航空工业管理学院 Evolution prediction method for shield tunnel lining back cavity
CN113034371A (en) * 2021-05-27 2021-06-25 四川轻化工大学 Infrared and visible light image fusion method based on feature embedding
CN113744169A (en) * 2021-09-07 2021-12-03 讯飞智元信息科技有限公司 Image enhancement method and device, electronic equipment and storage medium
CN114549913B (en) * 2022-04-25 2022-07-19 深圳思谋信息科技有限公司 Semantic segmentation method and device, computer equipment and storage medium
CN114549913A (en) * 2022-04-25 2022-05-27 深圳思谋信息科技有限公司 Semantic segmentation method and device, computer equipment and storage medium
CN116777768A (en) * 2023-05-25 2023-09-19 珠海移科智能科技有限公司 Robust and efficient scanned document image enhancement method and device

Also Published As

Publication number Publication date
CN109461157B (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN109461157A (en) Image semantic segmentation method based on multi-level feature fusion and Gaussian conditional random field
CN110414377B (en) Remote sensing image scene classification method based on scale attention network
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
Sun et al. Lattice long short-term memory for human action recognition
Mohanty et al. Deep learning for understanding satellite imagery: An experimental survey
CN104376326B (en) A kind of feature extracting method for image scene identification
CN110428428A (en) A kind of image, semantic dividing method, electronic equipment and readable storage medium storing program for executing
CN106407986B (en) A kind of identification method of image target of synthetic aperture radar based on depth model
CN108549893A (en) A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109409371A (en) The system and method for semantic segmentation for image
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN110176027A (en) Video target tracking method, device, equipment and storage medium
CN110619369A (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN106909924A (en) A kind of remote sensing image method for quickly retrieving based on depth conspicuousness
CN107909015A (en) Hyperspectral image classification method based on convolutional neural networks and empty spectrum information fusion
CN110363201A (en) Weakly supervised semantic segmentation method and system based on Cooperative Study
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN114937151A (en) Lightweight target detection method based on multi-receptive-field and attention feature pyramid
Anderson et al. Fuzzy choquet integration of deep convolutional neural networks for remote sensing
CN109165743A (en) A kind of semi-supervised network representation learning algorithm based on depth-compression self-encoding encoder
CN110334718A (en) A kind of two-dimensional video conspicuousness detection method based on shot and long term memory
CN107506792A (en) A kind of semi-supervised notable method for checking object
Liu et al. Coastline extraction method based on convolutional neural networks—A case study of Jiaozhou Bay in Qingdao, China
Mohanty et al. Deep learning for understanding satellite imagery: An experimental survey. Front

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant