CN110929744B - Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features - Google Patents

Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features

Info

Publication number: CN110929744B (application CN201811103919.8A)
Authority: CN (China)
Prior art keywords: feature map, hierarchical, classification, joint, segmentation
Legal status: Active (granted)
Other versions: CN110929744A
Other languages: Chinese (zh)
Inventors: 朱策, 文宏雕, 段昶, 徐榕键
Current and original assignee: Chengdu Tubiyou Technology Co., Ltd.
Application CN201811103919.8A was filed by Chengdu Tubiyou Technology Co., Ltd. on 2018-09-20 (priority date 2018-09-20); CN110929744A was published on 2020-03-27 and the grant CN110929744B on 2023-04-28.


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention belongs to the technical field of computer vision; it relates to convolutional neural networks, image semantic segmentation, weakly supervised learning, and feature fusion, and in particular to a weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features. The method comprises several novel components: generation of hierarchical masking matrices, construction of a hierarchical convolutional neural network, combination of the hierarchical network features, and construction and optimization of hierarchical and joint image classification loss functions. The salient regions that the previous level of the network uses for classification are masked, which forces the next level to extract features from less salient regions and to recognize non-dominant parts of the object. Repeating this process yields multiple levels of convolutional networks, each responsible for mining region features of a different saliency; the output feature maps are then concatenated into a joint feature map, giving a more complete and accurate region-feature mining model.

Description

Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features
Technical Field
The invention belongs to the technical field of computer vision; it relates to convolutional neural networks, image semantic segmentation, weakly supervised learning, and feature fusion, and in particular to a weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features.
Background
Image semantic segmentation is one of the three basic tasks of computer vision. It is defined as classifying every pixel that appears in an image, one by one. Because it is a pixel-level classification task, it is considerably more difficult than image-level object classification and recognition. Currently, most leading semantic segmentation algorithms extract features with convolutional neural networks (Convolutional Neural Network, CNN). Although CNNs have great advantages over traditional models, a large amount of labeled data is needed to fit a deep CNN well. Producing pixel-level semantic segmentation labels, however, consumes a great deal of manpower and material resources, which makes fully supervised segmentation models difficult to scale up quickly; image semantic segmentation based on weakly supervised learning has therefore attracted more and more attention. Among such approaches, weakly supervised image semantic segmentation based on image class labels receives the greatest interest.
How to link image classification with semantic segmentation is one focus of research toward weakly supervised semantic segmentation from image class labels, because image classification needs only the support of typical features, which tend to be distributed over a partial region of the target. Segmentation results obtained directly from an image classification network are therefore usually neither accurate nor complete enough. Singh et al. first proposed masking the input image to force the network to learn weaker features, achieving weakly supervised object and action localization (Singh K, Lee YJ. Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization [J]. 2017). Later, Wei et al. proposed weakly supervised semantic segmentation based on adversarial erasing of salient regions by multiple entities (Wei Y, Feng J, Liang X, et al. Object Region Mining with Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach [J]. 2017: 6488-6496). The disadvantage of that approach is that several networks of the same structure must be trained, each responsible for identifying and locating region features of a different saliency, and these entities are mutually independent, with no explicit association between them that would allow dynamic adjustment. A weakly supervised semantic segmentation method that uses a single network to automatically mask region features of different saliency simultaneously, and thereby mine region features more comprehensively and completely, has not yet been proposed or applied.
Disclosure of Invention
To enrich the diversity of convolutional network features and to improve the recognition of sub-salient features in weakly supervised image semantic segmentation, the invention provides a weakly supervised semantic segmentation method based on hierarchical joint convolutional network features.
The technical scheme adopted by the invention is as follows:
step 1: image X and corresponding output category label y are determined. The convolutional neural network phi is selected as a basic model, and a basic feature map is obtained after the image X is input into the network phi
Figure BDA0001806551200000021
F=Φ(X) (1)
Wherein h, w and c represent the length, width and channel number of the basic feature map, respectively.
Step 2: the basic feature map F is masked in k layers. The masking matrix of the ith hierarchy is
Figure BDA0001806551200000022
The multiplication of all the covering matrixes before the current level and the basic feature map are multiplied channel by channel to obtain the covering feature map: />
Figure BDA0001806551200000023
Wherein ". Is Hadamard product. With the exception of the exceptions, the expression will default to all k levels in this specification.
The values of the 1 st level mask matrix are all 1:
Figure BDA0001806551200000024
the other level mask matrix value calculation method is shown in step 7.
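As an illustration only, the level-wise masking of step 2 can be sketched in PyTorch as follows; the function and variable names (masked_feature, masks) are placeholders rather than symbols from the patent, and the cumulative product follows equation (2) above:

    import torch

    def masked_feature(F, masks, i):
        # F: basic feature map of shape (c, h, w); masks: list of k matrices of shape (h, w)
        # Eq. (2): Hadamard product of M_1 .. M_i, applied channel by channel to F
        m = torch.ones_like(masks[0])
        for j in range(i + 1):
            m = m * masks[j]
        return F * m  # the (h, w) mask broadcasts across all c channels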
Step 3: the mask feature map is convolved in k layers. The convolution network of the ith hierarchy uses H i Representation, corresponding generation of hierarchical feature graphs
Figure BDA0001806551200000025
FH i =H i (FM i ) (4)
Step 4: the hierarchical feature map is obtained by one convolutionTo a segmentation feature map
Figure BDA0001806551200000026
Wherein c o Representing the number of target classes, assuming that the i-th layer partitions the convolution kernel to be Kseg i The calculation method of the segmentation feature map is as follows:
Fseg i =FH i *Kseg i (5)
where x represents the convolution operation.
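For concreteness, the level network H_i and the segmentation kernel Kseg_i of steps 3-4 might be realized as in the following sketch; the 3×3 and 1×1 kernel sizes and the channel counts are assumptions for illustration (the embodiment later uses 512 input channels and 256 hierarchical channels), not requirements of the text:

    import torch.nn as nn

    class LevelBranch(nn.Module):
        # One hierarchy level: H_i (Eq. 4) followed by the segmentation conv Kseg_i (Eq. 5)
        def __init__(self, c_in=512, c_mid=256, c_out=21):
            super().__init__()
            self.H = nn.Sequential(                     # assumed form of H_i
                nn.Conv2d(c_in, c_mid, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            self.Kseg = nn.Conv2d(c_mid, c_out, kernel_size=1)  # assumed 1x1 Kseg_i

        def forward(self, FM):
            FH = self.H(FM)       # hierarchical feature map, Eq. (4)
            Fseg = self.Kseg(FH)  # segmentation feature map, Eq. (5)
            return FH, Fseg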
Step 5: the segmentation feature map is convolved again to obtain a classification feature map
Figure BDA0001806551200000027
Assuming that the i-th layer classification convolution kernel is Kcls i The expression of the classification feature map is:
Fcls i =Fseg i *Kcls i (6)
step 6: the classification feature map obtains classification activation vectors through global pooling
Figure BDA0001806551200000028
If the global pooling operation is denoted as p, the classified activation vector is:
Acls i =Ρ(Fcls i ) (7)
when pooling is global average pooling, the classified activation vectors are:
Figure BDA0001806551200000031
when pooling is global maximum pooling, the classified activation vectors are:
Figure BDA0001806551200000032
step 7: mapping the classified probability vectors Aprob through Softmax function i . The probability of class j is:
Figure BDA0001806551200000033
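Steps 5-7 amount to one more convolution, a global pooling, and a Softmax; a minimal sketch assuming a 1×1 kernel for Kcls_i and global average pooling (both choices are left open by the text):

    import torch
    import torch.nn as nn

    Kcls = nn.Conv2d(21, 21, kernel_size=1)  # assumed 1x1 classification conv Kcls_i

    def classify(Fseg):
        # Fseg: (N, c_o, h, w) segmentation feature map
        Fcls = Kcls(Fseg)                    # classification feature map, Eq. (6)
        Acls = Fcls.mean(dim=(2, 3))         # global average pooling, Eq. (8)
        Aprob = torch.softmax(Acls, dim=1)   # class probabilities, Eq. (10)
        return Acls, Aprob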
step 8: generating i+1th level masking matrix from segmentation feature map
Figure BDA0001806551200000034
Firstly, normalizing the value of the ith hierarchical segmentation feature map to interval 0 to 1 to obtain a normalized feature map +.>
Figure BDA0001806551200000035
Figure BDA0001806551200000036
Wherein epsilon has the effect of guaranteeing the stability of the division.
Then threshold separation is carried out on the standard feature diagram to obtain a separated feature diagram
Figure BDA0001806551200000037
Areas below the threshold will be preserved and areas above the threshold will be masked: />
Figure BDA0001806551200000038
Wherein the threshold is denoted gamma.
Finally, the separation feature map is maximized in the category dimension to obtain a masking matrix of the next level:
Figure BDA0001806551200000039
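The mask generation of step 8 can be sketched as follows; next_mask is a placeholder name, the min-max form of equation (11) is the normalization assumed above, and ε and γ take the example values used later in the embodiment:

    import torch

    def next_mask(Fseg, gamma=0.9, eps=1e-7):
        # Fseg: segmentation feature map of level i, shape (c_o, h, w)
        Fnorm = (Fseg - Fseg.min()) / (Fseg.max() - Fseg.min() + eps)  # Eq. (11)
        Fsep = (Fnorm < gamma).float()   # Eq. (12): 1 below the threshold, 0 above
        return Fsep.max(dim=0).values    # Eq. (13): maximum over the class dimension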
step 9: and (5) completing the establishment of the hierarchical convolution network. And judging whether the current hierarchical level reaches the maximum level number k. If the termination level convolution is satisfied, otherwise, repeating the step 2-8.
Step 10: a joint hierarchical convolutional network. Connecting the hierarchical feature graphs output by all hierarchical convolution networks together to obtain a joint feature graph
Figure BDA00018065512000000310
Fcomb=concate(FH 1 ,FH 2 ,...,FH k ) (14)
Where concate represents a feature map join operation, here performed in the feature map channel dimension.
Step 11: and sequentially obtaining a joint segmentation feature map, a joint classification activation vector and a joint classification probability vector by using the joint feature map. Let Kcomb_seg and Kcomb_cls be the joint segmentation convolution kernel and the joint classification convolution kernel, respectively. The operation mode is consistent with the steps 4-7:
Figure BDA0001806551200000041
wherein the method comprises the steps of
Figure BDA0001806551200000042
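Steps 10-11 reduce to a channel concatenation followed by the same segmentation and classification pipeline; a sketch with assumed 1×1 joint kernels and the example sizes of the embodiment:

    import torch
    import torch.nn as nn

    k, c_mid, c_o = 4, 256, 21
    Kcomb_seg = nn.Conv2d(k * c_mid, c_o, kernel_size=1)  # assumed 1x1 joint seg conv
    Kcomb_cls = nn.Conv2d(c_o, c_o, kernel_size=1)        # assumed 1x1 joint cls conv

    def joint_branch(FH_list):
        Fcomb = torch.cat(FH_list, dim=1)        # Eq. (14): concatenate on channels
        Fcomb_seg = Kcomb_seg(Fcomb)             # joint segmentation feature map
        Fcomb_cls = Kcomb_cls(Fcomb_seg)         # joint classification feature map
        Acomb_cls = Fcomb_cls.mean(dim=(2, 3))   # global pooling, as in step 6
        return Fcomb_seg, torch.softmax(Acomb_cls, dim=1)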
Step 12: an image classification objective function is established. The objective function includes a hierarchical classification loss function and a joint classification loss function. Both class loss functions are calculated by cross entropy of the respective class activation vector and class label. The hierarchical classification loss function is averaged and the weight of the hierarchical classification loss function and the joint classification loss function are respectively one unit. The method comprises the following steps:
Figure BDA0001806551200000043
wherein the analog tag y is one-hot encoded, taking 1 only when the image has an object, and taking 0 in other cases.
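The objective of equation (16) might be written as below; the small constant inside the logarithm is an added numerical guard, not part of the patent:

    import torch

    def cross_entropy(Aprob, y):
        # y: multi-hot label vector, 1 for classes present in the image, else 0
        return -(y * torch.log(Aprob + 1e-12)).sum(dim=1).mean()

    def objective(Aprob_levels, Aprob_comb, y):
        # Eq. (16): averaged hierarchical losses plus the joint loss, unit weight each
        L_hier = sum(cross_entropy(p, y) for p in Aprob_levels) / len(Aprob_levels)
        return L_hier + cross_entropy(Aprob_comb, y)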
Step 13: calculating error loss by taking equation as objective function, and adjusting network phi, H by back propagation algorithm i ,Kseg i ,Kcls i Kcomb_seg and Kcomb_cls, the model composed of all the above networks and parameters is denoted by ψ. Where i is between 1 and k. The training is repeated for s steps.
Step 14: predictive segmentation result graph using trained model ψ
Figure BDA0001806551200000044
Taking the maximum index in the category channel dimension of the joint segmentation feature map as prediction:
Pseg=argmax(Fcomb_seg) (17)
where the dimension in which argmax acts is the third dimension, i.e. the class dimension, and thus the final predictive segmentation map is reduced to a two-dimensional matrix.
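Prediction in step 14 is a per-pixel argmax over the class channel, for example:

    def predict(Fcomb_seg):
        # Fcomb_seg: torch.Tensor of shape (N, c_o, h, w), the joint segmentation map
        return Fcomb_seg.argmax(dim=1)  # Eq. (17): per-pixel class indices, (N, h, w)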
Drawings
FIG. 1 shows the weakly supervised image semantic segmentation model based on hierarchical joint convolutional network features;
FIG. 2 is a schematic diagram of the hierarchical joint convolutional network of the present invention, shown with 4 levels;
FIG. 3 is a flow chart of the weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features;
FIG. 4 compares segmentation results of the weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features; columns 1 to 4 show the input image, the ground-truth segmentation label, the segmentation of the original model, and the segmentation of the proposed model, respectively.
Detailed Description
The operation steps of the present invention are described below with reference to the accompanying drawings and a practical example.
Step 1: image X and corresponding output category label y are determined. The invention uses PASCAL VOC (Everingham, M., eslami, S.M.A., van Gool, L., williams, C.K.I., win, J.and Zisserman, A.International Journal of Computer Vision,111 (1), 98-136,2015) as training and evaluation data set, selects classical convolutional neural network VGG-16 as basic model to extract depth characteristics, and obtains basic characteristic diagram after inputting image X into network phi
Figure BDA0001806551200000051
The length and width of the basic feature map and the number of channels are 40, 40 and 512, respectively.
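One possible way to obtain such a 40×40×512 feature map with torchvision is sketched below; the 640×640 input resolution and the removal of VGG-16's final pooling layer (giving output stride 16) are assumptions made for illustration, as the text does not specify them:

    import torch
    from torchvision.models import vgg16

    # VGG-16 convolutional trunk with the last max-pool dropped: output stride 16,
    # 512 channels, so a 640x640 input yields a (1, 512, 40, 40) basic feature map.
    backbone = vgg16(pretrained=True).features[:-1]

    X = torch.randn(1, 3, 640, 640)  # stand-in for a preprocessed input image
    F = backbone(X)
    print(F.shape)                   # torch.Size([1, 512, 40, 40])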
Step 2: the basic feature map F is masked in 4 layers. Masking of the ith hierarchyThe matrix is
Figure BDA0001806551200000052
The multiplication of all the covering matrixes before the current level and the basic feature map are multiplied channel by channel to obtain the covering feature map:
Figure BDA0001806551200000053
wherein the values of the 1 st level mask matrix are all 1:
Figure BDA0001806551200000054
the other level mask matrix value calculation method is shown in step 7. See in particular figure 2 of the drawings of the description.
Step 3: the mask feature map is convolved in 4 layers. The convolution network of the ith hierarchy uses H i Representation, corresponding generation of hierarchical feature graphs
Figure BDA0001806551200000055
FH i =H i (FM i )
The number of channels in all hierarchical feature maps is set to 256.
Step 4: the hierarchical feature map is subjected to one-time convolution to obtain a segmentation feature map
Figure BDA0001806551200000056
One more channel feature map, where the target class number of pasal VOCs is 20, represents the background. Assume that the i-th layer divides the convolution kernel into Kseg i The calculation method of the segmentation feature map is as follows:
Fseg i =FH i *Kseg i
where x represents the convolution operation.
Step 5: the segmentation feature map is convolved again to obtain a classification feature map
Figure BDA0001806551200000057
Assuming that the i-th layer classification convolution kernel is Kcls i The expression of the classification feature map is:
Fcls i =Fseg i *Kcls i
step 6: the classification feature map obtains classification activation vectors through global pooling
Figure BDA0001806551200000061
If the global pooling operation is denoted as p, the classified activation vector is:
Acls i =Ρ(Fcls i )
when the invention is specifically described by taking pooling as an example, the classified activation vectors are:
Figure BDA0001806551200000062
step 7: mapping the classified probability vectors Aprob through Softmax function i . The probability of class j is:
Figure BDA0001806551200000063
step 8: generating i+1th level masking matrix from segmentation feature map
Figure BDA0001806551200000064
Firstly, normalizing the value of the ith hierarchical segmentation feature map to interval 0 to 1 to obtain a normalized feature map +.>
Figure BDA0001806551200000065
Figure BDA0001806551200000066
The value of epsilon is chosen to be 1e-7. Then threshold separation is carried out on the standard feature diagram to obtain a separated feature diagram
Figure BDA0001806551200000067
Areas below the threshold will be preserved and areas above the threshold will be masked:
Figure BDA0001806551200000068
wherein the threshold γ is set to 0.9.
Finally, the separation feature map is maximized in the category dimension to obtain a masking matrix of the next level:
Figure BDA0001806551200000069
step 9: and (5) completing the establishment of the hierarchical convolution network. And judging whether the current hierarchical level reaches the maximum level number 4. If the termination level convolution is satisfied, otherwise, repeating the step 2-8.
Step 10: a joint hierarchical convolutional network. Connecting the hierarchical feature graphs output by all hierarchical convolution networks together to obtain a joint feature graph
Figure BDA00018065512000000610
Fcomb=concate(FH 1 ,FH 2 ,...,FH 4 )
Where concate represents a feature map join operation, which is performed in the feature map channel dimension.
Step 11: and sequentially obtaining a joint segmentation feature map, a joint classification activation vector and a joint classification probability vector by using the joint feature map. Let Kcomb_seg and Kcomb_cls be the joint segmentation convolution kernel and the joint classification convolution kernel, respectively. The operation mode is consistent with the steps 4-7:
Figure BDA0001806551200000071
wherein the method comprises the steps of
Figure BDA0001806551200000072
Step 12: an image classification objective function is established. The objective function includes a hierarchical classification loss function and a joint classification loss function. Both class loss functions are calculated by cross entropy of the respective class activation vector and class label. The hierarchical classification loss function is averaged and the weight of the hierarchical classification loss function and the joint classification loss function are respectively one unit. The method comprises the following steps:
Figure BDA0001806551200000073
wherein the analog tag y is one-hot encoded, taking 1 only when the image has an object, and taking 0 in other cases.
Step 13: calculating error loss by taking equation as objective function, and adjusting network phi, H by back propagation algorithm i ,Kseg i ,Kcls i Kcomb_seg and Kcomb_cls, the model composed of all the above networks and parameters is denoted by ψ. Wherein i is between 1 and 4. Training was repeated for 30000 steps.
Step 14: predictive segmentation result graph using trained model ψ
Figure BDA0001806551200000074
Taking the maximum index in the category channel dimension of the joint segmentation feature map as prediction:
Pseg=argmax(Fcomb_seg)
where the dimension in which argmax acts is the third dimension, i.e. the class dimension, and thus the final predictive segmentation map is reduced to a two-dimensional matrix. The average cross-over ratio (mIoU) is used as an evaluation index, and the performance of the weak supervision image semantic segmentation method based on the hierarchical joint convolution network characteristics in the PASCAL VOC verification set is compared with the following table:
TABLE 1. Performance comparison of hierarchical joint convolutional network features

Model features                                      mIoU (%)
Single-level convolutional network features         53.9
Hierarchical joint convolutional network features   55.4
As shown in Table 1, the model based on hierarchical joint convolutional network features improves the validation-set mIoU by 1.5 percentage points over the single-level model. The qualitative comparison in FIG. 4 further illustrates the effectiveness of the weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features.
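For reference, mIoU over a validation set can be computed from a confusion matrix as in this generic sketch (not the evaluation code behind Table 1):

    import numpy as np

    def mean_iou(preds, gts, num_classes=21):
        # preds, gts: iterables of integer label maps of equal shape
        conf = np.zeros((num_classes, num_classes), dtype=np.int64)
        for p, g in zip(preds, gts):
            idx = g.ravel() * num_classes + p.ravel()
            conf += np.bincount(idx, minlength=num_classes**2).reshape(num_classes, -1)
        inter = np.diag(conf).astype(np.float64)
        union = conf.sum(0) + conf.sum(1) - inter
        iou = inter / np.maximum(union, 1)
        return iou[union > 0].mean()  # average over classes that occur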

Claims (1)

1. A weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features, comprising the following steps:
step 1: determining an image X and a corresponding output category label y; a convolutional neural network Φ is selected as the basic model, and feeding the image X into the network Φ yields the basic feature map F ∈ R^(h×w×c):

F = Φ(X)    (1)

wherein h, w and c respectively denote the height, width and number of channels of the basic feature map;
step 2: masking the basic feature map F at k levels; the masking matrix of the i-th level is M_i ∈ R^(h×w); the masking matrix and the basic feature map are multiplied channel by channel to obtain the masked feature map:

FM_i = F ⊙ M_i, i = 1, 2, …, k    (2)

wherein ⊙ denotes the Hadamard product, and unless stated otherwise all expressions default to all k levels;

the entries of the level-1 masking matrix are all 1:

M_1 = 1_(h×w)    (3)

the masking matrices of the remaining levels are computed as described in step 8;
step 3: convolving the masked feature maps at k levels; the convolutional network of the i-th level is denoted H_i and produces the corresponding hierarchical feature map FH_i:

FH_i = H_i(FM_i)    (4)
Step 4: the hierarchical feature map is subjected to one-time convolution to obtain a segmentation feature map
Figure FDA0004069119220000015
Wherein c o Representing the number of target classes, assuming that the i-th layer partitions the convolution kernel to be Kseg i The calculation method of the segmentation feature map is as follows:
Fseg i =FH i *Kseg i (5)
wherein represents a convolution operation;
step 5: convolving the segmentation feature map again gives the classification feature map Fcls_i ∈ R^(h×w×c_o); with Kcls_i denoting the classification convolution kernel of the i-th level, the classification feature map is:

Fcls_i = Fseg_i * Kcls_i    (6)
step 6: global pooling of the classification feature map gives the classification activation vector Acls_i ∈ R^(c_o); denoting the global pooling operation by P, the classification activation vector is:

Acls_i = P(Fcls_i)    (7)

when the pooling is global average pooling, the classification activation vector is:

Acls_i[j] = (1 / (h·w)) Σ_(x=1..h) Σ_(y=1..w) Fcls_i[x, y, j]    (8)

when the pooling is global max pooling, the classification activation vector is:

Acls_i[j] = max_(x,y) Fcls_i[x, y, j]    (9)
step 7: mapping to the classification probability vector Aprob_i through the Softmax function; the probability of class j is:

Aprob_i[j] = exp(Acls_i[j]) / Σ_(l=1..c_o) exp(Acls_i[l])    (10)
step 8: generating the masking matrix of level i+1, M_(i+1) ∈ R^(h×w), from the segmentation feature map; first, the values of the i-th level segmentation feature map are normalized to the interval 0 to 1, giving the normalized feature map Fnorm_i:

Fnorm_i = (Fseg_i - min(Fseg_i)) / (max(Fseg_i) - min(Fseg_i) + ε)    (11)

wherein ε guards against division by zero;

threshold separation is then applied to the normalized feature map to obtain the separated feature map Fsep_i; regions below the threshold are preserved and regions above it are masked:

Fsep_i[x, y, j] = 1 if Fnorm_i[x, y, j] < γ, and 0 otherwise    (12)

wherein γ denotes the threshold;

finally, the separated feature map is maximized over the category dimension to obtain the masking matrix of the next level:

M_(i+1)[x, y] = max_j Fsep_i[x, y, j]    (13)
step 9: completing the construction of the hierarchical convolutional network; judging whether the current level has reached the maximum number of levels k; if so, terminating the level-wise convolution; otherwise, repeating steps 2-8;
step 10: joining the hierarchical convolutional networks; the hierarchical feature maps output by all level networks are concatenated to obtain the joint feature map Fcomb:

Fcomb = concate(FH_1, FH_2, …, FH_k)    (14)

wherein concate denotes the feature map concatenation operation, performed here along the feature map channel dimension;
step 11: sequentially obtaining the joint segmentation feature map, the joint classification activation vector and the joint classification probability vector from the joint feature map; with Kcomb_seg and Kcomb_cls denoting the joint segmentation convolution kernel and the joint classification convolution kernel, respectively, the operations are the same as in steps 4-7:

Fcomb_seg = Fcomb * Kcomb_seg
Fcomb_cls = Fcomb_seg * Kcomb_cls
Acomb_cls = P(Fcomb_cls)
Aprob_comb = Softmax(Acomb_cls)    (15)

wherein Fcomb_seg, Fcomb_cls ∈ R^(h×w×c_o) and Acomb_cls, Aprob_comb ∈ R^(c_o);
step 12: establishing an image classification objective function; the objective function comprises a hierarchical classification loss function and a joint classification loss function, both computed as the cross entropy between the respective classification activation vector (after Softmax mapping) and the class label; the hierarchical classification losses are averaged over the k levels, and the averaged hierarchical loss and the joint loss each carry unit weight:

L = (1/k) Σ_(i=1..k) CE(Aprob_i, y) + CE(Aprob_comb, y), with CE(p, y) = -Σ_j y[j] log p[j]    (16)

wherein the label y is one-hot encoded, taking 1 only for classes present in the image and 0 otherwise;
step 13: using equation (16) as the objective function, computing the error loss and adjusting the networks Φ, H_i, Kseg_i, Kcls_i, Kcomb_seg and Kcomb_cls by the back propagation algorithm, wherein i is between 1 and k; the model composed of all the above networks and parameters is denoted Ψ; training is repeated for s steps;
step 14: predicting the segmentation result map Pseg with the trained model Ψ; the index of the maximum in the class channel dimension of the joint segmentation feature map is taken as the prediction:

Pseg = argmax(Fcomb_seg)    (17)

wherein argmax acts along the third dimension, i.e. the class dimension, so the final predicted segmentation map reduces to a two-dimensional matrix.
CN201811103919.8A 2018-09-20 2018-09-20 Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features Active CN110929744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811103919.8A CN110929744B (en) 2018-09-20 2018-09-20 Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811103919.8A CN110929744B (en) 2018-09-20 2018-09-20 Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features

Publications (2)

Publication Number Publication Date
CN110929744A CN110929744A (en) 2020-03-27
CN110929744B (en) 2023-04-28

Family

ID=69856438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811103919.8A Active CN110929744B (en) Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features

Country Status (1)

Country Link
CN (1) CN110929744B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111916B (en) * 2021-03-15 2023-06-23 中国科学院计算技术研究所 Medical image semantic segmentation method and system based on weak supervision
CN115082657A (en) * 2022-04-14 2022-09-20 华南理工大学 Soft erasure-based weak supervision target positioning algorithm
CN114677515B (en) * 2022-04-25 2023-05-26 电子科技大学 Weak supervision semantic segmentation method based on similarity between classes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106530305A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Semantic segmentation model training and image segmentation method and device, and calculating equipment
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN108132968A (en) * 2017-12-01 2018-06-08 西安交通大学 Network text is associated with the Weakly supervised learning method of Semantic unit with image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106327469B (en) * 2015-06-29 2019-06-18 北京航空航天大学 A kind of video picture segmentation method of semantic label guidance


Also Published As

Publication number Publication date
CN110929744A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
Song et al. SeedNet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
Khattab et al. Color image segmentation based on different color space models using automatic GrabCut
CN110929744B (en) Weakly supervised image semantic segmentation method based on hierarchical joint convolutional network features
US8379994B2 (en) Digital image analysis utilizing multiple human labels
CN107292352B (en) Image classification method and device based on convolutional neural network
CN104462494B (en) A kind of remote sensing image retrieval method and system based on unsupervised feature learning
EP2568429A1 (en) Method and system for pushing individual advertisement based on user interest learning
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN113313173B (en) Human body analysis method based on graph representation and improved Transformer
CN107506792B (en) Semi-supervised salient object detection method
CN111612024B (en) Feature extraction method, device, electronic equipment and computer readable storage medium
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
Ros et al. Unsupervised image transformation for outdoor semantic labelling
Furlán et al. Rock detection in a Mars-like environment using a CNN
CN113469092A (en) Character recognition model generation method and device, computer equipment and storage medium
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
Huang et al. A fully-automatic image colorization scheme using improved CycleGAN with skip connections
CN113505797A (en) Model training method and device, computer equipment and storage medium
Lee et al. Property-specific aesthetic assessment with unsupervised aesthetic property discovery
CN104732534A (en) Method and system for matting conspicuous object in image
CN110135435A (en) A kind of conspicuousness detection method and device based on range learning system
CN111309923B (en) Object vector determination method, model training method, device, equipment and storage medium
CN112528077A (en) Video face retrieval method and system based on video embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant