CN110675421A - Depth image collaborative segmentation method based on few labeling frames - Google Patents

Depth image collaborative segmentation method based on few labeling frames

Info

Publication number
CN110675421A
Authority
CN
China
Prior art keywords
image
network
segmentation
depth
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910813756.0A
Other languages
Chinese (zh)
Other versions
CN110675421B (en)
Inventor
孟凡满
鲍俊玲
黄开旭
李宏亮
吴庆波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910813756.0A priority Critical patent/CN110675421B/en
Publication of CN110675421A publication Critical patent/CN110675421A/en
Application granted granted Critical
Publication of CN110675421B publication Critical patent/CN110675421B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depth image collaborative segmentation method based on a small number of labeling frames (manually annotated bounding boxes), and belongs to the field of image processing. The invention provides a depth image collaborative segmentation network based on a convolutional neural network, and a method for realizing image collaborative segmentation using only image-level labels and a small number of manual annotation boxes. The method considers not only the internal information of each image but also, through the depth semantic features extracted by the convolutional neural network, the correlation between images, so that more accurate segmentation results are obtained under weakly supervised learning.

Description

Depth image collaborative segmentation method based on few labeling frames
Technical Field
The invention belongs to the field of image processing, and particularly relates to a depth image collaborative segmentation method based on image-level labels and a small number of manual annotation boxes.
Background
Image segmentation is fundamental research in the field of image processing and a key step in the field of computer vision. Image segmentation technology is widely applied in fields such as autonomous driving, intelligent healthcare, medical image analysis, remote sensing and meteorological services, and military engineering. It divides an image into several object regions according to semantic information, and is the first, and a critical, step of image analysis.
Image collaborative segmentation is an important branch of image segmentation. It aims to extract common object regions from a group of images, can be widely applied in computer vision and multimedia processing, and has recently received wide attention. However, image collaborative segmentation is a challenging task because target information is insufficient and much interfering information exists. Particularly under variable foregrounds and complex backgrounds, how to effectively construct an image collaborative segmentation model and measure semantic region similarity has become an urgent need of computer vision and artificial intelligence research.
Image collaborative segmentation aims to extract similar object regions from a group of images; its difficulties include establishing and optimizing the model, measuring the foreground consistency of local regions, and extracting common object information. Traditional collaborative segmentation methods adopt hand-crafted features, which can hardly describe the semantics of a region or effectively measure region consistency, so semantic collaborative segmentation of objects is difficult to achieve against complex backgrounds. For example, most recently proposed image collaborative segmentation methods still rely on characteristics such as the color, shape, and texture of the image. Such low-level features, which carry no semantic information, are ill-suited to the inherently challenging task of image collaborative segmentation.
Convolutional neural networks have achieved significant performance breakthroughs in many areas in recent years, including object detection, object classification, speech recognition, object localization, and video generation. Because convolutional neural networks contain millions of network parameters, image features at different layers and structures can be learned automatically. Compared with traditional methods, the features learned by a convolutional neural network carry strong semantic information and are highly robust to changes in the shape, size, spatial position, and orientation of the target object. A convolutional neural network lets the machine learn the mapping between input and output and automatically produces optimal features through thorough learning, thereby replacing manual feature selection. Because of these advantages, convolutional neural networks have gained increasing attention in many areas.
At present, existing image collaborative segmentation methods can hardly achieve collaborative segmentation of complex images, so an image collaborative segmentation algorithm combined with a convolutional neural network is an effective way to realize collaborative segmentation of complex images. However, existing convolutional neural networks need a large amount of manually labeled data, while image collaborative segmentation offers only coarse image-level labels, making it difficult to provide rich and accurate training samples; realizing image collaborative segmentation in combination with a convolutional neural network is therefore a difficult task.
Disclosure of Invention
The invention aims to address the above problems: to achieve good image segmentation results using only a small number of manual annotation boxes, and to realize weakly supervised image collaborative segmentation from the internal information of each image together with the information between images.
The invention discloses a depth image collaborative segmentation method based on a small number of labeling frames, which comprises the following steps:
step 1: setting a depth image collaborative segmentation network:
the depth image collaborative segmentation network comprises two branches, and a first branch network comprises a feature extraction network and a segmentation network; the second branch network comprises a feature extraction network and a segmentation network, and the network structures of the two branch networks are the same;
the feature extraction network adopts a convolutional neural network structure and is used for extracting the depth features of the input image;
the segmentation network also adopts a convolutional neural network structure and is used for carrying out foreground segmentation on the input image based on the depth features extracted by the feature extraction network to obtain a segmentation result;
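As an illustration of this two-branch structure, the following is a minimal PyTorch-style sketch (PyTorch, the backbone weights, and the head dimensions are assumptions for illustration; the description fixes only the overall architecture of two identical branches, each comprising a feature extraction network and a segmentation network):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision

    class Branch(nn.Module):
        """One branch: a feature extraction network plus a segmentation network."""
        def __init__(self):
            super().__init__()
            # Feature extraction network: the convolutional layers of VGG-16.
            self.features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
            # Segmentation network: a small FCN-style convolutional head that
            # predicts a foreground probability at every spatial position.
            self.segment = nn.Sequential(
                nn.Conv2d(512, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, 1), nn.Sigmoid())

        def forward(self, x):
            feat = self.features(x)                    # depth features
            mask = self.segment(feat)                  # low-resolution foreground map
            mask = F.interpolate(mask, size=x.shape[-2:], mode="bilinear",
                                 align_corners=False)  # back to input resolution
            return feat, mask

    class CoSegNet(nn.Module):
        """Two branches with identical structure."""
        def __init__(self):
            super().__init__()
            self.branch1 = Branch()
            self.branch2 = Branch()

        def forward(self, x1, x2):
            return self.branch1(x1), self.branch2(x2)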
step 2: carrying out deep learning training on the depth image collaborative segmentation network:
setting an image set S comprising all required categories according to the categories to be segmented, and selecting a small number of sample images for each image category from the image set S, i.e., a certain number of sample images per category, to obtain an image set S1; the number of selected images is far smaller than the number of images to be segmented, e.g., 1-10 sample images per image category; each selected sample image is given an annotation box indicating its foreground region, i.e., the foreground region is framed manually;
performing image preprocessing on all images included in the image set S and the image set S1, wherein the image preprocessing comprises image graying and size normalization processing; the normalized size is matched with the input of the depth image collaborative segmentation network;
initializing network parameters of the depth image collaborative segmentation network, and taking the image set S1 and its annotation boxes as the input of the first branch network of the depth image collaborative segmentation network; the image set S is used as the input of the second branch network;
obtaining the depth features of the input image from the output of the feature extraction network of each branch network, and the segmentation result of the input image from the output of the segmentation network of each branch network;
setting the loss function L of the depth image collaborative segmentation network as the weighted sum of a cross-entropy loss function Lc and a similarity loss function sim;
wherein Lc = −Σ_x [y(x)·log(y0(x)) + (1−y(x))·log(1−y0(x))], y0 denotes the segmentation result output by the first branch network, y denotes the annotation-box mask, and x denotes a pixel of the input image;
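A minimal sketch of this term, assuming y0 is the foreground probability map predicted by the first branch and y is a binary mask rasterized from the annotation box (both tensors of the same shape with values in [0, 1]):

    import torch.nn.functional as F

    def cross_entropy_loss(y0, y):
        # Equals -sum_x [ y(x)*log(y0(x)) + (1 - y(x))*log(1 - y0(x)) ]
        return F.binary_cross_entropy(y0, y, reduction="sum")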
the similarity loss function sim is calculated as:
respectively down-sampling the segmentation results output by the segmentation networks of the two branch networks to the size of the depth features output by the feature extraction networks, and then respectively matching the segmentation result of each branch network with its depth features to obtain the matched first-branch depth features and second-branch depth features;
wherein the matching process is as follows:
corresponding pixel positions of 0 and non-0 in the downsampled segmentation result to corresponding positions of the depth feature, wherein the corresponding position of the pixel of 0 is a background feature of the depth feature, and the corresponding position of the pixel of non-0 is a foreground feature of the depth feature;
solving the Euclidean distance between each foreground feature of the matched first-branch depth features and each foreground feature of the matched second-branch depth features, summing all the Euclidean distances and then averaging to obtain the parameter ff;
solving the Euclidean distance between each foreground feature and each background feature of the second-branch depth features, summing all the Euclidean distances and then averaging to obtain the parameter fb;
and calculating the similarity loss function sim according to the formula sim = ff / fb;
in the deep learning training process of the depth image collaborative segmentation network, the convergence condition is set as: stopping training when the rate of change of the loss function L over the last two iterations does not exceed a preset change-rate threshold;
and step 3: carrying out image preprocessing on an image to be segmented, wherein the image preprocessing comprises image graying and size normalization processing;
and inputting the preprocessed image to be segmented into the trained depth image collaborative segmentation network, and resizing the segmentation result output by the network to the original size of the image to be segmented as the final segmentation result.
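A sketch of this inference step, assuming net is the two-branch network sketched in step 1 and that the second branch is used at test time (the text does not fix which branch serves inference):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def segment_image(net, image, input_size=(320, 320)):
        # image: a (1, 3, H, W) tensor of the preprocessed image to be segmented
        h, w = image.shape[-2:]                         # original image size
        x = F.interpolate(image, size=input_size, mode="bilinear",
                          align_corners=False)          # size normalization
        _, mask = net.branch2(x)                        # forward pass
        return F.interpolate(mask, size=(h, w), mode="bilinear",
                             align_corners=False)       # resize to original size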
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
the invention provides a depth image collaborative segmentation network based on a convolutional neural network, and a method for realizing image collaborative segmentation only by using image-level labels and a small number of artificial labeling frames. The method not only considers the internal information of the images, but also utilizes the correlation between the images by combining the depth semantic features extracted by the convolutional neural network, so that the images can obtain more accurate segmentation results under the condition of weak supervised learning.
Drawings
FIG. 1 is a schematic diagram of the deep collaborative segmentation network framework of the present invention;
FIG. 2 shows the segmentation results of several images selected from the datasets used by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Most algorithms that use a convolutional neural network for image segmentation obtain good segmentation results only by spending a large amount of labor to collect large numbers of manually labeled training samples. To reduce this labor cost, the invention provides a method that achieves good image segmentation results using only a small number of manual annotation boxes. For an image set with similar object regions, the invention builds a deep convolutional neural network using only the image-level labels of the images and a small number of manual annotation boxes, measures the similarity of depth semantic features, and realizes collaborative segmentation of the images.
Existing image segmentation methods, whether strongly or weakly supervised, segment using only the intrinsic information of each image and do not consider the information between images. The invention therefore performs collaborative segmentation over a group of images, considering the intrinsic information of each image and the inter-image information together to solve the segmentation problem. As for existing image collaborative segmentation methods, most measure similarity using only low-level image features such as color, shape, and texture, and can hardly achieve good segmentation results on complex images.
The invention provides a novel depth image collaborative segmentation network based on a convolutional neural network: it uses the image-level labels of the images and a small number of manual annotation boxes, sets the network loss function according to the depth semantic features produced by the network, and realizes segmentation of the common regions of the images. The method both remedies the need of a convolutional neural network for large amounts of manual labeling and combines the image collaborative segmentation algorithm with the convolutional neural network to realize collaborative segmentation of complex images. The specific implementation process of the invention is as follows:
step one, data preparation.
1.1 Select the deep learning image segmentation datasets.
In this embodiment, the deep learning image segmentation datasets are the iCoseg dataset and the PASCAL VOC 2012 validation set. The iCoseg dataset contains 38 categories (e.g., pandas, football teams, kites, pyramids), 643 images in total; the PASCAL VOC 2012 validation set contains 20 categories (e.g., airplanes, bicycles, cats, cows), 1449 images in total. For the iCoseg dataset, two images are picked out for each category; for the PASCAL VOC 2012 dataset, four or five images are picked out for each category; each picked image is given a manual annotation box framing its foreground region. The selected images form the image set S1, and the images of the whole set form the image set S2; S1 and S2 contain the same categories, and the number of images in S2 is far larger than in S1.
1.2 All images are normalized to 320 × 320 to fit the input of the network, i.e., to match the input of the deep learning network to be trained.
1.3 In this embodiment, the network is designed to include two branches; in training, the images in the image set S1 and their annotation boxes are input to the first branch, and the images in the image set S2 are input to the other branch.
The first branch and the second branch receive images of the same category. Since the number of images in S1 is much smaller than in S2, each input takes one image from S2 as the second-branch input and selects from S1 any image of the same category as the first-branch input, as sketched below.
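A sketch of this pairing rule (the data structures are assumptions): each training sample pairs one S2 image with a randomly chosen boxed S1 image of the same category.

    import random

    def make_pair(s2_item, s1_by_category):
        image2, category = s2_item                               # one image from S2
        image1, box1 = random.choice(s1_by_category[category])   # same-category S1 image
        return (image1, box1), image2                            # branch-1 input, branch-2 input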
Step two, constructing the depth image collaborative segmentation network.
Referring to FIG. 1, the depth image collaborative segmentation network of the invention comprises two branch networks: the first branch network comprises a feature extraction network and a segmentation network; the second branch network comprises a feature extraction network and a segmentation network, and the two branch networks have the same network structure. The feature extraction network adopts a convolutional neural network structure and extracts the depth features of the input image; the segmentation network also adopts a convolutional neural network structure and performs foreground segmentation of the input image based on the depth features extracted by the feature extraction network, producing a segmentation result, i.e., a binarized foreground map;
That is, the framework of the depth image collaborative segmentation network of the invention is that of a fully convolutional network (FCN), and its loss function employs a cross-entropy loss function and a similarity loss function.
Preferably, the feature extraction network of the depth image collaborative segmentation network adopts the 16-layer VGG (Visual Geometry Group) network structure, i.e., VGG-16; the initial parameters of the VGG-16 network are preferably those obtained by training on the 1000-class ImageNet images; the initial learning rate is set to 1.0e-10 and decreases gradually as the number of iterations increases, and the momentum and weight decay are set to 0.9 and 0.0005, respectively.
In this embodiment, the segmentation network has an FCN network structure.
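Under these stated hyperparameters, the training configuration might look as follows (PyTorch and the exact decay schedule are assumptions; only the numeric values come from the text):

    import torch

    net = CoSegNet()   # the two-branch network sketched in step two
    optimizer = torch.optim.SGD(net.parameters(), lr=1.0e-10,
                                momentum=0.9, weight_decay=0.0005)
    # The learning rate decreases gradually as the number of iterations
    # increases, e.g. with an exponential schedule:
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)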
Step three, constructing the similarity measure.
3.1 In the depth image collaborative segmentation network, the depth features are the 4096-dimensional features output by the last convolutional layer of the feature extraction network of each of the two branch networks (a convolutional neural network for extracting image features; its specific structure can be an existing feature extraction network such as VGG-16, or can be designed according to application requirements); the feature size is 11 × 11 × 4096.
Meanwhile, based on the depth features extracted by the feature extraction network, the segmentation network of each branch (based on a convolutional neural network structure) produces the segmentation result of the input image (a binarized foreground map);
3.2 Downsampling the segmentation results (size 320 × 320) of the two branches of the network to the same size as the depth features (size 11 × 11), then matching each segmentation result with the depth features, that is: mapping the pixel positions of 0 and non-0 in the downsampled segmentation result to the corresponding positions of the depth features, where the positions of 0 pixels give the background features and the positions of non-0 pixels give the foreground features.
3.3 Solving the Euclidean distance between each foreground feature of the matched first-branch depth features and each foreground feature of the second-branch depth features, summing all the distances and averaging; solving the Euclidean distance between each foreground feature and each background feature of the second-branch depth features, summing all the distances and averaging; and calculating the similarity score sim. The specific formulas are as follows:
ff = (1/N) · Σ_{i,j} ||f_i^1 − f_j^2||
fb = (1/M) · Σ_{j,k} ||f_j^2 − b_k^2||
sim = ff / fb
where f_i^1 denotes the ith foreground feature of the first-branch depth features, f_j^2 denotes the jth foreground feature of the second-branch depth features, b_k^2 denotes the kth background feature of the second-branch depth features, N denotes the total number of Euclidean distances summed in ff, and M denotes the total number of Euclidean distances summed in fb.
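A sketch of this similarity measure, assuming PyTorch tensors: the downsampling, matching, and averaging follow the description above, while the combination sim = ff / fb reflects the reconstructed formula, so the ratio form is an assumption.

    import torch
    import torch.nn.functional as F

    def similarity_loss(feat1, mask1, feat2, mask2, eps=1e-8):
        def split(feat, mask):
            # Downsample the segmentation result to the depth-feature size and
            # binarize: non-zero positions are foreground, zero are background.
            m = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")[0, 0] > 0.5
            f = feat[0].flatten(1).t()        # (H*W, C) feature vectors
            return f[m.flatten()], f[~m.flatten()]

        fg1, _ = split(feat1, mask1)          # first-branch foreground features
        fg2, bg2 = split(feat2, mask2)        # second-branch fg/bg features
        # ff: mean Euclidean distance over all fg1-fg2 pairs (N pairs);
        # fb: mean Euclidean distance over all fg2-bg2 pairs (M pairs).
        # Assumes both foreground sets are non-empty.
        ff = torch.cdist(fg1, fg2).mean()
        fb = torch.cdist(fg2, bg2).mean()
        return ff / (fb + eps)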
Step four, setting a loss function of the depth image collaborative segmentation network
4.1 The cross-entropy loss function is commonly used in segmentation networks to measure the difference between the segmentation result and manually labeled pixel-level labels. In the image collaborative segmentation network of the invention, the cross-entropy loss instead measures the difference between the segmentation result and the small number of annotation boxes, serving as one constraint of the network:
Lc = −Σ_x [y(x)·log(y0(x)) + (1−y(x))·log(1−y0(x))]
where y0 denotes the segmentation result of the first-branch image, y denotes the corresponding manual annotation-box mask, and x denotes a pixel.
4.2 The similarity score is taken as another constraint of the network.
4.3 The loss function of the whole network consists of the cross-entropy loss function and the similarity score, with the formula:
L=α*Lc+β*sim
according to experiments, the best effect is obtained when alpha is 1 and beta is 1.
In the depth image collaborative segmentation network training process, the iterative convergence condition is that the change rate of the loss function of the whole network does not exceed a preset change rate threshold.
The rate of change is denoted by γ, i.e.
γ = |L_N − L_{N−1}| / L_{N−1}
where L_{N−1} and L_N are the loss function L of the whole network at the (N−1)th and Nth iterations, respectively; that is, training stops when the rate of change of the loss function L over the last two training iterations does not exceed the preset change-rate threshold, which is set to 3 × 10^-4 in this embodiment.
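A sketch of the training loop with this stopping rule, reusing the helpers sketched above; the data loader and the rasterized box masks are assumptions:

    alpha, beta, threshold = 1.0, 1.0, 3e-4
    prev_loss = None
    for (img1, box1), img2 in loader:                  # paired same-category inputs
        (feat1, mask1), (feat2, mask2) = net(img1, img2)
        loss = alpha * cross_entropy_loss(mask1, box1) \
             + beta * similarity_loss(feat1, mask1, feat2, mask2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if prev_loss is not None:
            gamma = abs(loss.item() - prev_loss) / prev_loss   # rate of change
            if gamma <= threshold:
                break                                  # convergence reached
        prev_loss = loss.item()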
Step five, obtaining the segmentation results.
5.1 Inputting the image sets into the depth collaborative segmentation network and training it to obtain the segmentation results of all images, each of size 320 × 320; the segmentation results are then resized to the original image sizes to obtain the final segmentation results.
Firstly, the depth collaborative segmentation network of the invention is trained; then an image to be segmented is segmented with the trained network, specifically:
carrying out image preprocessing, graying, and size transformation on the image to be segmented so that it matches the input of the depth collaborative segmentation network;
and resizing the segmentation result output by the depth collaborative segmentation network to the original size of the image to be segmented to obtain the final segmentation result.
The depth collaborative segmentation network of the invention is trained on the deep learning image segmentation datasets prepared in step one; the validation set is then input into the trained network to segment the images, and the segmentation results are shown in FIG. 2. The pixel-level labels given in FIG. 2 are only for comparison with the segmentation results of the invention.
In order to objectively score the segmentation results of the deep learning image segmentation network of the invention and assess segmentation performance, in this embodiment the IOU value (objective score) is calculated from the image segmentation results and the pixel-level labels of the images in the dataset, as follows:
IOU = (1/n) · Σ_i |GT_i ∩ R_i| / |GT_i ∪ R_i|
where GT_i is the pixel-level label serving as the segmentation reference standard for the ith segmentation result, R_i denotes the ith segmentation result, and n is the number of segmentation results.
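A sketch of this score, assuming binary NumPy masks for the segmentation results R_i and the pixel-level labels GT_i:

    import numpy as np

    def mean_iou(results, labels):
        # Per-image intersection over union, averaged over the n images.
        scores = [np.logical_and(r, g).sum() / np.logical_or(r, g).sum()
                  for r, g in zip(results, labels)]
        return float(np.mean(scores))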
The objective score IOU of the segmentation results shown in FIG. 2 is 0.75. Verification on existing datasets shows that the depth image collaborative segmentation method based on a small number of labeling frames segments the common regions of a group of images very well.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (4)

1. A depth image collaborative segmentation method based on a small number of labeling frames, characterized by comprising the following steps:
step 1: setting a depth image collaborative segmentation network:
the depth image collaborative segmentation network comprises two branches, and a first branch network comprises a feature extraction network and a segmentation network; the second branch network comprises a feature extraction network and a segmentation network, and the network structures of the two branch networks are the same;
the feature extraction network adopts a convolutional neural network structure and is used for extracting the depth features of the input image;
the segmentation network also adopts a convolutional neural network structure and is used for carrying out foreground segmentation on the input image based on the depth features extracted by the feature extraction network to obtain a segmentation result;
step 2: carrying out deep learning training on the depth image collaborative segmentation network:
setting an image set S comprising all required categories according to the categories to be segmented, selecting a certain number of sample images for each image category from the image set S to obtain an image set S1, and giving each selected sample image an annotation box indicating its foreground region; the number of selected sample images is far smaller than the number of images to be segmented;
performing image preprocessing on all images included in the image set S and the image set S1, wherein the image preprocessing comprises image graying and size normalization processing;
initializing network parameters of the depth image collaborative segmentation network, and taking the image set S1 and its annotation boxes as the input of the first branch network of the depth image collaborative segmentation network; the image set S is used as the input of the second branch network;
obtaining the depth features of the input image from the output of the feature extraction network of each branch network, and the segmentation result of the input image from the output of the segmentation network of each branch network;
setting the loss function L of the depth image collaborative segmentation network as the weighted sum of a cross-entropy loss function Lc and a similarity loss function sim;
wherein Lc = −Σ_x [y(x)·log(y0(x)) + (1−y(x))·log(1−y0(x))], y0 denotes the segmentation result output by the first branch network, y denotes the annotation-box mask, and x denotes a pixel of the input image;
the similarity loss function sim is calculated as:
respectively down-sampling the segmentation results output by the segmentation networks of the two branch networks to the size of the depth features output by the feature extraction networks, and then respectively matching the segmentation result of each branch network with its depth features to obtain the matched first-branch depth features and second-branch depth features;
wherein the matching process is as follows:
corresponding pixel positions of 0 and non-0 in the downsampled segmentation result to corresponding positions of the depth feature, wherein the corresponding position of the pixel of 0 is a background feature of the depth feature, and the corresponding position of the pixel of non-0 is a foreground feature of the depth feature;
solving the Euclidean distance between each foreground feature of the matched first-branch depth features and each foreground feature of the matched second-branch depth features, summing all the Euclidean distances and then averaging to obtain the parameter ff;
solving the Euclidean distance between each foreground feature and each background feature of the second-branch depth features, summing all the Euclidean distances and then averaging to obtain the parameter fb;
and calculating the similarity loss function sim according to the formula sim = ff / fb;
in the deep learning training process of the depth image collaborative segmentation network, the convergence condition is set as: stopping training when the rate of change of the loss function L over the last two iterations does not exceed a preset change-rate threshold;
and step 3: carrying out image preprocessing on an image to be segmented, wherein the image preprocessing comprises image graying and size normalization processing;
and inputting the preprocessed image to be segmented into the trained depth image collaborative segmentation network, and resizing the segmentation result output by the network to the original size of the image to be segmented as the final segmentation result.
2. The method of claim 1, wherein in calculating the loss function L, the weights of the cross-entropy loss function Lc and the similarity loss function sim are both set to 1.
3. The method of claim 1, wherein in the deep learning training of the depth image collaborative segmentation network, the change-rate threshold is set to 3 × 10^-4.
4. The method of claim 1, wherein the network structure of the feature extraction network adopts the 16-layer VGG network, i.e., VGG-16.
CN201910813756.0A 2019-08-30 2019-08-30 Depth image collaborative segmentation method based on few labeling frames Active CN110675421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910813756.0A CN110675421B (en) 2019-08-30 2019-08-30 Depth image collaborative segmentation method based on few labeling frames

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910813756.0A CN110675421B (en) 2019-08-30 2019-08-30 Depth image collaborative segmentation method based on few labeling frames

Publications (2)

Publication Number Publication Date
CN110675421A true CN110675421A (en) 2020-01-10
CN110675421B CN110675421B (en) 2022-03-15

Family

ID=69075815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910813756.0A Active CN110675421B (en) 2019-08-30 2019-08-30 Depth image collaborative segmentation method based on few labeling frames

Country Status (1)

Country Link
CN (1) CN110675421B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160355595A1 (en) * 2012-02-27 2016-12-08 Boehringer Ingelheim International Gmbh Cx3cr1-binding polypeptides
CN106296728A (en) * 2016-07-27 2017-01-04 昆明理工大学 A kind of Segmentation of Moving Object method in unrestricted scene based on full convolutional network
CN107341805A (en) * 2016-08-19 2017-11-10 北京市商汤科技开发有限公司 Background segment and network model training, image processing method and device before image
CN106920250A (en) * 2017-02-14 2017-07-04 华中科技大学 Robot target identification and localization method and system based on RGB D videos
US20180293442A1 (en) * 2017-04-06 2018-10-11 Ants Technology (Hk) Limited Apparatus, methods and computer products for video analytics
CN108304765A (en) * 2017-12-11 2018-07-20 中国科学院自动化研究所 Multitask detection device for face key point location and semantic segmentation
CN109255790A (en) * 2018-07-27 2019-01-22 北京工业大学 A kind of automatic image marking method of Weakly supervised semantic segmentation
CN109741332A (en) * 2018-12-28 2019-05-10 天津大学 A kind of image segmentation and mask method of man-machine coordination
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image semantic segmentation method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IASONAS KOKKINOS et al.: "Synergy between Object Recognition and Image Segmentation Using the Expectation-Maximization Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence *
TAN KAI et al.: "Video advertisement classification method based on shot segmentation and spatial attention model", Computer Science *
MA XIAO et al.: "A survey of image co-segmentation methods", Journal of Computer-Aided Design & Computer Graphics *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798473A (en) * 2020-07-20 2020-10-20 福州大学 Image collaborative segmentation method based on weak supervised learning
EP4187421A4 (en) * 2020-08-14 2023-12-27 Huawei Technologies Co., Ltd. Media information transmission method and electronic device
CN112232355A (en) * 2020-12-11 2021-01-15 腾讯科技(深圳)有限公司 Image segmentation network processing method, image segmentation device and computer equipment
CN112232355B (en) * 2020-12-11 2021-04-02 腾讯科技(深圳)有限公司 Image segmentation network processing method, image segmentation device and computer equipment
CN113012158A (en) * 2021-03-09 2021-06-22 北京深境智能科技有限公司 Image collaborative segmentation method based on depth dense network and edge distance map weighting
CN113012158B (en) * 2021-03-09 2023-11-24 北京深境智能科技有限公司 Image collaborative segmentation method based on depth dense network and edge distance graph weighting
CN113379757A (en) * 2021-05-01 2021-09-10 首都医科大学宣武医院 Method for training brain image segmentation model and brain image segmentation method
CN113379757B (en) * 2021-05-01 2024-04-12 首都医科大学宣武医院 Method for training brain image segmentation model and brain image segmentation method

Also Published As

Publication number Publication date
CN110675421B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN110675421B (en) Depth image collaborative segmentation method based on few labeling frames
CN108428229B (en) Lung texture recognition method based on appearance and geometric features extracted by deep neural network
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN110097131B (en) Semi-supervised medical image segmentation method based on countermeasure cooperative training
CN115049936B (en) High-resolution remote sensing image-oriented boundary enhanced semantic segmentation method
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN106408030A (en) SAR image classification method based on middle lamella semantic attribute and convolution neural network
CN107992874A (en) Image well-marked target method for extracting region and system based on iteration rarefaction representation
CN114092832A (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN105931241A (en) Automatic marking method for natural scene image
CN114283285A (en) Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN114820655A (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN107657276B (en) Weak supervision semantic segmentation method based on searching semantic class clusters
CN114359702A (en) Method and system for identifying building violation of remote sensing image of homestead based on Transformer
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
CN116630971A (en) Wheat scab spore segmentation method based on CRF_Resunate++ network
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN111488797B (en) Pedestrian re-identification method
CN110136098B (en) Cable sequence detection method based on deep learning
CN103793720B (en) A kind of eye locating method and system
CN114998587A (en) Remote sensing image building semantic segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant