CN110675403A - Multi-instance image segmentation method based on coding auxiliary information

Multi-instance image segmentation method based on coding auxiliary information

Info

Publication number
CN110675403A
Authority
CN
China
Prior art keywords
network
loss
spectrum
segmentation
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910814122.7A
Other languages
Chinese (zh)
Other versions
CN110675403B (en)
Inventor
吴庆波
李辉
魏浩冉
吴晨豪
李宏亮
孟凡满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910814122.7A
Publication of CN110675403A
Application granted
Publication of CN110675403B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-instance image segmentation method based on coding auxiliary information, belonging to the technical field of image coding and instance segmentation. The method is proposed to overcome the shortcoming of existing multi-instance image segmentation methods, which use only the original image for instance segmentation. From the input image, an image decoding algorithm yields luminance and chrominance macroblocks of different sizes together with the intra-frame prediction direction information; the resulting coding-unit scale spectrum and intra-frame prediction direction spectrum are then used as coding auxiliary information, so the information contained in the image is exploited more fully. The invention applies the long short-term memory (LSTM) network outside the fields of text classification and natural language processing, using it to fuse the scale spectrum, the direction spectrum and the original image, thereby improving the accuracy of multi-instance image segmentation.

Description

Multi-instance image segmentation method based on coding auxiliary information
Technical Field
The invention relates to the technical field of image coding and instance segmentation, in particular to a multi-instance image segmentation method based on coding auxiliary information.
Background
In the field of computer vision, traditional multi-instance segmentation methods only feed each image of the training set into a segmentation network. The information contained in the images is therefore not fully exploited, the segmentation results improve little, and the accuracy lags noticeably behind that of classification and detection tasks.
Because images contain a large amount of data, the images encountered in practice are compressed for convenient storage. For example, JPEG images are produced by the JPEG compression algorithm, and video is produced by the H.264/HEVC video compression algorithms. Video and image compression algorithms typically go through color-space conversion (RGB to YUV), sampling, block partitioning, the Discrete Cosine Transform (DCT), Zigzag ordering, quantization, and entropy coding. During image or video encoding, the coding and intra-frame prediction directions differ for luminance and chrominance macroblocks of different sizes. Since coding units of the same size, or blocks sharing the same intra-frame prediction direction, indicate strongly correlated information, combining the coding-unit scale information and the intra-frame prediction direction information during multi-instance segmentation takes more image information into account than the traditional approach and helps improve the performance of the multi-instance segmentation network.
Disclosure of Invention
The invention aims to address the shortcoming of existing multi-instance image segmentation methods, which use only the original image for instance segmentation, by providing a multi-instance image segmentation method based on coding auxiliary information. The invention extracts the coding-unit scale information and the intra-frame prediction direction information from the image to obtain the corresponding scale spectrum and direction spectrum as auxiliary information, and fuses the original image with the two feature spectra through a long short-term memory (LSTM) network to perform multi-instance image segmentation.
The invention relates to a multi-instance image segmentation method based on coding auxiliary information, which comprises the following steps:
step 1, setting a segmentation network based on a convolutional neural network:
the segmentation network comprises a feature pyramid network, a proposal generation network, a region-of-interest (ROI) extraction network, a mask prediction network and fully connected layers;
the feature pyramid network is used for feature extraction, and the obtained feature spectrum is input into the proposal generation network and the ROI extraction network respectively;
the proposal generation network is used for generating bounding box proposals and inputting them into the ROI extraction network;
the ROI extraction network is used for extracting regions of interest; classification and bounding-box regression are then performed on the ROIs through two fully connected layers respectively;
the mask prediction network comprises two branches connected to the output of the ROI extraction network;
wherein the first branch comprises four identical convolution layers and a deconvolution layer connected in sequence; the deconvolution layer outputs a feature spectrum of size M×M×K, where M is a preset value and K denotes the number of sample classes;
the second branch comprises two convolution layers, a fully connected layer and a reshape layer connected in sequence; the fully connected layer is used to obtain a 1×M² feature vector, and the reshape layer is used to generate an M×M×1 feature spectrum for predicting foreground and background;
the feature spectra of the two branches of the mask prediction network are concatenated to obtain a prediction mask of size M×M×(K+1);
and the loss function Loss of the segmentation network is Loss = loss_cls + loss_box + loss_mask, where loss_cls, loss_box and loss_mask respectively denote the classification loss, bounding-box regression loss and mask loss of the segmentation network;
step 2, carrying out convolutional neural network training processing on the segmentation network;
collecting training sample pictures and extracting fusion characteristics of the training sample pictures;
initializing the network parameters of the segmentation network and inputting the fusion features of the training sample pictures into the segmentation network; based on the classification output, the bounding-box regression output and the mask prediction network output, the differences from the ground-truth classes, bounding boxes and masks are computed to obtain the classification loss, bounding-box regression loss and mask loss of the segmentation network respectively, thereby obtaining the current loss function Loss;
when the change rate of the Loss function Loss does not exceed a preset threshold value, stopping training, and obtaining a trained segmented network based on the current network parameters of the segmented network;
step 3, extracting the fusion characteristics of the picture to be segmented, inputting the fusion characteristics into a trained segmentation network, and outputting a multi-instance image segmentation result of the picture to be segmented based on classification and frame regression;
the extraction mode of the fusion features is as follows:
performing image decoding (image and video decoding processing) on the picture from which the fusion features are to be extracted, to obtain luminance and chrominance macroblocks of different sizes and the intra-frame prediction direction information; different labels are assigned to coding units of different scales in the image to obtain the coding-unit scale spectrum; different labels are assigned to the different intra-frame prediction directions to obtain the intra-frame prediction direction spectrum of the image;
and the obtained scale spectrum and direction spectrum are fused with the original image through LSTM networks to obtain the fusion features:
the picture from which the fusion features are to be extracted is input into a first LSTM network, and a feature spectrum h1 is obtained from the output of the first LSTM network;
the feature spectrum h1 is concatenated with the coding-unit scale spectrum and input into a second LSTM network, and a feature spectrum h2 is obtained from the output of the second LSTM network;
the feature spectrum h2 is then concatenated with the intra-frame prediction direction spectrum and input into a third LSTM network, and the fusion features of the picture are obtained from the output of the third LSTM network.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
(1) From the input image, the image decoding algorithm yields luminance and chrominance macroblocks of different sizes and extracts the intra-frame prediction direction information, so the obtained coding-unit scale spectrum and intra-frame prediction direction spectrum serve as coding auxiliary information and the information contained in the image is exploited more fully.
(2) The invention applies the LSTM outside the fields of text classification and natural language processing, using it to fuse the scale spectrum, the direction spectrum and the original image, thereby improving the accuracy of multi-instance image segmentation.
Drawings
Fig. 1 is a diagram illustrating an implementation process of the present invention in an embodiment.
FIG. 2 is a schematic diagram of an LSTM feature fusion mode in an embodiment.
Fig. 3 is a block diagram schematically illustrating the structure of the segmentation network according to an embodiment.
Fig. 4 is a block diagram illustrating a Mask branch structure according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
The invention extracts the coding-unit scale information and the intra-frame prediction direction information of the image through image and video decoding algorithms; because the information contained in the image is exploited more fully, the shortcoming of traditional methods that perform instance segmentation using only the original image is avoided. The long short-term memory (LSTM) network is traditionally widely used in text classification and natural language processing; the present invention instead uses the LSTM to fuse different feature spectra.
Referring to fig. 1, the multi-instance image segmentation method based on coding auxiliary information of the present invention comprises four parts: image input, image decoding, LSTM feature fusion and segmentation. That is, the invention mainly comprises an image decoding module, an LSTM feature fusion module and a convolutional neural network multi-instance segmentation module, implemented as follows:
A. Image decoding module: for each training sample picture, luminance and chrominance macroblocks of different sizes (4 × 4, 8 × 8, 16 × 16) and the intra-frame prediction direction information (vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left and horizontal-up modes) can be obtained through image and video compression decoding algorithms. Different labels (label) are assigned to the 4 × 4, 8 × 8 and 16 × 16 coding units in the image to obtain the coding-unit scale spectrum, and different labels (label) are assigned to the different intra-frame prediction directions to obtain the intra-frame prediction direction spectrum of the image.
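The short Python sketch below illustrates how the two auxiliary spectra might be rasterized from decoded block information. It is an illustrative assumption rather than the patent's decoder: the list of (x, y, size, mode) tuples is a hypothetical decoder interface, and the label values are arbitrary.

```python
# Minimal sketch (not the patent's decoder): given per-block size and intra-prediction
# mode reported by an image/video decoder, rasterize the two auxiliary label maps.
import numpy as np

SIZE_LABELS = {4: 1, 8: 2, 16: 3}                      # one label per coding-unit scale
MODE_LABELS = {m: i + 1 for i, m in enumerate(
    ["vertical", "horizontal", "dc", "diag_down_left", "diag_down_right",
     "vertical_right", "horizontal_down", "vertical_left", "horizontal_up"])}

def build_auxiliary_spectra(height, width, blocks):
    """Return (scale_spectrum, direction_spectrum) as HxW integer label maps."""
    scale_map = np.zeros((height, width), dtype=np.uint8)
    dir_map = np.zeros((height, width), dtype=np.uint8)
    for x, y, size, mode in blocks:                     # blocks: hypothetical decoder output
        scale_map[y:y + size, x:x + size] = SIZE_LABELS[size]
        dir_map[y:y + size, x:x + size] = MODE_LABELS[mode]
    return scale_map, dir_map

# Example: two decoded 8x8 blocks of an 8x16 image region.
blocks = [(0, 0, 8, "dc"), (8, 0, 8, "vertical")]
scale_spectrum, direction_spectrum = build_auxiliary_spectra(8, 16, blocks)
```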
B. LSTM feature fusion module: the scale spectrum and direction spectrum obtained by decoding the image are fused with the original image through LSTM networks. The LSTM is commonly used in the field of natural language processing and handles long-term dependencies well. An LSTM comprises a forget gate, an input gate and an output gate; the original image, the coding-unit scale spectrum and the intra-frame prediction direction spectrum are input into the LSTMs respectively, feature fusion is carried out under the control of the gating signals, and the fused result is finally obtained.
Referring to fig. 2, the specific process of feature fusion using LSTM is as follows:
First, the original image X1 is input into an LSTM network to obtain a feature spectrum h1; h1 is concatenated with the coding-unit scale spectrum X2 and the result is input into an LSTM network to obtain a feature spectrum h2; h2 is then concatenated with the intra-frame prediction direction spectrum X3 and the result is input into an LSTM network to obtain the final fused feature spectrum h3, which is fed to the segmentation network.
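A minimal PyTorch sketch of this three-stage cascade is given below. The patent does not fix how a 2-D image is fed to an LSTM; here each image row is treated as one time step (a convolutional LSTM would be another plausible reading), and the hidden size and channel counts are illustrative only.

```python
# Minimal sketch of the cascaded LSTM fusion, assuming each image row is one time step.
import torch
import torch.nn as nn

class CascadedLSTMFusion(nn.Module):
    def __init__(self, width, hidden=64):
        super().__init__()
        # One LSTM per fusion stage; input size = channels * width of one row.
        self.lstm1 = nn.LSTM(3 * width, hidden, batch_first=True)           # original image X1
        self.lstm2 = nn.LSTM(hidden + width, hidden, batch_first=True)      # h1 ++ scale spectrum X2
        self.lstm3 = nn.LSTM(hidden + width, hidden, batch_first=True)      # h2 ++ direction spectrum X3

    def forward(self, x1, x2, x3):
        # x1: (B, 3, H, W) image; x2, x3: (B, 1, H, W) label spectra.
        b, _, h, w = x1.shape
        rows = lambda t: t.permute(0, 2, 1, 3).reshape(b, h, -1)             # (B, H, C*W) row sequence
        h1, _ = self.lstm1(rows(x1))
        h2, _ = self.lstm2(torch.cat([h1, rows(x2)], dim=-1))
        h3, _ = self.lstm3(torch.cat([h2, rows(x3)], dim=-1))
        return h3                                                            # fused feature spectrum

fuser = CascadedLSTMFusion(width=64)
h3 = fuser(torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```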
C. A convolutional neural network multi-instance segmentation module: and (4) putting the fused features obtained by the operation B into a segmentation network for training. And then realizing multi-instance segmentation processing of the image to be segmented based on the trained segmentation network to obtain a corresponding segmentation result.
The framework of the convolutional neural net based segmentation network of the present invention is shown in fig. 3.
C1. FPN feature extraction: the fusion result (h3) is first input into the Feature Pyramid Network (FPN) for feature extraction to obtain a feature spectrum f of size H×W×C, which then passes through convolution layer CONV1 to obtain a feature spectrum f1 of size H1×W1×C1.
The feature extraction process of the feature pyramid network can be found in the document "Feature Pyramid Networks for Object Detection" (Kaiming He et al.).
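For reference, a minimal sketch of this stage using torchvision's off-the-shelf FeaturePyramidNetwork is shown below; the backbone levels and channel sizes are illustrative assumptions, not values taken from the patent.

```python
# Minimal sketch of FPN feature extraction with torchvision; channel sizes are illustrative.
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

fpn = FeaturePyramidNetwork(in_channels_list=[64, 128, 256], out_channels=256)
features = OrderedDict(
    c3=torch.randn(1, 64, 64, 64),     # assumed backbone outputs at three scales
    c4=torch.randn(1, 128, 32, 32),
    c5=torch.randn(1, 256, 16, 16),
)
pyramid = fpn(features)                # OrderedDict of 256-channel maps, one per level
f = pyramid["c3"]                      # e.g. the finest-level feature spectrum
```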
C2. Proposal generation: to obtain effective proposals, 9 region proposals, i.e. bounding box proposals, are predicted for each point in the feature spectrum f1. To determine foreground and background, f1 passes through convolution layer CONV2 (convolution kernel size H1×W1×C1, 18 convolution kernels) to obtain a 1×18 output, and a normalized exponential function (Softmax) computes the probabilities that each of the 9 region proposals belongs to the foreground and the background; to obtain the center coordinates, length and width of the bounding boxes, f1 passes through convolution layer CONV3 (convolution kernel size H1×W1×C1, 36 convolution kernels) to obtain a 1×36 output.
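A minimal sketch of such a proposal head is given below. It follows the 18-channel objectness / 36-channel box layout (9 anchors × 2 classes and 9 anchors × 4 coordinates), but uses 1×1 convolutions for per-location prediction, which is a simplifying assumption rather than the kernel dimensions stated in the text.

```python
# Minimal sketch of the proposal heads: objectness scores with Softmax and box parameters.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.cls_conv = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)  # CONV2 analogue
        self.box_conv = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # CONV3 analogue

    def forward(self, f1):
        b, _, h, w = f1.shape
        scores = self.cls_conv(f1).view(b, 9, 2, h, w)
        probs = torch.softmax(scores, dim=2)             # foreground/background probabilities
        boxes = self.box_conv(f1).view(b, 9, 4, h, w)    # center x, center y, width, height
        return probs, boxes

head = ProposalHead()
probs, boxes = head(torch.randn(1, 256, 32, 32))
```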
C3. ROI extraction: the features corresponding to the position coordinates of the generated bounding box proposals are extracted from the feature spectrum.
ROI Align divides the extracted features into N×N rectangular blocks; each rectangular block is further divided into 4 sub-blocks, the center points of the 4 sub-blocks are obtained by bilinear interpolation, and a max-pooling operation is applied over the 4 center points, finally yielding an N×N×256 region of interest (ROI). N is a preset value chosen according to the actual application scenario and requirements.
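A minimal sketch of this step with torchvision's roi_align is shown below: sampling_ratio=2 samples each bin at 2×2 bilinearly interpolated points, matching the 4 sub-blocks described above. Note one difference: torchvision averages the sampled points, whereas the text describes max pooling over them; the feature-map stride and box values here are illustrative assumptions.

```python
# Minimal sketch of ROI extraction with torchvision's roi_align.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 32, 32)                 # backbone/FPN output, stride assumed 16
proposals = [torch.tensor([[40.0, 60.0, 200.0, 220.0]])]  # one (x1, y1, x2, y2) box per image

N = 7                                                      # preset output resolution
rois = roi_align(feature_map, proposals, output_size=(N, N),
                 spatial_scale=1.0 / 16, sampling_ratio=2)
print(rois.shape)                                          # (num_boxes, 256, N, N)
```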
C4. Classification and bounding box regression: the region of interest is classified and bounding-box regressed through two fully connected layers with parameters (N×N×256 – 1024) and (1×1024 – 1024) respectively.
C5. Mask prediction: the first branch of the mask head first passes through 4 identical convolution layers (conv1, conv2, conv3 and conv4) with convolution kernel size 3×3×256, 256 convolution kernels, convolution stride set to 1 and padding set to 'SAME', so that the resolution of the feature spectrum stays unchanged; a deconvolution layer with factor 2 then outputs a feature spectrum of size M×M×K, where K denotes the number of sample classes;
the second branch connects the output of convolution layer conv3 to convolution layers conv4_FC (kernel size 3×3×256, 256 kernels, stride 1, padding 'SAME') and conv5_FC (kernel size 3×3×256, 128 kernels, stride 1, padding 'SAME'), and then passes through a fully connected layer FC (N×N×128 – M²) to obtain a 1×M² feature vector, which is reshaped into an M×M×1 feature spectrum used to predict background and foreground.
The feature spectra of the two branches are concatenated to obtain a mask of size M×M×(K+1), on which the mask loss is computed. A block diagram of the mask branch is shown in fig. 4.
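A minimal PyTorch sketch of this two-branch mask head follows: an FCN branch (four 3×3 convolutions plus a ×2 deconvolution giving M×M×K) and an FC branch tapping the conv3 output (conv4_FC, conv5_FC, a fully connected layer of size M² reshaped to M×M×1), concatenated into M×M×(K+1). Channel counts follow the text above; the values of M, K and the input ROI size are illustrative assumptions.

```python
# Minimal sketch of the two-branch mask head described above.
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80, m=28):
        super().__init__()
        self.m = m
        def conv(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())
        self.conv1, self.conv2, self.conv3 = conv(in_ch, 256), conv(256, 256), conv(256, 256)
        self.conv4 = conv(256, 256)
        self.deconv = nn.ConvTranspose2d(256, num_classes, 2, stride=2)    # factor-2 upsampling
        self.conv4_fc = conv(256, 256)
        self.conv5_fc = conv(256, 128)
        self.fc = nn.Linear((m // 2) * (m // 2) * 128, m * m)              # 1 x M^2 vector

    def forward(self, roi):                        # roi: (B, 256, M/2, M/2), e.g. 14x14
        x3 = self.conv3(self.conv2(self.conv1(roi)))
        mask_k = self.deconv(self.conv4(x3))       # first branch: (B, K, M, M)
        fc_in = self.conv5_fc(self.conv4_fc(x3))   # second branch taps the conv3 output
        mask_1 = self.fc(fc_in.flatten(1)).view(-1, 1, self.m, self.m)     # reshape to MxMx1
        return torch.cat([mask_k, mask_1], dim=1)  # concatenated: (B, K+1, M, M)

head = MaskHead(num_classes=80, m=28)
out = head(torch.randn(2, 256, 14, 14))            # torch.Size([2, 81, 28, 28])
```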
D. Loss function calculation. The loss function Loss of the whole network consists of three parts: the classification loss loss_cls, the bounding-box regression loss loss_box and the mask loss loss_mask. The overall network loss is therefore:
Loss = loss_cls + loss_box + loss_mask
Training of the segmentation network is driven by the loss function Loss; when Loss no longer changes significantly, i.e. its rate of change does not exceed a preset threshold, training is stopped, and the trained segmentation network is obtained from the current network parameters of the segmentation network.
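The sketch below illustrates the combined loss and the stopping rule based on its rate of change. The individual loss terms (cross-entropy, smooth L1, per-pixel binary cross-entropy) and the threshold value are stand-in assumptions, since the text only specifies the sum and the stopping criterion.

```python
# Minimal sketch of Loss = loss_cls + loss_box + loss_mask and the stopping rule.
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_gt, box_pred, box_gt, mask_logits, mask_gt):
    loss_cls = F.cross_entropy(cls_logits, cls_gt)                      # classification loss
    loss_box = F.smooth_l1_loss(box_pred, box_gt)                       # bounding-box regression loss
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)  # mask loss
    return loss_cls + loss_box + loss_mask

def should_stop(prev_loss, curr_loss, threshold=1e-3):
    """Stop when the relative change of Loss no longer exceeds the preset threshold."""
    return abs(prev_loss - curr_loss) / max(prev_loss, 1e-12) <= threshold
```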
According to the invention, the image decoding algorithm applied to the input image yields luminance and chrominance macroblocks of different sizes and extracts the intra-frame prediction direction information, so the obtained coding-unit scale spectrum and intra-frame prediction direction spectrum serve as coding auxiliary information and the information contained in the image can be fully exploited. By using the LSTM outside the fields of text classification and natural language processing, and fusing the scale spectrum, direction spectrum and original image with the LSTM, the accuracy of multi-instance image segmentation can be improved.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (3)

1. A multi-instance image segmentation method based on coding auxiliary information, comprising the steps of:
step 1, setting a segmentation network based on a convolutional neural network:
the segmentation network comprises a feature pyramid network, a proposal generation network, a region-of-interest (ROI) extraction network, a mask prediction network and fully connected layers;
the feature pyramid network is used for feature extraction, and the obtained feature spectrum is input into the proposal generation network and the ROI extraction network respectively;
the proposal generation network is used for generating bounding box proposals and inputting them into the ROI extraction network;
the ROI extraction network is used for extracting regions of interest; classification and bounding-box regression are then performed on the ROIs through two fully connected layers respectively;
the mask prediction network comprises two branches connected to the output of the ROI extraction network;
wherein the first branch comprises four identical convolution layers and a deconvolution layer connected in sequence; the deconvolution layer outputs a feature spectrum of size M×M×K, where M is a preset value and K denotes the number of sample classes;
the second branch comprises two convolution layers, a fully connected layer and a reshape layer connected in sequence; the fully connected layer is used to obtain a 1×M² feature vector, and the reshape layer is used to generate an M×M×1 feature spectrum for predicting foreground and background;
the feature spectra of the two branches of the mask prediction network are concatenated to obtain a prediction mask of size M×M×(K+1);
and the loss function Loss of the segmentation network is Loss = loss_cls + loss_box + loss_mask, where loss_cls, loss_box and loss_mask respectively denote the classification loss, bounding-box regression loss and mask loss of the segmentation network;
step 2, carrying out convolutional neural network training processing on the segmentation network;
collecting training sample pictures and extracting fusion characteristics of the training sample pictures;
initializing the network parameters of the segmentation network and inputting the fusion features of the training sample pictures into the segmentation network; based on the classification output, the bounding-box regression output and the mask prediction network output, the differences from the ground-truth classes, bounding boxes and masks are computed to obtain the classification loss, bounding-box regression loss and mask loss of the segmentation network respectively, thereby obtaining the current loss function Loss;
when the change rate of the Loss function Loss does not exceed a preset threshold value, stopping training, and obtaining a trained segmented network based on the current network parameters of the segmented network;
step 3, extracting the fusion characteristics of the picture to be segmented, inputting the fusion characteristics into a trained segmentation network, and outputting a multi-instance image segmentation result of the picture to be segmented based on classification and frame regression;
the extraction mode of the fusion features is as follows:
performing image decoding processing on the picture from which the fusion features are to be extracted, to obtain luminance and chrominance macroblocks of different sizes and the intra-frame prediction direction information; different labels are assigned to coding units of different scales to obtain the coding-unit scale spectrum; different labels are assigned to the different intra-frame prediction directions to obtain the intra-frame prediction direction spectrum of the image;
and the obtained scale spectrum and direction spectrum are fused with the original image through LSTM networks to obtain the fusion features:
the picture from which the fusion features are to be extracted is input into a first LSTM network, and a feature spectrum h1 is obtained from the output of the first LSTM network;
the feature spectrum h1 is concatenated with the coding-unit scale spectrum and input into a second LSTM network, and a feature spectrum h2 is obtained from the output of the second LSTM network;
the feature spectrum h2 is then concatenated with the intra-frame prediction direction spectrum and input into a third LSTM network, and the fusion features of the picture are obtained from the output of the third LSTM network.
2. The method of claim 1, wherein, when the fusion features are extracted, the luminance and chrominance macroblocks have three sizes: 4×4, 8×8 and 16×16;
the intra-frame prediction directions include the vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left and horizontal-up modes.
3. The method of claim 1, wherein the processing procedure of the ROI extraction network is:
extracting the corresponding features from the feature spectrum output by the feature pyramid network based on the bounding box proposals generated by the proposal generation network;
dividing the extracted features into N×N rectangular blocks, dividing each rectangular block into 4 sub-blocks, obtaining the center points of the 4 sub-blocks by bilinear interpolation, and performing a max-pooling operation on the 4 center points to obtain an N×N×256 region of interest, where N is a preset value.
CN201910814122.7A 2019-08-30 2019-08-30 Multi-instance image segmentation method based on coding auxiliary information Active CN110675403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910814122.7A CN110675403B (en) 2019-08-30 2019-08-30 Multi-instance image segmentation method based on coding auxiliary information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910814122.7A CN110675403B (en) 2019-08-30 2019-08-30 Multi-instance image segmentation method based on coding auxiliary information

Publications (2)

Publication Number Publication Date
CN110675403A true CN110675403A (en) 2020-01-10
CN110675403B CN110675403B (en) 2022-05-03

Family

ID=69075852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910814122.7A Active CN110675403B (en) 2019-08-30 2019-08-30 Multi-instance image segmentation method based on coding auxiliary information

Country Status (1)

Country Link
CN (1) CN110675403B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138826A1 (en) * 2016-11-14 2019-05-09 Zoox, Inc. Spatial and Temporal Information for Semantic Segmentation
CN107123123A (en) * 2017-05-02 2017-09-01 电子科技大学 Image segmentation quality evaluating method based on convolutional neural networks
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
CN107909015A (en) * 2017-10-27 2018-04-13 广东省智能制造研究所 Hyperspectral image classification method based on convolutional neural networks and empty spectrum information fusion
CN110175974A (en) * 2018-03-12 2019-08-27 腾讯科技(深圳)有限公司 Image significance detection method, device, computer equipment and storage medium
CN108898137A (en) * 2018-05-25 2018-11-27 黄凯 A kind of natural image character identifying method and system based on deep neural network
CN109255298A (en) * 2018-08-07 2019-01-22 南京工业大学 Safety helmet detection method and system in dynamic background

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DONGQING ZHANG: "A multi-level convolutional LSTM model for the segmentation of left ventricle myocardium in infarcted porcine cine MR images", 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) *
FANMAN MENG: "Weakly Supervised Part Proposal Segmentation From Multiple Images", IEEE Transactions on Image Processing, Volume 26, Issue 8, Aug. 2017 *
JIANAN LI: "Scale-aware Fast R-CNN for Pedestrian Detection", IEEE Transactions on Multimedia *
WEN SHI: "Segmentation quality evaluation based on multi-scale convolutional neural networks", 2017 IEEE Visual Communications and Image Processing (VCIP) *
MENG FANMAN (孟凡满): "Research on the theory and methods of image co-segmentation" (图像的协同分割理论与方法研究), China Doctoral Dissertations Full-text Database (Information Science and Technology) *
DENG SHUO (邓朔): "Research on fast graph cut algorithms based on pre-segmentation information fusion" (基于预分割信息融合的快速图割算法研究), China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260666A (en) * 2020-01-19 2020-06-09 上海商汤临港智能科技有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN111260666B (en) * 2020-01-19 2022-05-24 上海商汤临港智能科技有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN111369565A (en) * 2020-03-09 2020-07-03 麦克奥迪(厦门)医疗诊断系统有限公司 Digital pathological image segmentation and classification method based on graph convolution network
CN111369565B (en) * 2020-03-09 2023-09-15 麦克奥迪(厦门)医疗诊断系统有限公司 Digital pathological image segmentation and classification method based on graph convolution network
CN111881981A (en) * 2020-07-29 2020-11-03 苏州科本信息技术有限公司 Mask coding-based single-stage instance segmentation method
CN112651982A (en) * 2021-01-12 2021-04-13 杭州智睿云康医疗科技有限公司 Image segmentation method and system based on image and non-image information
GB2606816A (en) * 2021-02-16 2022-11-23 Nvidia Corp Using neural networks to perform object detection, instance segmentation, and semantic correspondence from bounding box supervision
CN113870371A (en) * 2021-12-03 2021-12-31 浙江霖研精密科技有限公司 Picture color transformation device and method based on generation countermeasure network and storage medium

Also Published As

Publication number Publication date
CN110675403B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN110675403B (en) Multi-instance image segmentation method based on coding auxiliary information
Hu et al. Towards coding for human and machine vision: A scalable image coding approach
CN111868751B (en) Using non-linear functions applied to quantization parameters in machine learning models for video coding
CN109886282B (en) Object detection method, device, computer-readable storage medium and computer equipment
CN112379231B (en) Equipment detection method and device based on multispectral image
Wang et al. Towards analysis-friendly face representation with scalable feature and texture compression
CN111311578A (en) Object classification method and device based on artificial intelligence and medical imaging equipment
DE202012013410U1 (en) Image compression with SUB resolution images
EP3718306B1 (en) Cluster refinement for texture synthesis in video coding
Lin et al. Generative adversarial network-based frame extrapolation for video coding
KR102342526B1 (en) Method and Apparatus for Video Colorization
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
Zonglei et al. Deep compression: A compression technology for apron surveillance video
CN114067009A (en) Image processing method and device based on Transformer model
Li et al. ROI-based deep image compression with Swin transformers
WO2023020513A1 (en) Method, device, and medium for generating super-resolution video
JP4490351B2 (en) Inter-layer prediction processing method, inter-layer prediction processing apparatus, inter-layer prediction processing program, and recording medium therefor
Wang et al. A feature-based video transmission framework for visual IoT in fog computing systems
US20230135978A1 (en) Generating alpha mattes for digital images utilizing a transformer-based encoder-decoder
US20230342986A1 (en) Autoencoder-based segmentation mask generation in an alpha channel
CN115294429A (en) Feature domain network training method and device
CN114565764A (en) Port panorama sensing system based on ship instance segmentation
CN114040140B (en) Video matting method, device, system and storage medium
Jin et al. Fast QTBT partition algorithm for JVET intra coding based on CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant