CN108985298B - Human body clothing segmentation method based on semantic consistency - Google Patents
- Publication number: CN108985298B (application CN201810631795.4A)
- Authority
- CN
- China
- Prior art keywords
- train
- semantic
- picture
- image
- clothes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Abstract
The invention discloses a human body clothing segmentation method based on semantic consistency, which analyzes the semantics of the clothing regions in a given single-frame, single-person clothing picture. The method comprises the following steps: acquiring an image data set for training human body clothing segmentation, and defining the algorithm target; retrieving, for each single-frame image in the data set, its neighboring pictures in the semantic space and forming picture pairs; jointly modeling the neighborhood relationship of each picture pair in the manifold space; establishing a prediction model for clothing segmentation; and analyzing the semantic information of the clothes in a picture by using the prediction model. The method is suitable for clothing segmentation analysis in real images, and shows good accuracy and robustness under a variety of complex conditions.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a human body clothing segmentation method based on semantic consistency.
Background
Semantic segmentation of clothing is a low-level vision technique that often serves as auxiliary information for higher-level vision tasks such as clothing retrieval and clothing attribute analysis. The goal of clothing segmentation is, given an image, to predict the class label of every pixel in the image. The key difficulties of clothing segmentation are the large intra-class appearance variation of clothing types and the non-rigid, extremely deformable nature of clothes. Traditional methods generally treat clothing segmentation as a generic semantic segmentation problem; although some of these methods have made breakthroughs in classification accuracy, they do not fully exploit the information in the available data.
Owing to the effectiveness of statistical modeling, learning-based methods are increasingly applied to semantic segmentation tasks. Existing learning-based methods mainly adopt an end-to-end deep learning framework that takes an original three-channel color picture as input and outputs a predicted semantic segmentation map. Deep learning can effectively solve the feature representation problem, but clothing segmentation lacks sufficiently large, accurately labeled data sets, which limits its effect; meanwhile, ordinary convolutions cannot extract reasonable features for the deformable characteristics of clothes.
Disclosure of Invention
Aiming at the above problems, the invention provides a human body clothing segmentation method based on semantic consistency. The technical scheme adopted by the invention is as follows:
a human body clothing segmentation method based on semantic consistency comprises the following steps:
s1, acquiring an image data set for training human body clothes segmentation, and defining an algorithm target;
s2, for each single-frame image in the data set, retrieving its neighboring pictures in the semantic space and forming picture pairs;
s3, carrying out joint modeling on the neighborhood relationship of each picture pair in the manifold space (namely the semantic space);
s4, establishing a prediction model for clothing segmentation;
and S5, analyzing semantic information of clothes in the picture by using the prediction model.
Preferably, the image data set in S1 includes single-frame images I_train and artificially labeled semantic segmentation maps P_train; the algorithm target is to predict the semantic segmentation map of the clothes in a single-frame image.
Preferably, the S2 includes the following substeps:
s21, for each single-frame image I_train, extracting the human pose feature C_pose with the pre-trained human pose estimation model OpenPose and the image appearance feature C_appearance, and cascading C_pose and C_appearance to obtain the picture feature C_I corresponding to the single-frame image I_train;
S22, calculating the pairwise similarity of the pictures in the image data set, wherein the similarity α of any two pictures I_train and I'_train is calculated as follows:
wherein C_empty is the picture feature of an image of the same size as C_I with all values 0; C'_I is the picture feature of the single-frame image I'_train, obtained by the same method as for I_train; Euclidean() denotes the Euclidean distance;
s23, for each single-frame image I_train, computing and comparing the similarities, and retrieving the most similar image I'_train to obtain the picture pair (I_train, I'_train) and the corresponding similarity value α.
Preferably, the S3 includes the following substeps:
s31, using four layers of convolution and pooling operations, extracting features from I_train and I'_train in the picture pair respectively to obtain S_I and S'_I, namely:

S_I = f_single(I_train; θ)

S'_I = f_single(I'_train; θ)

wherein f_single() is the function constructed by the four-layer convolution and pooling operations, and θ is the convolution parameter;
S32, fusing the features S_I and S'_I obtained in S31 to obtain the fused feature S_interaction:

S_interaction = (1 - α) * S_I + α * S'_I
S33, using a three-layer convolution operation on the fused feature S_interaction to reconstruct the semantic information of the picture, obtaining a semantic segmentation map of one-eighth the size of image I_train; meanwhile, using four-layer convolution and upsampling operations on the single-image feature S_I to reconstruct semantic information, obtaining a semantic segmentation map of the same size as image I_train;
S34, the operations of S31-S33 are performed on all the picture pairs.
Preferably, the S4 includes the following substeps:
s41, establishing a deep convolutional neural network, wherein the input of the neural network is the picture pair (I_train, I'_train) and the output is the predicted semantic segmentation map for picture I_train; the structure of the neural network is represented as a mapping, formulated as:
wherein θ_1 denotes the convolution parameters used by the prediction model when predicting the semantic segmentation result, and f() is the prediction function of the deep convolutional neural network;
s42, the loss function of the neural network is as follows:
wherein P and P_small respectively denote the real semantic segmentation map at the original scale and the real semantic segmentation map at the small scale; the first term of L is the loss error between the predicted semantic segmentation map at the original scale and its real semantic map; the second term is the loss error between the predicted semantic segmentation map at the small scale and its real semantic map, the small scale being one eighth of the original scale; λ is a weight parameter;
s43, training the whole neural network under the loss function L by using an Adam optimization method and a back propagation algorithm until the neural network converges.
The method is based on a deep neural network; it exploits the neighborhood relationship of the semantic information of similar pictures in the manifold space and adopts deformable convolution to model the deformation characteristics of clothes, so it adapts well to clothing semantic segmentation in different scenes. Compared with traditional clothing semantic segmentation methods, the method has the following benefits:
First, the clothing semantic segmentation method of the invention identifies three important problems in clothing semantic segmentation: the extremely deformable character of clothes, the modeling of the semantically consistent relationship between similar pictures, and computational accuracy. By seeking solutions along these directions, clothing semantic segmentation under insufficient data can be effectively addressed.
Second, the clothing semantic segmentation method is based on a deep convolutional neural network and establishes a semantic consistency model while retaining computational accuracy. The deep convolutional neural network expresses visual features well; in addition, the extraction of visual features and the learning of the corresponding structural model are unified in the same framework, which improves the final effect of the method.
Finally, the clothing semantic segmentation method of the invention proposes modeling the semantic consistency relationship of similar picture pairs with a convolutional neural network in order to predict clothing semantic segmentation, and extracts the feature information of clothes with deformable convolution in view of the deformability of clothes. The method can effectively mine the semantic consistency relationship of picture pairs with similar content and maintain the constraint on the semantically consistent structure in the semantic space.
The method can effectively improve the accuracy and efficiency of retrieval and analysis in clothing retrieval and clothing attribute analysis, and has good application value. For example, in clothing e-commerce retail, the method can quickly and accurately analyze the clothing regions and categories on a model, so that the clothing on the model can be categorized quickly, providing a basis for same-style retrieval in e-commerce retail.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a graph showing the effect of the experiment according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover the alternatives, modifications and equivalents which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
Referring to fig. 1, a human clothing segmentation method based on semantic consistency includes the following steps:
s1, acquiring an image data set for training human body clothes segmentation, and defining an algorithm target;
The image data set in this step includes single-frame images I_train and artificially labeled semantic segmentation maps P_train; the algorithm target is to predict the semantic segmentation map of the clothes in a single-frame image.
S2, for each single-frame image in the data set, retrieving its neighboring pictures in the semantic space and forming picture pairs;
the method comprises the following substeps:
s21, for each single-frame image I_train, extracting the human pose feature C_pose with the pre-trained human pose estimation model OpenPose and the image appearance feature C_appearance, and concatenating (i.e., directly splicing) C_pose and C_appearance to obtain the picture feature C_I corresponding to the single-frame image I_train;
And S22, calculating the pairwise similarity of the pictures in the image data set. The similarity α of any two pictures I_train and I'_train is calculated as follows:
wherein C_empty is the picture feature of an image of the same size as C_I with all values 0; C'_I is the picture feature of the single-frame image I'_train, obtained by the same method as for I_train; Euclidean() denotes the Euclidean distance;
s23, for each single-frame image I_train, computing and comparing the similarities, and retrieving the most similar image I'_train to obtain the picture pair (I_train, I'_train) and the corresponding similarity value α.
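The neighbor-retrieval procedure of S21-S23 can be sketched as follows. The patent's similarity formula did not survive extraction, so the normalization by the distance to the all-zero feature C_empty below is an assumption, as are all function names; the pose and appearance features are stand-ins for the OpenPose and appearance outputs described in S21.

```python
import numpy as np

def picture_feature(pose_feat, appearance_feat):
    """S21: concatenate pose and appearance features along the channel axis."""
    return np.concatenate([pose_feat, appearance_feat], axis=0)

def similarity(c_i, c_j):
    """S22: similarity alpha between two picture features.

    Assumed form: 1 minus the pair distance normalized by the distances of
    both features to the all-zero feature C_empty (the exact formula is not
    reproduced in the source).
    """
    c_empty = np.zeros_like(c_i)
    d_pair = np.linalg.norm(c_i - c_j)                       # Euclidean(C_I, C'_I)
    d_norm = np.linalg.norm(c_i - c_empty) + np.linalg.norm(c_j - c_empty)
    return float(1.0 - d_pair / d_norm) if d_norm > 0 else 1.0

def most_similar_pair(features, idx):
    """S23: for image idx, retrieve its nearest neighbor in the data set."""
    best_j, best_alpha = -1, -1.0
    for j, c_j in enumerate(features):
        if j == idx:
            continue  # do not pair an image with itself
        a = similarity(features[idx], c_j)
        if a > best_alpha:
            best_j, best_alpha = j, a
    return best_j, best_alpha
```

Under this assumed normalization, identical features give α = 1 and opposite features give α near 0, so α can double as the fusion weight used in S32.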
S3, carrying out joint modeling on the neighborhood relationship of each picture pair in the manifold space (namely the semantic space);
the method comprises the following substeps:
s31, using four layers of convolution and pooling operations, extracting features from I_train and I'_train in the picture pair (I_train, I'_train) respectively to obtain S_I and S'_I, namely:

S_I = f_single(I_train; θ)

S'_I = f_single(I'_train; θ)

wherein f_single() is the function constructed by the four-layer convolution and pooling operations, and θ is the convolution parameter;
s32, fusing the features S_I and S'_I obtained in S31 to obtain the fused feature S_interaction:

S_interaction = (1 - α) * S_I + α * S'_I

wherein α is the similarity value of this picture pair;
s33, using a three-layer convolution operation on the fused feature S_interaction to reconstruct the semantic information of the picture, obtaining a semantic segmentation map of one-eighth the size of image I_train; meanwhile, using four-layer convolution and upsampling operations on the single-image feature S_I to reconstruct semantic information, obtaining a semantic segmentation map of the same size as image I_train;
S34, the operations of S31-S33 are performed on all the picture pairs.
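The similarity-weighted fusion of S32 can be sketched directly from its equation. The surrounding convolutional encoder and decoders (f_single, the three-layer and four-layer reconstruction heads) are full networks and are not reproduced here; only the fusion step is shown.

```python
import numpy as np

def fuse_features(s_i, s_i_prime, alpha):
    """S32: S_interaction = (1 - alpha) * S_I + alpha * S'_I.

    With alpha = 0 (no similar neighbor) the fused feature falls back to the
    single-image feature S_I; the more similar the neighbor, the larger its
    contribution to the fused feature.
    """
    assert s_i.shape == s_i_prime.shape, "pair features must have the same shape"
    return (1.0 - alpha) * s_i + alpha * s_i_prime
```

In the patent, S_interaction then passes through a three-layer convolution to produce the one-eighth-scale segmentation map, while S_I alone is decoded to the full-scale map.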
S4, establishing a prediction model for clothing segmentation;
the method comprises the following substeps:
s41, establishing a deep convolutional neural network, wherein the input of the neural network is the picture pair (I_train, I'_train) and the output is the predicted semantic segmentation map for picture I_train; the structure of the neural network is represented as a mapping, formulated as:
wherein θ_1 denotes the convolution parameters used by the prediction model when predicting the semantic segmentation result, and f() is the prediction function of the deep convolutional neural network;
s42, the loss function of the neural network is as follows:
wherein P denotes the real semantic segmentation map corresponding to the original-scale prediction, i.e., the real semantic segmentation map of image I_train, and P_small denotes the real semantic segmentation map corresponding to the small-scale prediction, i.e., the real semantic segmentation map at one eighth of the size of image I_train; the first term of L is the loss error between the predicted semantic segmentation map at the original scale and its real semantic map (i.e., the labeled semantic segmentation map in S1); the second term is the loss error between the predicted semantic segmentation map at the small scale and its real semantic map, the small scale being one eighth of the original scale and matching the size of the small-scale prediction; λ is a weight parameter, here set to 0.125;
s43, training the whole neural network under the loss function L by using an Adam optimization method and a back propagation algorithm until the neural network converges.
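The two-scale training objective of S42 can be sketched as follows. The exact per-pixel loss is not reproduced in the source (the formula image is lost), so the sketch assumes a per-pixel cross-entropy at both scales, combined with the stated weight λ = 0.125.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy.

    logits: array of shape (C, H, W) of class scores; labels: (H, W) ints.
    """
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    h, w = labels.shape
    # pick the log-probability of the true class at every pixel
    picked = log_probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-picked.mean())

def total_loss(pred_full, label_full, pred_small, label_small, lam=0.125):
    """S42 (assumed form): L = loss(full scale) + lambda * loss(1/8 scale)."""
    return (pixel_cross_entropy(pred_full, label_full)
            + lam * pixel_cross_entropy(pred_small, label_small))
```

In S43 this L would be minimized with Adam and backpropagation until convergence; the optimizer loop itself is framework-specific and omitted.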
And S5, analyzing semantic information of clothes in the picture by using the prediction model.
The above-described method is applied to specific examples so that those skilled in the art can better understand the effects of the present invention.
Examples
The implementation method of this embodiment is as described above, and specific steps are not elaborated, and the effect is shown only for case data. The invention is implemented on three data sets with truth labels, which are respectively as follows:
Fashionista v0.2 dataset: contains 685 images with 56 classes of semantic labels.
Refined Fashionista dataset: contains 685 images with 25 classes of semantic labels.
CFPD dataset: contains 2682 images with 23 classes of semantic labels.
In this example, a picture is selected from each data set for the experiment; the closest picture is obtained by computing the similarity, the features of the two pictures are then extracted respectively, and the neighborhood relationship of this picture pair in the manifold space is jointly modeled to obtain the final semantic segmentation map, as shown in FIG. 2. In the figure, groundtruth denotes the real semantic segmentation map; the predicted semantic segmentation map obtained by the method is essentially consistent with the real one.
The detection precision of this embodiment is shown in the table below; the precision of the various methods is compared mainly with two indices, average Acc and IoU. The average Acc index is the accuracy of the classification result at each pixel and reflects the prediction quality well; IoU is the intersection-over-union of the areas of the predicted semantic region and the ground-truth region. As the table shows, the average Acc and IoU of the method have clear advantages over other traditional methods.
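The two evaluation indices can be sketched as follows; this is a generic implementation of per-pixel accuracy and class-averaged IoU, not the patent's evaluation code, and function names are illustrative.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Average Acc: fraction of pixels whose predicted class matches ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

For example, a prediction agreeing with the ground truth on 3 of 4 pixels yields an average Acc of 0.75, while the IoU penalizes each class's false positives and false negatives separately.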
In the above embodiment, the clothing semantic segmentation method of the invention first jointly models the neighborhood relationship of each pair of pictures with similar content in the manifold space. On this basis, the original problem is converted into an end-to-end structure learning problem, and a clothing semantic segmentation model is established based on a deep neural network. Finally, the trained clothing semantic segmentation model is used to predict the clothing semantic information of a new frame.
Through the above technical scheme, the embodiment of the invention develops a human body clothing segmentation method based on semantic consistency and on deep learning. The method exploits the neighborhood relationship of the semantic information of similar pictures in the manifold space and adopts deformable convolution to model the deformation characteristics of clothes, and thus adapts well to clothing semantic segmentation in different scenes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (4)
1. A human body clothing segmentation method based on semantic consistency is characterized by comprising the following steps:
s1, acquiring an image data set for training human body clothes segmentation, and defining an algorithm target;
s2, for each single-frame image in the data set, retrieving its neighboring pictures in the semantic space and forming picture pairs;
s3, carrying out joint modeling on the neighborhood relationship of each picture pair in the manifold space;
s4, establishing a prediction model for clothing segmentation;
s5, analyzing semantic information of clothes in the picture by using the prediction model;
the S3 includes the following substeps:
s31, using four layers of convolution and pooling operations, extracting features from I_train and I'_train in the picture pair respectively to obtain S_I and S'_I, namely:

S_I = f_single(I_train; θ)

S'_I = f_single(I'_train; θ)

wherein f_single() is the function constructed by the four-layer convolution and pooling operations, and θ is the convolution parameter;
s32, fusing the features S_I and S'_I obtained in S31 to obtain the fused feature S_interaction:

S_interaction = (1 - α) * S_I + α * S'_I
S33, using a three-layer convolution operation on the fused feature S_interaction to reconstruct the semantic information of the picture, obtaining a semantic segmentation map of one-eighth the size of image I_train; meanwhile, using four-layer convolution and upsampling operations on the single-image feature S_I to reconstruct semantic information, obtaining a semantic segmentation map of the same size as image I_train;
S34, the operations of S31-S33 are performed on all the picture pairs.
2. The human body clothing segmentation method based on semantic consistency as claimed in claim 1, wherein the image data set in S1 comprises single-frame images I_train and artificially labeled semantic segmentation maps P_train; the algorithm target is to predict the semantic segmentation map of the clothes in a single-frame image.
3. The human clothing segmentation method based on semantic consistency as claimed in claim 1, wherein the S2 comprises the following sub-steps:
s21, for each single-frame image I_train, extracting the human pose feature C_pose with the pre-trained human pose estimation model OpenPose and the image appearance feature C_appearance, and cascading C_pose and C_appearance to obtain the picture feature C_I corresponding to the single-frame image I_train;
S22, calculating the pairwise similarity of the pictures in the image data set, wherein the similarity α of any two pictures I_train and I'_train is calculated as follows:
wherein C_empty is the picture feature of an image of the same size as C_I with all values 0; C'_I is the picture feature of the single-frame image I'_train, obtained by the same method as for I_train; Euclidean() denotes the Euclidean distance;
s23, for each single-frame image I_train, computing and comparing the similarities, and retrieving the most similar picture I'_train to obtain the picture pair (I_train, I'_train) and the corresponding similarity value α.
4. The human clothing segmentation method based on semantic consistency as claimed in claim 1, wherein the S4 comprises the following sub-steps:
s41, establishing a deep convolutional neural network, wherein the input of the neural network is the picture pair (I_train, I'_train) and the output is the predicted semantic segmentation map for picture I_train; the structure of the neural network is represented as a mapping, formulated as:
wherein θ_1 denotes the convolution parameters used by the prediction model when predicting the semantic segmentation result, and f() is the prediction function of the deep convolutional neural network;
s42, the loss function of the neural network is as follows:
wherein P and P_small respectively denote the real semantic segmentation map at the original scale and the real semantic segmentation map at the small scale; the first term of L is the loss error between the predicted semantic segmentation map at the original scale and its real semantic map; the second term is the loss error between the predicted semantic segmentation map at the small scale and its real semantic map, the small scale being one eighth of the original scale; λ is a weight parameter;
s43, training the whole neural network under the loss function L by using an Adam optimization method and a back propagation algorithm until the neural network converges.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810631795.4A CN108985298B (en) | 2018-06-19 | 2018-06-19 | Human body clothing segmentation method based on semantic consistency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810631795.4A CN108985298B (en) | 2018-06-19 | 2018-06-19 | Human body clothing segmentation method based on semantic consistency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108985298A CN108985298A (en) | 2018-12-11 |
CN108985298B true CN108985298B (en) | 2022-02-18 |
Family
ID=64540714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810631795.4A Active CN108985298B (en) | 2018-06-19 | 2018-06-19 | Human body clothing segmentation method based on semantic consistency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108985298B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858539A (en) * | 2019-01-24 | 2019-06-07 | 武汉精立电子技术有限公司 | A kind of ROI region extracting method based on deep learning image, semantic parted pattern |
CN110807462B (en) * | 2019-09-11 | 2022-08-30 | 浙江大学 | Training method insensitive to context of semantic segmentation model |
CN111028249A (en) * | 2019-12-23 | 2020-04-17 | 杭州知衣科技有限公司 | Garment image segmentation method based on deep learning |
CN114092591B (en) * | 2022-01-20 | 2022-04-12 | 中国科学院自动化研究所 | Image generation method, image generation device, electronic equipment and storage medium |
CN116543147A (en) * | 2023-03-10 | 2023-08-04 | 武汉库柏特科技有限公司 | Carotid ultrasound image segmentation method, device, equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002075685A3 (en) * | 2001-03-15 | 2003-03-13 | Koninkl Philips Electronics Nv | Automatic system for monitoring persons entering and leaving a changing room |
GB2403363A (en) * | 2003-06-25 | 2004-12-29 | Hewlett Packard Development Co | Tags for automated image processing |
CN1920820A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Image meaning automatic marking method based on marking significance sequence |
US8173772B2 (en) * | 2005-12-30 | 2012-05-08 | Spiber Technologies Ab | Spider silk proteins and methods for producing spider silk proteins |
CN105261017A (en) * | 2015-10-14 | 2016-01-20 | 长春工业大学 | Method for extracting regions of interest of pedestrian by using image segmentation method on the basis of road restriction |
CN106327469A (en) * | 2015-06-29 | 2017-01-11 | 北京航空航天大学 | Video object segmentation method based on semantic label guidance |
CN107729804A (en) * | 2017-08-31 | 2018-02-23 | 广东数相智能科技有限公司 | A kind of people flow rate statistical method and device based on garment ornament |
-
2018
- 2018-06-19 CN CN201810631795.4A patent/CN108985298B/en active Active
Non-Patent Citations (4)
Title |
---|
Clothes Co-Parsing Via Joint Image Segmentation and Labeling With Application to Clothing Retrieval; Xiaodan Liang et al.; IEEE Transactions on Multimedia; 2016-06-30; Vol. 18, No. 6; 1175-1186 *
Retrieving Similar Styles to Parse Clothing; Kota Yamaguchi et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2015-05-31; Vol. 37, No. 5; 1028-1040 *
Where to Buy It: Matching Street Clothing Photos in Online Shops; M. Hadi Kiapour et al.; 2015 IEEE International Conference on Computer Vision; 2016-02-18; 1-9 *
A Survey of Visual SLAM Based on Deep Learning (基于深度学习的视觉SLAM综述); Zhao Yang et al.; Robot (机器人); 2017-11-15; Vol. 39, No. 6; 889-896 *
Also Published As
Publication number | Publication date |
---|---|
CN108985298A (en) | 2018-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111291212B (en) | Zero sample sketch image retrieval method and system based on graph convolution neural network | |
CN108985298B (en) | Human body clothing segmentation method based on semantic consistency | |
CN111858954B (en) | Task-oriented text-generated image network model | |
Yang et al. | Visual sentiment prediction based on automatic discovery of affective regions | |
Wang et al. | Towards unified depth and semantic prediction from a single image | |
Song et al. | Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network | |
CN106547880B (en) | Multi-dimensional geographic scene identification method fusing geographic area knowledge | |
JP6395158B2 (en) | How to semantically label acquired images of a scene | |
Häne et al. | Dense semantic 3d reconstruction | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
Fooladgar et al. | A survey on indoor RGB-D semantic segmentation: from hand-crafted features to deep convolutional neural networks | |
Wen et al. | CF-SIS: Semantic-instance segmentation of 3D point clouds by context fusion with self-attention | |
CN108564012B (en) | Pedestrian analysis method based on human body feature distribution | |
Cai et al. | A robust interclass and intraclass loss function for deep learning based tongue segmentation | |
Qian et al. | Learning pairwise inter-plane relations for piecewise planar reconstruction | |
Song et al. | Contextualized CNN for scene-aware depth estimation from single RGB image | |
Xu et al. | RGB-T salient object detection via CNN feature and result saliency map fusion | |
Wang et al. | Improving random walker segmentation using a nonlocal bipartite graph | |
Parente et al. | Integration of convolutional and adversarial networks into building design: A review | |
Zhang et al. | DHNet: Salient object detection with dynamic scale-aware learning and hard-sample refinement | |
Robert | The Role of Deep Learning in Computer Vision | |
Li et al. | [Retracted] Deep‐Learning‐Based 3D Reconstruction: A Review and Applications | |
CN110942463B (en) | Video target segmentation method based on generation countermeasure network | |
Liu et al. | Attention-embedding mesh saliency | |
Wang et al. | Interactive image segmentation based on label pair diffusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||