CN106919951B - Weak supervision bilinear deep learning method based on click and vision fusion


Info

Publication number
CN106919951B
CN106919951B (application CN201710059373.XA)
Authority
CN
China
Prior art keywords
click
sample
features
learning
formula
Prior art date
Legal status
Active
Application number
CN201710059373.XA
Other languages
Chinese (zh)
Other versions
CN106919951A (en)
Inventor
俞俊 (Jun Yu)
谭敏 (Min Tan)
郑光剑 (Guangjian Zheng)
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201710059373.XA priority Critical patent/CN106919951B/en
Publication of CN106919951A publication Critical patent/CN106919951A/en
Application granted granted Critical
Publication of CN106919951B publication Critical patent/CN106919951B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Abstract

The invention discloses a weakly supervised bilinear deep learning method based on click and vision fusion, comprising the following steps: 1. extracting the click feature formed by the texts of each image from the click data set, and constructing a new low-dimensional, compact click feature in a merged text space by merging texts with similar semantics; 2. constructing a depth model fusing click and visual features; 3. learning the network model parameters by back-propagation (BP); 4. calculating the model prediction loss of each training sample, constructing a similarity matrix of the sample set, learning sample reliability from the sample losses and the similarity matrix, and weighting the samples by their reliability; 5. repeating steps 3 and 4, iteratively optimizing the neural network model and the sample weights, thereby training the entire network model until convergence. The method fuses click data and visual features into a new bilinear convolutional neural network framework and can better identify fine-grained images.

Description

Weak supervision bilinear deep learning method based on click and vision fusion
Technical Field
The invention relates to a fine-grained image classification method, in particular to a weak supervision bilinear deep learning method based on click and vision fusion.
Background
Fine-grained visual classification (FGVC) is a research direction that forms a sub-problem of object recognition. It distinguishes different subclasses within the same category of objects; the related objects are extremely similar in overall appearance, and certain prior knowledge is needed to tell them apart. This is difficult for inexperienced people, and making a computer classify such objects automatically is even more challenging.
In fine-grained image recognition research, Tsung-Yu Lin et al. of the University of Massachusetts proposed the bilinear convolutional neural network model (BCNN) and found that it achieves very good results on fine-grained recognition tasks. Built on recent advances in deep learning, the model consists of two different CNN network frameworks: two features with different expressive properties are obtained by applying different convolutions to one image, and combining them with an outer product yields a feature vector with stronger representational power, realizing a better recognition effect on fine-grained images.
Although BCNN has proven to be a very effective model for fine-grained image recognition, it still falls short in exploiting the semantic information of images, so designing an effective semantic feature is pressing. Many researchers try to compensate by manually labeling attributes, but the excessive labor cost makes this approach unpromising. To address the problem, Microsoft released a new large-scale click data set, Clickture, drawn from the logs of a commercial search engine. It consists of three parts: the text query, the clicked pictures, and the corresponding click counts. Together these express the correlation between a user's query text and a picture, with the click count quantifying the degree of correlation. With the help of such click data, an image can treat each query text as an attribute and obtain a feature tied to semantic information, the click count supplying the value of each corresponding dimension (i.e., attribute).
As data collected from the Internet, a click data set has the advantages of large volume, low labor cost, and good ability to express semantic information. Taking the visual features extracted by BCNN as the main body and complementing them with the semantic features brought by click data is a feasible way to promote fine-grained image classification, and is worth studying. Moreover, click data is a hot direction in current research, and its reasonable use gives the invention a certain frontier and innovative character.
Disclosure of Invention
The invention provides a weakly supervised bilinear deep learning method based on click and vision fusion, which fuses click data and visual features to construct a new bilinear convolutional neural network framework and can better identify fine-grained images.
A weak supervision bilinear deep learning method based on click and vision fusion comprises the following steps:
step (1), click data preprocessing:
extracting click features formed by texts of each image from the click data set, and constructing new low-dimensional compact click features in a merged text space by merging texts with similar semantics;
step (2), constructing a depth model fused with clicking and visual features:
weighting the samples based on reliability, and constructing a weighted three-channel deep neural network model in which two channels extract visual features of the image and the third channel processes the click features from step (1); the visual and click features are fused through a feature connection layer;
step (3), BP learning network model parameters:
training the network model parameters of the neural network in step (2) through a back-propagation algorithm until the whole network model converges;
Step (4), learning sample reliability:
calculating the model prediction loss of each training sample according to the neural network model in the step (2), constructing a similarity matrix of the sample set, learning the reliability of the samples by using the sample loss and the similarity matrix, and weighting the samples by using the reliability;
step (5), model training:
repeating steps 3 and 4, iteratively optimizing the neural network model and the sample weights, thereby training the entire network model until convergence.
In step (1), the click features corresponding to the images are extracted from the click data set and clustered and merged according to meaning, specifically as follows:
1-1. Extract the texts corresponding to image i from the click data set to form the click feature $\tilde{u}_i$:

$$\tilde{u}_i = (c_{i,1}, c_{i,2}, \ldots, c_{i,m}) \qquad \text{(formula 1)}$$

where $c_{i,j}$ is the click count corresponding to image i and text j, and m is the number of query texts.
1-2. To obtain short, compact feature vectors, reduce the dimension of the click features (and hence the computation), and resolve semantically duplicated texts, the texts are indirectly clustered with the K-means method to obtain a text-cluster index G, and the click counts of texts in the same class are added to obtain the new click feature $u_i$:

$$u_{i,j} = \sum_{t \in G_j} c_{i,t} \qquad \text{(formula 2)}$$

where $G_j$ represents the j-th text class.
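As an illustration of steps 1-1 and 1-2, the following Python sketch builds the raw click feature of formula 1 from a toy click log and merges it with K-means as in formula 2. The triplet log, the bag-of-words text vectors, and the cluster count are illustrative assumptions; the patent does not fix the text representation used for clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical (image id, query text, click count) triplets in Clickture style
click_log = [
    (0, "husky dog", 12), (0, "siberian husky", 7),
    (1, "poodle puppy", 3), (1, "toy poodle", 9),
]
texts = sorted({t for _, t, _ in click_log})
n_images = 2
text_idx = {t: j for j, t in enumerate(texts)}

# Formula 1: raw click feature, one dimension per query text
u_raw = np.zeros((n_images, len(texts)))
for i, t, c in click_log:
    u_raw[i, text_idx[t]] = c

# Cluster semantically similar texts, then add the click counts inside
# each cluster (formula 2) to get the compact feature u_i
k = 2  # number of merged text classes (4318 in the embodiment below)
text_vecs = CountVectorizer().fit_transform(texts).toarray()
G = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(text_vecs)

u = np.zeros((n_images, k))
for j in range(len(texts)):
    u[:, G[j]] += u_raw[:, j]
print(u)  # merged, low-dimensional click features
```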
In step (2), a depth model fusing the click and visual features is constructed, connecting the two kinds of features together, specifically as follows:
2-1. Construct a three-channel network frame structure W-C-BCNN. The first two channels use a bilinear convolutional neural network to extract the visual feature $z_i$ of an image, and the third channel extracts the click feature $u_i$ of the corresponding image obtained in step (1). The extracted visual and click features are then spliced together through a connection layer, outputting a feature $o_i$ with both visual and semantic expressive power:

$$o_i = (z_i, \mu u_i) = (z_{i,1}, z_{i,2}, \ldots, \mu u_{i,1}, \mu u_{i,2}, \ldots) \qquad \text{(formula 3)}$$

where $\mu$ represents a weight parameter.
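The connection layer of formula 3 is a plain concatenation of the bilinear visual feature with the $\mu$-scaled click feature. A minimal PyTorch sketch, assuming precomputed features of the dimensions used later in the embodiment (512 x 512 visual, 4318 click):

```python
import torch

def fuse(z, u, mu=1.0):
    """Formula 3: o_i = (z_i, mu * u_i), a plain feature concatenation."""
    return torch.cat([z, mu * u], dim=1)

z = torch.randn(8, 512 * 512)  # bilinear visual features z_i (step 2-1)
u = torch.randn(8, 4318)       # merged click features u_i (step 1)
o = fuse(z, u, mu=1.0)         # mu = 1 as in the embodiment below
print(o.shape)                 # torch.Size([8, 266462])
```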
2-2. Given n training data $(I_i, y_i)$, where $y_i \in \{1, 2, \ldots, N\}$ is the class label of each datum, the network model parameters $\theta$ and the sample reliability variable $w^*$ are obtained by solving the weakly supervised bilinear deep learning problem, so that the whole network model is trained until convergence:

$$(\theta, w^*) = \arg\min_{\theta, w} \sum_{i=1}^{n} w_i\, \ell(y_i, o_i; \theta) + \alpha P(w) + \beta S(G, w) \qquad \text{(formula 4)}$$
where the weight $w^*$ represents the reliability of the training samples obtained after optimization and w represents the weights before optimization. In particular, when the weights are fixed at 1 the network framework is called C-BCNN; since the weights are otherwise obtained by learning under continual iterative optimization, the task is called a weakly supervised learning problem. P(w) is a weighted prior term, modeled and estimated from the click counts of the click data:

$$P(w) = \lVert w - T(w^{c}) \rVert^{2} \qquad \text{(formula 5)}$$
where $w^c$ is the normalized click vector, and $T(\cdot)$ is a logarithmic scale-transformation function that controls the range of $w^c$ and handles imbalanced picture click counts. S(G, w) is a smoothing term based on the assumption of visual consistency of images (i.e., weights are close when visual features are close), which regularizes the weights:
$$S(G, w) = \sum_{i,j} g_{i,j} (w_i - w_j)^2 \qquad \text{(formula 6)}$$
where $g_{i,j}$ are the entries of the sample similarity matrix G, a graph computed and constructed from the similarity of the deep visual features z.
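Putting formulas 4-6 together, the following sketch evaluates the weakly supervised objective for a fixed $\theta$. The log transform chosen for $T(\cdot)$ and the constant factors are assumptions; only the structure (weighted loss plus click prior plus graph smoothing) is taken from the text.

```python
import numpy as np

def objective(w, losses, wc, G, alpha=0.1, beta=1.0):
    """Formula 4 for fixed theta: weighted loss + alpha*P(w) + beta*S(G, w)."""
    T = np.log1p                                   # assumed form of T(.)
    prior = np.sum((w - T(wc)) ** 2)               # formula 5: P(w)
    n = len(w)
    smooth = sum(G[i, j] * (w[i] - w[j]) ** 2      # formula 6: S(G, w)
                 for i in range(n) for j in range(n))
    return np.dot(w, losses) + alpha * prior + beta * smooth

rng = np.random.default_rng(0)
n = 4
G = rng.random((n, n)); G = (G + G.T) / 2          # toy similarity matrix
print(objective(np.ones(n), rng.random(n), rng.random(n), G))
```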
In step (3), the network model parameters are trained with the back-propagation algorithm until convergence, specifically as follows:
3-1. The model parameter $\theta$ is obtained by training with the back-propagation algorithm. Let $\frac{d\ell}{dx}$ be the gradient of the loss function with respect to the bilinear output x; the back-propagation formulas for the two deep networks A and B then follow from the chain rule:

$$\frac{d\ell}{dA} = B\left(\frac{d\ell}{dx}\right)^{T}, \qquad \frac{d\ell}{dB} = A\,\frac{d\ell}{dx} \qquad \text{(formula 7)}$$

where A and B are the feature outputs of the two convolutional channels and $x = A^{T}B$ is their bilinear combination.
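A numerical sketch of formula 7 for an outer-product bilinear layer $x = A^{T}B$, following the BCNN convention of Lin et al. (rows are spatial locations, columns are channels); the shapes are illustrative:

```python
import numpy as np

s, cA, cB = 49, 512, 512        # spatial locations, channels of nets A and B
A = np.random.randn(s, cA)      # feature output of deep net A
B = np.random.randn(s, cB)      # feature output of deep net B
x = A.T @ B                     # bilinear combination, cA x cB

dl_dx = np.random.randn(cA, cB) # gradient of the loss w.r.t. x
dl_dA = B @ dl_dx.T             # formula 7: dl/dA
dl_dB = A @ dl_dx               # formula 7: dl/dB
print(dl_dA.shape, dl_dB.shape) # (49, 512) (49, 512)
```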
and (4) learning the reliability variable w of the sample by using the sample loss and the similarity matrix*The method comprises the following steps:
4-1, extracting the softmax loss value of any training sample i in the network constructed based on the step (2) through inputting the data into the network for calculation
Figure GDA0002007514340000054
4-2. With $\theta$ fixed, formula 4 is converted into the following quadratic programming problem, from which the sample reliability parameters are learned:

$$\min_{w}\; w^{T}\left(\alpha E + \beta L_{lap}\right) w + \left(l - 2\alpha\, T(w^{c})\right)^{T} w \qquad \text{(formula 8)}$$

where l is the vector of per-sample losses.
where I denotes the all-ones (unit) vector, E the identity matrix, and $L_{lap}$ the Laplacian matrix of G:

$$L_{lap} = \mathrm{diag}(G\,I) - G \qquad \text{(formula 9)}$$
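Under the formula 8/9 reconstruction above, the reliability update has a simple closed form when the quadratic term is positive definite. The sketch below assumes $T(\cdot) = \log(1 + \cdot)$, drops any unspecified constraints apart from non-negativity, and absorbs constant factors into $\alpha$ and $\beta$:

```python
import numpy as np

def solve_w(losses, wc, G, alpha=0.1, beta=1.0):
    """Step 4-2 under the formula 8/9 reconstruction: minimize the quadratic
    w^T (alpha*E + beta*L_lap) w + (l - 2*alpha*T(wc))^T w in closed form."""
    n = len(losses)
    L_lap = np.diag(G @ np.ones(n)) - G        # formula 9: L_lap = diag(G*I) - G
    H = alpha * np.eye(n) + beta * L_lap       # positive definite for alpha > 0
    b = losses - 2 * alpha * np.log1p(wc)      # linear term, T(.) = log1p assumed
    w = np.linalg.solve(2 * H, -b)             # stationary point of the quadratic
    return np.clip(w, 0.0, None)               # keep reliabilities non-negative

rng = np.random.default_rng(0)
G = rng.random((5, 5)); G = (G + G.T) / 2
print(solve_w(rng.random(5), rng.random(5), G).round(3))
```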
iteratively optimizing the model parameters and the sample weights until convergence in the step (5), wherein the specific process is as follows:
and 5-1, according to the weak supervised learning problem, iteratively optimizing the steps 3 and 4 in two steps in a variable control mode, so as to train the whole network model until convergence: 1) each weight w is fixediLearning by solving the problem of W-C-BCNN to obtain a network model parameter theta; 2) fixing each theta, converting the formula 4 into quadratic programming, and learning to obtain a sample reliability variable w*
The invention has the beneficial effects that:
the method integrates click data and visual features to construct a bilinear convolutional neural network framework, improves the defect that the conventional single visual feature is used for identifying the image, not only can obtain the feature with more representation capability by simultaneously capturing visual and semantic information of the image, but also can automatically weight training data based on the reliability of a data sample, and improves the effect of fine-grained image identification; in addition, the click data is taken as a current research hotspot, and the reasonable use also enables the invention to have more advanced and innovative scientific research.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Fig. 2 is a schematic diagram of a network framework constructed in the method of the present invention.
FIG. 3 is a schematic diagram of network model training for the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the weakly supervised bilinear deep learning method based on click and vision fusion specifically includes the following steps:
extracting click features corresponding to the images from the click data set and clustering and combining the click features according to meanings in the step (1), wherein the specific steps are as follows:
1-1. To meet the experimental needs, we separated all dog-related samples from the click data set Clickture released by Microsoft, forming a new data set Clickture-Dog. The data set has 344 classes of dog pictures; classes with fewer than 5 pictures are filtered out, leaving 283 classes. The data set is then split into training, validation, and test sets at a 5:3:2 ratio. To mitigate the imbalance in the number of pictures per class during training, for classes with more than 300 pictures only 300 are randomly selected for training. A sketch of this preparation follows.
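The sketch below drops classes with fewer than 5 images, splits 5:3:2, and caps each training class at 300 images; the (image path, label) input format is a hypothetical layout:

```python
import random
from collections import defaultdict

def prepare(samples, seed=0):
    """samples: list of (image_path, class_label) pairs (hypothetical layout)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    train, val, test = [], [], []
    for label, paths in by_class.items():
        if len(paths) < 5:                  # filter rare classes (344 -> 283)
            continue
        rng.shuffle(paths)
        a, b = int(0.5 * len(paths)), int(0.8 * len(paths))  # 5:3:2 split
        train += [(p, label) for p in paths[:a][:300]]  # cap at 300 per class
        val += [(p, label) for p in paths[a:b]]
        test += [(p, label) for p in paths[b:]]
    return train, val, test
```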
1-2. The texts corresponding to image i are extracted from Clickture-Dog to form the click feature $\tilde{u}_i$ as in formula 1; its length is 480,000.
1-3. To obtain short, compact feature vectors, reduce the dimension of the click features (and the computation), and resolve semantically duplicated texts, the texts are indirectly clustered with K-means to obtain the text-cluster index G, and the click counts of same-class texts are added to give the new click feature as in formula 2; the final click feature has 4318 dimensions.
In step (2), a depth model fusing the click and visual features is constructed, connecting the two kinds of features together, specifically as follows:
2-1. Construct the three-channel network frame structure W-C-BCNN shown in figure 2. The first two channels use a bilinear convolutional neural network to extract the visual feature $z_i$ of an image; they adopt VGG-M and VGG-16 respectively, yielding a visual feature of 512 x 512 dimensions. The third channel extracts the click feature $u_i$ of the corresponding image obtained in step (1). The extracted visual and click features are then spliced through the connection layer of formula 3, with $\mu$ set to 1; a dropout layer with parameter 0.1 (i.e., a keep ratio of 0.1) is added after the feature connection layer.
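A minimal PyTorch sketch of this three-channel structure is given below. VGG-M is not available in torchvision, so both visual channels use off-the-shelf VGG-16 feature extractors here, purely to keep the sketch short; the spatial normalization of the bilinear feature is likewise an assumed detail:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class WCBCNN(nn.Module):
    def __init__(self, n_classes, click_dim=4318, mu=1.0, dropout=0.1):
        super().__init__()
        self.cnn_a = vgg16().features  # channel 1 (VGG-M in the patent)
        self.cnn_b = vgg16().features  # channel 2 (VGG-16 in the patent)
        self.mu = mu
        self.drop = nn.Dropout(dropout)            # dropout parameter 0.1
        self.fc = nn.Linear(512 * 512 + click_dim, n_classes)

    def forward(self, img, u):
        fa = self.cnn_a(img).flatten(2)            # (B, 512, HW)
        fb = self.cnn_b(img).flatten(2)            # (B, 512, HW)
        z = torch.bmm(fa, fb.transpose(1, 2))      # bilinear pooling (B, 512, 512)
        z = z.flatten(1) / fa.shape[2]             # assumed spatial normalization
        o = torch.cat([z, self.mu * u], dim=1)     # formula 3 connection layer
        return self.fc(self.drop(o))

model = WCBCNN(n_classes=283)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 4318))
print(logits.shape)  # torch.Size([2, 283])
```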
2-2. For given n training data $(I_i, y_i)$, where $y_i \in \{1, 2, \ldots, N\}$ is the class label of each datum, the network model parameters $\theta$ and the sample reliability variable $w^*$ are obtained by solving the weakly supervised learning problem of formula 4. When the weight $w^*$ is fixed at 1 throughout, the experiments give the performance of C-BCNN; when the weight $w^*$ is initialized to 1 and then learned under continual iterative optimization, the experiments give the performance of W-C-BCNN.
2-3. For α and β in formula 4 we try a grid of values, α ∈ {0.01, 0.1, 1, 10} and β ∈ {0.001, 0.01, 0.1, 1, 10}; the experiments show that the best-performing combination is α = 0.1, β = 1.
2-4. The similarity matrix G in formula 6 is computed and constructed from the similarity of the deep visual features z, which are extracted by a VGG network.
In step (3), the network model parameters are trained with the back-propagation algorithm until convergence, specifically as follows:
3-1. As shown in fig. 3, the model parameter $\theta$ is obtained by back-propagation training, taking $\frac{d\ell}{dx}$ as the gradient of the loss function with respect to the bilinear output; the back-propagation formulas for the two deep networks A and B follow from the chain rule as in formula 7.
In step (4), the sample reliability variable $w^*$ is learned by using the sample losses and the similarity matrix, as follows:

4-1. Feed the data through the network constructed in step (2) and extract the softmax loss value $\ell(y_i, o_i)$ of each training sample i.
4-2. With $\theta$ fixed, formula 4 is converted into the quadratic programming problem of formula 8 and the sample reliability parameters are learned; the matrix G appearing in formula 9 is obtained by the computation of formula 6.
In step (5), the model parameters and the sample weights are iteratively optimized until convergence, with the following specific process:
5-1. Following the weakly supervised learning problem, steps 3 and 4 are optimized alternately, holding one set of variables fixed at a time, so as to train the whole network model until convergence: 1) with each weight $w_i$ fixed, the network model parameter $\theta$ is learned by solving the W-C-BCNN problem; 2) with $\theta$ fixed, formula 4 is converted into the quadratic program and the sample reliability variable $w^*$ is learned.
5-2. Testing the network model: for the learned weight vector, a threshold (2 in the experiment) is set to control its range, and the weight mass exceeding the threshold is evenly redistributed to the corresponding terms. We compare the effect achieved by this method with other methods in table 2. In addition, to improve computational efficiency, max-pooling is used to shorten the visual features to 4096 dimensions, and recognition accuracies are then compared uniformly under this standard.
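The max-pooling compression mentioned above can be done as follows; the grouping of the 512 x 512 = 262144 bilinear dimensions into 4096 bins of 64 consecutive values is an assumption, since the text does not specify the pooling layout:

```python
import numpy as np

z = np.random.randn(512 * 512)            # bilinear visual feature (262144 dims)
z4096 = z.reshape(4096, 64).max(axis=1)   # max over 64 consecutive values per bin
print(z4096.shape)                        # (4096,)
```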
Table 1 compares the recognition accuracy of C-BCNN with BCNN and the improvement ratio.

Model          BCNN    C-BCNN   Ratio
Accuracy (%)   33.20   50.80    53%
Table 2 compares the recognition accuracy of C-BCNN and W-C-BCNN under different treatments of the weights, where W-C-BCNN(T) controls the range of the weight vector and W-C-BCNN does not.

Method         C-BCNN   W-C-BCNN   W-C-BCNN(T)
Accuracy (%)   47.10    48.90      48.90

Claims (4)

1. A weak supervision bilinear deep learning method based on click and vision fusion is characterized by comprising the following steps:
step (1), click data preprocessing:
extracting click features formed by texts of each image from the click data set, and constructing new low-dimensional compact click features in a merged text space by merging texts with similar semantics;
step (2), constructing a depth model fused with clicking and visual features:
weighting the sample based on reliability, and constructing a weighted three-channel deep neural network model, wherein two channels extract image visual features, and the third channel processes the click features in the step (1); fusing the visual and click characteristics through a characteristic connection layer;
step (3), BP learning model parameters:
training the network model parameters of the neural network in the step (2) through a back propagation algorithm until the whole network model converges;
step (4), learning sample reliability:
calculating the model prediction loss of each training sample according to the neural network model in the step (2), constructing a similarity matrix of the sample set, learning the reliability of the samples by using the sample loss and the similarity matrix, and weighting the samples by using the reliability;
step (5), model training:
repeating steps (3) and (4), iteratively optimizing the neural network model and the sample weights, and thus training the whole network model until convergence;
in step (1), the click features corresponding to the images are extracted from the click data set and clustered and merged according to meaning, specifically as follows:
1-1. extracting the texts corresponding to image i from the click data set to form the click feature $\tilde{u}_i$:

$$\tilde{u}_i = (c_{i,1}, c_{i,2}, \ldots, c_{i,m}) \qquad \text{(formula 1)}$$

where $c_{i,j}$ is the click count corresponding to image i and text j;
1-2. in order to obtain short, compact feature vectors, reduce the dimension of the click features so as to reduce the computation, and resolve repeated text semantics, the texts are indirectly clustered by the K-means clustering method to obtain the text-cluster index G, and the click counts of texts in the same class are added to obtain the new click feature $u_i$:

$$u_{i,j} = \sum_{t \in G_j} c_{i,t} \qquad \text{(formula 2)}$$

where $G_j$ represents the j-th text class;
in step (2), a depth model fusing the click and visual features is constructed, connecting the two kinds of features together, specifically as follows:
2-1. constructing a three-channel network frame structure W-C-BCNN, wherein the first two channels adopt a bilinear convolutional neural network to extract the visual feature $z_i$ of an image, and the third channel extracts the click feature $u_i$ of the corresponding image obtained in step (1); the extracted visual and click features are then spliced together through a connection layer, outputting a feature $o_i$ with both visual and semantic expressive power:

$$o_i = (z_i, \mu u_i) = (z_{i,1}, z_{i,2}, \ldots, \mu u_{i,1}, \mu u_{i,2}, \ldots) \qquad \text{(formula 3)}$$

where $\mu$ represents a weight parameter;
2-2. given n training data $(I_i, y_i)$, where $y_i \in \{1, 2, \ldots, N\}$ is the class label of each datum, the network model parameters $\theta$ and the sample reliability variable $w^*$ are obtained by solving the weakly supervised bilinear deep learning problem, so that the whole network model is trained until convergence:

$$(\theta, w^*) = \arg\min_{\theta, w} \sum_{i=1}^{n} w_i\, \ell(y_i, o_i; \theta) + \alpha P(w) + \beta S(G, w) \qquad \text{(formula 4)}$$
wherein the weight $w^*$ represents the reliability of the training samples obtained after optimization and w represents the weights before optimization; in particular, when the weights are fixed at 1 the network framework is called C-BCNN, and since the weights are otherwise obtained by learning under continual iterative optimization, the task is called a weakly supervised learning problem; P(w) is a weighted prior term, modeled and estimated from the click counts of the click data:

$$P(w) = \lVert w - T(w^{c}) \rVert^{2} \qquad \text{(formula 5)}$$
where $w^c$ is the normalized click vector, and $T(\cdot)$ is a logarithmic scale-transformation function that controls the range of $w^c$ and handles imbalanced picture click counts; S(G, w) is a smoothing term based on the assumption of visual consistency of the image, which regularizes the weights:

$$S(G, w) = \sum_{i,j} g_{i,j} (w_i - w_j)^2 \qquad \text{(formula 6)}$$
where $g_{i,j}$ are the entries of the sample similarity matrix G, which is computed and constructed from the similarities of the deep visual features z.
2. The weakly supervised bilinear deep learning method based on click and vision fusion as claimed in claim 1, wherein the network model parameters are trained by using a back propagation algorithm until convergence in step (3), specifically as follows:
3-1. the model parameter $\theta$ is obtained by training with the back-propagation algorithm; taking $\frac{d\ell}{dx}$ as the gradient of the loss function with respect to the bilinear output x, the back-propagation formulas for the two deep networks A and B follow from the chain rule:

$$\frac{d\ell}{dA} = B\left(\frac{d\ell}{dx}\right)^{T}, \qquad \frac{d\ell}{dB} = A\,\frac{d\ell}{dx} \qquad \text{(formula 7)}$$

where A and B are the feature outputs of the two convolutional channels and $x = A^{T}B$ is their bilinear combination.
3. The weakly supervised bilinear deep learning method based on click and vision fusion as claimed in claim 2, wherein in step (4) the sample reliability variable $w^*$ is learned by using the sample losses and the similarity matrix, as follows:

4-1. the softmax loss value $\ell(y_i, o_i)$ of any training sample i is extracted by feeding the data through the network constructed in step (2);
4-2. with $\theta$ fixed, formula 4 is converted into the following quadratic programming problem, from which the sample reliability parameters are learned:

$$\min_{w}\; w^{T}\left(\alpha E + \beta L_{lap}\right) w + \left(l - 2\alpha\, T(w^{c})\right)^{T} w \qquad \text{(formula 8)}$$
wherein I denotes the all-ones (unit) vector, E the identity matrix, and $L_{lap}$ the Laplacian matrix of G:

$$L_{lap} = \mathrm{diag}(G\,I) - G \qquad \text{(formula 9)}$$
4. The weakly supervised bilinear deep learning method based on click and vision fusion as claimed in claim 3, wherein in step (5) the model parameters and the sample weights are iteratively optimized until convergence, with the following specific process:
5-1. following the weakly supervised learning problem, steps 3 and 4 are optimized alternately, holding one set of variables fixed at a time, so as to train the whole network model until convergence: 1) with each weight $w_i$ fixed, the network model parameter $\theta$ is learned by solving the W-C-BCNN problem; 2) with $\theta$ fixed, formula 4 is converted into the quadratic program and the sample reliability variable $w^*$ is learned.
CN201710059373.XA 2017-01-24 2017-01-24 Weak supervision bilinear deep learning method based on click and vision fusion Active CN106919951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710059373.XA CN106919951B (en) 2017-01-24 2017-01-24 Weak supervision bilinear deep learning method based on click and vision fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710059373.XA CN106919951B (en) 2017-01-24 2017-01-24 Weak supervision bilinear deep learning method based on click and vision fusion

Publications (2)

Publication Number Publication Date
CN106919951A CN106919951A (en) 2017-07-04
CN106919951B true CN106919951B (en) 2020-04-21

Family

ID=59453478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710059373.XA Active CN106919951B (en) 2017-01-24 2017-01-24 Weak supervision bilinear deep learning method based on click and vision fusion

Country Status (1)

Country Link
CN (1) CN106919951B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506426A (en) * 2017-08-18 2017-12-22 四川长虹电器股份有限公司 A kind of implementation method of intelligent television automated intelligent response robot
CN107766794B (en) * 2017-09-22 2021-05-14 天津大学 Image semantic segmentation method with learnable feature fusion coefficient
CN108197561B (en) * 2017-12-29 2020-11-03 智慧眼科技股份有限公司 Face recognition model optimization control method, device, equipment and storage medium
CN108647691B (en) * 2018-03-12 2020-07-17 杭州电子科技大学 Image classification method based on click feature prediction
CN109002845B (en) * 2018-06-29 2021-04-20 西安交通大学 Fine-grained image classification method based on deep convolutional neural network
CN109447098B (en) * 2018-08-27 2022-03-18 西北大学 Image clustering algorithm based on deep semantic embedding
CN109086753B (en) * 2018-10-08 2022-05-10 新疆大学 Traffic sign identification method and device based on two-channel convolutional neural network
CN109582782A (en) * 2018-10-26 2019-04-05 杭州电子科技大学 A kind of Text Clustering Method based on Weakly supervised deep learning
CN109685115B (en) * 2018-11-30 2022-10-14 西北大学 Fine-grained conceptual model with bilinear feature fusion and learning method
CN109583507B (en) * 2018-12-07 2023-04-18 浙江工商大学 Pig body identification method based on deep convolutional neural network
CN109815973A (en) * 2018-12-07 2019-05-28 天津大学 A kind of deep learning method suitable for the identification of fish fine granularity
CN109933788B (en) * 2019-02-14 2023-05-23 北京百度网讯科技有限公司 Type determining method, device, equipment and medium
CN109886345B (en) * 2019-02-27 2020-11-13 清华大学 Self-supervision learning model training method and device based on relational reasoning
CN110245662B (en) * 2019-06-18 2021-08-10 腾讯科技(深圳)有限公司 Detection model training method and device, computer equipment and storage medium
CN113096023B (en) * 2020-01-08 2023-10-27 字节跳动有限公司 Training method, image processing method and device for neural network and storage medium
CN111598155A (en) * 2020-05-13 2020-08-28 北京工业大学 Fine-grained image weak supervision target positioning method based on deep learning


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002007854A (en) * 2000-06-21 2002-01-11 Nippon Telegr & Teleph Corp <Ntt> Method of displaying advertisement, and advertisement system
CN102880729A (en) * 2012-11-02 2013-01-16 深圳市宜搜科技发展有限公司 Figure image retrieval method and device based on human face detection and recognition
CN104317827A (en) * 2014-10-09 2015-01-28 深圳码隆科技有限公司 Picture navigation method of commodity
CN105653701A (en) * 2015-12-31 2016-06-08 百度在线网络技术(北京)有限公司 Model generating method and device as well as word weighting method and device

Also Published As

Publication number Publication date
CN106919951A (en) 2017-07-04

Similar Documents

Publication Publication Date Title
CN106919951B (en) Weak supervision bilinear deep learning method based on click and vision fusion
Yang et al. Visual sentiment prediction based on automatic discovery of affective regions
Cetinic et al. A deep learning perspective on beauty, sentiment, and remembrance of art
Wang et al. A multi-scene deep learning model for image aesthetic evaluation
Chen et al. A deep learning framework for time series classification using Relative Position Matrix and Convolutional Neural Network
Yang et al. Deep relative attributes
Kao et al. Visual aesthetic quality assessment with a regression model
WO2017113232A1 (en) Product classification method and apparatus based on deep learning
Mittal et al. Image sentiment analysis using deep learning
CN110363253A (en) A kind of Surfaces of Hot Rolled Strip defect classification method based on convolutional neural networks
CN109492750B (en) Zero sample image classification method based on convolutional neural network and factor space
CN108846047A (en) A kind of picture retrieval method and system based on convolution feature
Tian et al. Diagnosis of typical apple diseases: a deep learning method based on multi-scale dense classification network
Zhang et al. Structured weak semantic space construction for visual categorization
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN109815920A (en) Gesture identification method based on convolutional neural networks and confrontation convolutional neural networks
Gehlot et al. Ednfc-net: Convolutional neural network with nested feature concatenation for nuclei-instance segmentation
CN107491782A (en) Utilize the image classification method for a small amount of training data of semantic space information
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
Menaka et al. Chromenet: A CNN architecture with comparison of optimizers for classification of human chromosome images
Alamsyah et al. Object detection using convolutional neural network to identify popular fashion product
Huang et al. Fine-art painting classification via two-channel deep residual network
Xu et al. Weakly supervised facial expression recognition via transferred DAL-CNN and active incremental learning
Zhong et al. An emotion classification algorithm based on SPT-CapsNet
Chen et al. Bottom-up improved multistage temporal convolutional network for action segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant