CN110516098A - Image labeling method based on convolutional neural networks and binary coding feature - Google Patents
- Publication number
- CN110516098A (application CN201910791484.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- tag
- network model
- label
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Probability & Statistics with Applications (AREA)
- Library & Information Science (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image labeling method based on convolutional neural networks and binary coding features, comprising the following steps: constructing an Inception V3 base network model; truncating the Inception V3 base network model at its last pooling layer, removing its Logits and softmax function, and using a sigmoid function as the activation function of the last layer to obtain a modified first base network model; adding two fully connected layers to the first base network model, again with a sigmoid function as the activation of the last layer, to obtain a multi-label classification network model; training the multi-label classification network model on the training set to optimize its weights; labeling the feature vector set of a target image with the trained multi-label classification network model to obtain the multi-label probability output of the target image; and, combining the multi-label probability output, annotating the target image with the TagProp algorithm. The method achieves multi-label annotation of images at low cost and with high efficiency.
Description
Technical field
The present invention relates to the field of visual image technology, and in particular to an image labeling method based on convolutional neural networks and binary coding features.
Background technique
Efficient image annotation is essential for the effective management and retrieval of large-scale image collections. The goal of image annotation is to assign a group of relevant descriptive labels to an image. Traditional image annotation algorithms require a significant amount of time for manual extraction of image features, often without achieving good results, which motivates applying deep learning to image annotation. Deep learning can capture higher-level semantic features of an image, reducing the gap to high-level semantic concepts such as labels. Deep-learning-based automatic image annotation needs no manual feature extraction, so the annotation algorithm is no longer limited by the choice of feature extraction method, and its end-to-end training greatly improves annotation efficiency. Convolutional neural networks, a popular deep learning model, are well suited to high-dimensional data and classify well; using them for image annotation yields better annotation results.
Traditional image annotation work is based on single-label classification, assuming each image is associated with a single label. In practice, however, an image is often associated with multiple labels, and a single label cannot fully describe a whole image. Current convolutional neural network models are likewise built for single-label image classification tasks: the loss function is usually a softmax, which can assign only the single most probable label to an image. Annotating multi-label images therefore requires a more suitable method. In addition, automatic image annotation algorithms increasingly target large databases of hundreds of thousands to millions of images, so time cost must be considered and a more compact and efficient feature representation explored.
Summary of the invention
The technical problem to be solved by the present invention is to provide an image labeling method based on convolutional neural networks and binary coding features that achieves multi-label annotation of images at low cost and with high efficiency.
To solve the above technical problem, the present invention provides an image labeling method based on convolutional neural networks and binary coding features, comprising the following steps:
constructing an Inception V3 base network model;
truncating the Inception V3 base network model at its last pooling layer, removing its Logits and softmax function, and using a sigmoid function as the activation function of the last layer, to obtain a modified first base network model;
adding two fully connected layers to the first base network model, using a sigmoid function as the activation function of the last layer, to obtain a multi-label classification network model;
training the multi-label classification network model on the training set to optimize the weights of the multi-label classification network model;
labeling the feature vector set of a target image with the trained multi-label classification network model to obtain the multi-label probability output of the target image;
combining the multi-label probability output, annotating the target image using the TagProp algorithm.
Preferably, "training the multi-label classification network model on the training set to optimize the weights of the multi-label classification network model" specifically includes:
training the multi-label classification network model on the training set to obtain a loss function;
fine-tuning the multi-label classification network model accordingly; wherein fine-tuning specifically includes: fixing the weights of the convolutional layers before the two fully connected layers, and optimizing the two fully connected layers by backpropagation training.
Preferably, "combining the multi-label probability output, annotating the target image using the TagProp algorithm" specifically includes:
for a target image x, the probability that it possesses the j-th label, i.e. y_j = +1, is:
p(y_j = +1) = Σ_i π_i p(y_j = +1 | x_i),
where π_i denotes the prediction weight, and p(y_j = +1 | x_i) denotes the probability that the target picture has the j-th label conditioned on x_i, namely:
p(y_j = +1 | x_i) = 1 − ε if y_ij = +1, and ε otherwise,
where ε is a predetermined value;
solving by maximizing the log-likelihood of the labels in the training set, the loss function of the model is:
L = Σ_j c_j log p(y_j)
where the parameter c_j measures the loss of image X belonging to label j.
Preferably, the weight π_i is calculated based on distance, i.e.
π_i = exp(−d_h(x, x_i)) / Σ_{i′} exp(−d_h(x, x_{i′})),
where d_h(x, x_i) = h_i d(x, x_i), and d(x, x_i) is the base distance between x and x_i.
Preferably, a sigmoid function is used to improve p(y_j = +1) = Σ_i π_i p(y_j = +1 | x_i), i.e. p(y_j = +1) = σ(α_j z_j + β_j), where α_j is a weight, β_j is a bias, and z_j is the weighted average of label j among the neighbours of the target image X, i.e. z_j = Σ_i π_i y_ij.
Preferably, "the parameter c_j measures the loss of image X belonging to label j" specifically includes:
when y_j = +1, c_j = 1/N⁺; when y_j = −1, c_j = 1/N⁻; where N⁺ denotes the number of pictures in the training set that belong to label j, and N⁻ denotes the number of pictures in the training set that do not belong to it.
Beneficial effects of the present invention: the invention proposes a convolutional neural network image annotation model based on a sigmoid loss function. By building a new network model, extracting binary coding features of images, and then performing image annotation with the TagProp algorithm, multi-label annotation of images is achieved at low cost, high speed and high efficiency.
Detailed description of the invention
Fig. 1 is a schematic diagram of the Inception network module;
Fig. 2 is a schematic diagram of the Inception V3 network structure;
Fig. 3 is a flow diagram of the invention;
Fig. 4 compares CNN-Sigmoid and CNN-Softmax, where (a) shows the experimental results on the Natural Scenes dataset and (b) shows the experimental results on the Corel5K dataset.
Specific embodiment
The present invention is further explained below with reference to the attached drawings and specific examples, so that those skilled in the art can better understand and practice it; the illustrated embodiments, however, do not limit the invention.
The present invention uses Inception V3 as the base network structure of the model. The Inception network achieves excellent classification performance while controlling both computation and parameter count. Rather than simply increasing the number of layers, it introduces the Inception Module, whose structure is shown in Fig. 1, a schematic diagram of the Inception network module. This modular design reduces the network's parameter count and its design space while increasing the network's width. The Inception network also introduces the idea of factorization, replacing one large convolution with two small convolutions, which further reduces the parameter count while adding non-linearity to the network.
Inception V3 makes two main modifications. On the one hand, it introduces factorization: a large convolution such as 7 × 7 is decomposed into two convolutions, 1 × 7 and 7 × 1, and 5 × 5 convolutions are decomposed similarly (1 × 5, 5 × 1). This both reduces the number of parameters and, because one convolution is split into two, deepens the network, enhancing the non-linearity of the network and the expressive power of the model. On the other hand, Inception V3 optimizes the Inception module so that it has three variants at the 35 × 35, 17 × 17 and 8 × 8 feature-map sizes. These modules appear only in the later part of the network, preceded by ordinary convolutional layers. Fig. 2 shows the Inception V3 network structure.
Referring to Fig. 3, the invention discloses an image labeling method based on convolutional neural networks and binary coding features.
The present invention fine-tunes the Inception V3 network model. To better suit multi-label classification, the invention truncates Inception V3 at its last pooling layer, removes the Logits and softmax function, adds two fully connected layers, and uses a sigmoid function as the activation function of the last layer. Both activation functions can be applied to multi-class situations; the difference is that with sigmoid the classes may overlap with each other, whereas with softmax the classes are mutually exclusive, which makes the sigmoid function better suited to multi-label classification.
Assume a training set I = {I_1, I_2, ···, I_N}, I_i ∈ R^{m×n}, where m and n are the height and width of a picture. In multi-label learning, every picture has multiple labels. If the total number of labels is c, the label vector of the i-th image is y_i. Since the sigmoid function is used, the probability that the i-th image has the j-th label is p_ij = σ(f_j(I_i)) = 1 / (1 + e^{−f_j(I_i)}), where f_j(I_i) is the j-th unit of the network's last-layer output. For labels, if the i-th image has the j-th label then y_ij = 1, otherwise y_ij = 0, and the ground-truth probability is p̂_ij = y_ij / ‖y_i‖_1. The loss function can then be defined as:
L = −Σ_i Σ_j p̂_ij log p_ij.
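A minimal NumPy sketch of this sigmoid-based multi-label loss (the small 1e-12 stabilizer and the array shapes are assumptions for illustration, not part of the patent):

```python
import numpy as np

def multilabel_loss(logits, labels):
    """Cross-entropy between normalized label targets and sigmoid outputs.

    logits: (N, c) last-layer outputs f_j(I_i); labels: (N, c) binary matrix y.
    """
    p = 1.0 / (1.0 + np.exp(-logits))                   # p_ij = sigmoid(f_j(I_i))
    p_hat = labels / labels.sum(axis=1, keepdims=True)  # p_hat_ij = y_ij / ||y_i||_1
    return float(-np.sum(p_hat * np.log(p + 1e-12)))    # L = -sum_ij p_hat_ij log p_ij

logits = np.array([[2.0, -1.0, 0.5]])   # one image, c = 3 labels
labels = np.array([[1.0, 0.0, 1.0]])    # the image carries labels 1 and 3
loss = multilabel_loss(logits, labels)  # ≈ 0.30
```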
To accelerate the training of the model, the present invention fine-tunes the network model: the weights of the preceding convolutional layers are fixed, and the two added fully connected layers are optimized by backpropagation training.
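The modified network can be sketched in Keras (an assumption about tooling: the patent names TensorFlow/Python but gives no code; the hidden-layer width, the optimizer, the placeholder binary cross-entropy loss, and `weights=None` instead of the ImageNet pre-trained weights the patent actually uses are all illustrative):

```python
import tensorflow as tf

NUM_LABELS = 260  # e.g. the Corel5K tag vocabulary; dataset-dependent

# Inception V3 truncated at its last pooling layer (no Logits/softmax head).
base = tf.keras.applications.InceptionV3(
    include_top=False, weights=None,
    input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # fix the convolutional weights for fine-tuning

# Two added fully connected layers; sigmoid on the last layer for multi-label output.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")
```

Only the two dense layers receive gradient updates under this setup, which is what makes the fine-tuning fast.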
The present invention extracts the output vector of the fully connected layer as the image feature X = {x_1, x_2, ···, x_N}; the corresponding tag set is Y = {y_1, y_2, ···, y_N}.
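The patent does not spell out how the binary codes of its title are formed from these features; one natural reading, sketched here as an assumption, is to threshold the sigmoid-activated outputs (the 0.5 threshold is illustrative):

```python
import numpy as np

def binarize_features(activations, threshold=0.5):
    # Sigmoid activations lie in (0, 1); thresholding yields a compact binary
    # code, which makes distance comparisons between images cheap to compute.
    return (activations >= threshold).astype(np.uint8)

feats = np.array([[0.91, 0.12, 0.55, 0.43]])
code = binarize_features(feats)  # -> [[1, 0, 1, 0]]
```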
To make better use of the high-level semantic features extracted by the CNN, the present invention applies these features to the TagProp image annotation model. TagProp is a nearest-neighbour model that uses neighbour voting: pictures at different visual distances from the target image should carry different weights in the vote, and the probability of each candidate label for the target image is a weighted sum.
TagProp learns the conditional distribution of a given image's labels by calculating pairwise distances. For a target image x, the probability that it possesses the j-th label, i.e. y_j = +1, is:
p(y_j = +1) = Σ_i π_i p(y_j = +1 | x_i)   (formula 2)
where π_i denotes the weight with which sample x_i predicts labels for x, and p(y_j = +1 | x_i) denotes the probability that the target picture has the j-th label conditioned on x_i, defined as:
p(y_j = +1 | x_i) = 1 − ε if y_ij = +1, and ε otherwise   (formula 3)
where ε is a very small value that keeps the probability from being 0. The weight π_i is calculated based on distance, i.e.
π_i = exp(−d_h(x, x_i)) / Σ_{i′} exp(−d_h(x, x_{i′}))   (formula 4)
where d_h(x, x_i) = h_i d(x, x_i), and d(x, x_i) is the base distance between x and x_i; the present invention uses the Euclidean distance to measure the distance between two images, and h_i ≥ 0 can be optimized. By calculating the probability of every label occurring in the target image x, the k labels with the largest probability values are taken as the final annotation.
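Formulas 2–4 can be sketched in NumPy as follows (a simplification: a single fixed scalar h rather than an optimized one, with the Euclidean base distance as in the description):

```python
import numpy as np

def tagprop_predict(x, X_train, Y_train, h=1.0, eps=1e-5, k=5):
    """Weighted nearest-neighbour tag prediction for one query image.

    x: (d,) query feature; X_train: (N, d) neighbour features;
    Y_train: (N, c) binary tag matrix. Returns tag probabilities and top-k tags.
    """
    d = np.linalg.norm(X_train - x, axis=1)          # base distances d(x, x_i)
    pi = np.exp(-h * d)
    pi = pi / pi.sum()                               # formula 4: weights sum to 1
    p_cond = np.where(Y_train == 1, 1.0 - eps, eps)  # formula 3
    p = pi @ p_cond                                  # formula 2
    return p, np.argsort(p)[::-1][:k]                # k most probable tags

X_train = np.array([[0.0, 0.0], [10.0, 10.0]])
Y_train = np.array([[1, 0], [0, 1]])
p, top = tagprop_predict(np.array([0.1, 0.0]), X_train, Y_train, k=1)
```

Here the query sits next to the first training image, so the first image's tag dominates the vote.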
Solving by maximizing the log-likelihood of the labels in the training set, the loss function of the model is:
L = Σ_j c_j log p(y_j)   (formula 5)
The parameter c_j accounts for the different numbers of pictures covered by each tag concept and measures the loss of image X belonging to label j. Specifically, when y_j = +1, c_j = 1/N⁺; when y_j = −1, c_j = 1/N⁻, where N⁺ and N⁻ respectively denote the number of pictures in the training set that belong and do not belong to label j. The elements of h are constrained to be non-negative, and the value of h is solved with a projected gradient algorithm.
It will be understood that semantic concepts with a lower frequency of occurrence obtain only a low prediction probability under the above method, even when they occur several times among the neighbours, so the method has low recall on semantic concepts that appear in few pictures. The prediction of y_j = +1 (formula 3) is therefore improved and smoothed using a sigmoid function, i.e.
p(y_j = +1) = σ(α_j z_j + β_j)   (formula 6)
where z_j = Σ_i π_i y_ij is the weighted average of label j among the neighbours of the target image X, α_j is a weight, and β_j is a bias. The sigmoid function here boosts the weight of relatively rare labels and weakens the weight of labels that occur more frequently.
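Formula 6 in NumPy, with illustrative values for α_j and β_j (in practice these are learned per label, so the numbers below are assumptions):

```python
import numpy as np

def calibrated_tag_prob(pi, y_col, alpha, beta):
    """p(y_j = +1) = sigmoid(alpha_j * z_j + beta_j), z_j = sum_i pi_i * y_ij."""
    z = float(np.dot(pi, y_col))  # weighted neighbour vote for tag j
    return 1.0 / (1.0 + np.exp(-(alpha * z + beta)))

pi = np.array([0.5, 0.3, 0.2])   # neighbour weights (sum to 1)
y_col = np.array([1, 1, 0])      # which neighbours carry tag j
prob = calibrated_tag_prob(pi, y_col, alpha=4.0, beta=-2.0)  # sigmoid(1.2)
```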
Based on the above image labeling method, test analysis was carried out on several datasets, as follows:
(1) Datasets
The present invention assesses the validity of the proposed model on multiple datasets: Natural Scenes, Corel5K, ESP-Game and IAPRTC-12. The information of each dataset is shown in Table 1. These datasets are briefly introduced below.
Table 1: information of each dataset
Dataset | Images | Labels | Avg. labels | Training images | Test images |
---|---|---|---|---|---|
Natural Scenes | 2000 | 5 | 2.3 | 1500 | 500 |
Corel5K | 5000 | 260 | 3.4 | 4000 | 1000 |
ESP-Game | 20000 | 268 | 4.7 | 15000 | 5000 |
IAPRTC-12 | 19627 | 291 | 5.7 | 15000 | 4627 |
The Natural Scenes dataset is small, with 2000 images divided into 5 classes: sunset, desert, forest, ocean and mountain; every image has 1–2 labels.
The Corel-5K dataset is of medium size, with 5000 images covering 260 kinds of labels such as weather, scenery, buildings and vehicles. Every image has 1–5 labels, and the average number of labels per image is 3.5. Because the label information of Corel-5K is relatively accurate, it is often used in multi-label image classification experiments.
The ESP-Game image set is larger, with 20770 images in total. It has 268 kinds of labels with very wide coverage, including drawings, buildings, animals and so on. Every image has 1–15 labels, with an average of 4.6 per image. Since the ESP-Game dataset is large and contains some erroneous labels, some processing was applied to remove images with inaccurate label information, leaving 20000 images for the experiments.
IAPRTC-12 is likewise a larger dataset, containing 19627 images in total. Its labels are sentences with practical meaning described in various languages; natural language processing techniques were used to extract the main terms and convert them into a format similar to the other datasets. After statistics, it has 291 kinds of labels, and the average number of labels per image is 5.7. Images containing people account for a large proportion of IAPRTC-12.
(2) Experimental setup and evaluation indexes
Experiments were carried out on a server equipped with GPUs; the server runs Ubuntu 16.04 and has two NVIDIA GeForce TITAN graphics cards. The experiments use the TensorFlow deep learning framework, with Python as the programming language.
As stated before, the network model used is the Inception V3 network pre-trained on ImageNet. When fine-tuning it, the learning rate is set to 0.0001 for the two smaller datasets, Natural Scenes and Corel-5K, and to 0.0005 for the two larger datasets, ESP-Game and IAPRTC-12. The exponential decay of the learning rate is set to 0.99995 in all cases, the mini-batch size is 32, and dropout is 0.5. In addition, when constructing the candidate set of k labels composing an image's annotation, k is set to 2 for Natural Scenes, 5 for Corel-5K, 6 for ESP-Game and 7 for IAPRTC-12.
Image annotation belongs to multi-label learning, so the present invention introduces some evaluation indexes for multi-label classification. Consider a test set S = {(x_1, Y_1), (x_2, Y_2), ···, (x_p, Y_p)}, where Y is the tag set.
1. Hamming loss (HL):
HL = (1/p) Σ_{i=1}^{p} (1/Q) |h(x_i) Δ Y_i|
where Q is the number of labels in the sample set, h(x_i) is the tag set predicted for sample i, and Δ is the XOR (symmetric difference) operation. HL assesses how many times a sample is misclassified: for example, a sample that does not belong to label A but is wrongly assigned it, or a sample that belongs to label A but is not predicted as such. Equivalently, the Hamming loss gives the numerical distance between the result sequence predicted by the classifier and the true result sequence. The smaller the HL value, the better the prediction.
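A minimal sketch of the Hamming loss over binary prediction and ground-truth matrices (the matrix shapes are an assumed encoding of the definition above):

```python
import numpy as np

def hamming_loss(pred, true):
    """Average fraction of label slots where prediction and truth disagree (XOR).

    pred, true: (p, Q) binary matrices over p samples and Q labels.
    """
    p, Q = true.shape
    return np.sum(pred != true) / (p * Q)

pred = np.array([[1, 0, 1], [0, 1, 0]])
true = np.array([[1, 1, 1], [0, 1, 1]])
hl = hamming_loss(pred, true)  # 2 disagreements over 6 slots -> 1/3
```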
2. One-error (OE):
OE = (1/p) Σ_{i=1}^{p} [argmax_y f(x_i, y) ∉ Y_i]
where f(x_i, y) is the prediction score of sample i for label y. OE indicates the probability that the top-scoring label in the output is not in the true tag set. The smaller the OE value, the better the prediction.
3. Coverage (C):
C = (1/p) Σ_{i=1}^{p} max_{y ∈ Y_i} rank_f(x_i, y) − 1
where rank_f(x_i, y) ranks the labels by predicted probability, with the true labels sorted along with them, and the max term gives the position in the sorted sequence of the last true label. Coverage evaluates how far down the ranking one must go, on average, to cover all of a sample's true labels (rank starts from 1, hence the subtracted 1). The smaller the value, the better the performance.
4. Ranking loss (RL):
RL = (1/p) Σ_{i=1}^{p} (1 / (|Y_i| |Ȳ_i|)) |{(y′, y″) : f(x_i, y′) ≤ f(x_i, y″), y′ ∈ Y_i, y″ ∈ Ȳ_i}|
where f is the prediction function, Ȳ_i is the complement of Y_i, and |Y_i| denotes the number of actual labels of sample i. Ranking loss expresses the average probability that, in the ranked result, a label not in the relevant label set is ranked above one that is. The smaller the RL value, the better the prediction.
5. Average precision (AP):
AP = (1/p) Σ_{i=1}^{p} (1/|Y_i|) Σ_{y ∈ Y_i} |{y′ ∈ Y_i : rank_f(x_i, y′) ≤ rank_f(x_i, y)}| / rank_f(x_i, y)
It expresses, for each prediction result, the probability that a predicted label is correct and ranked towards the front of the result set. The larger the AP value, the better the prediction.
In addition, the present invention uses the most common indexes to measure the performance of an image labeling method: precision P, recall R, the F1 value, and N+. For a certain label i, precision is the ratio of correctly annotated images among all images annotated with that label, and recall is the ratio of correctly annotated images among the images that should have been annotated with it. Let W_c be the number of correctly annotated pictures, W_a the number of all retrieved pictures, and W_g the number of pictures in the test set relevant to the keyword; then
P = W_c / W_a,  R = W_c / W_g.
The F1 value balances precision and recall: F1 = 2PR / (P + R). N+ denotes the number of labels that are correctly annotated at least once among all labels; this index reflects the algorithm's coverage of the labels.
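The per-label P, R, F1 and N+ computations can be sketched as follows (macro-averaging across labels is an assumption; the text does not specify how the per-label values are aggregated):

```python
import numpy as np

def per_label_prf(pred, true):
    """Per-label precision/recall/F1, macro-averaged, plus N+ (the count of
    labels correctly annotated at least once). pred, true: (images, labels)."""
    tp = ((pred == 1) & (true == 1)).sum(axis=0).astype(float)
    annotated = (pred == 1).sum(axis=0)  # images annotated with each label
    relevant = (true == 1).sum(axis=0)   # images truly carrying each label
    P = np.divide(tp, annotated, out=np.zeros_like(tp), where=annotated > 0)
    R = np.divide(tp, relevant, out=np.zeros_like(tp), where=relevant > 0)
    F1 = np.divide(2 * P * R, P + R, out=np.zeros_like(tp), where=(P + R) > 0)
    return P.mean(), R.mean(), F1.mean(), int((tp > 0).sum())

pred = np.array([[1, 0], [1, 1]])
true = np.array([[1, 1], [0, 1]])
P, R, F1, n_plus = per_label_prf(pred, true)
```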
(3) Experimental results
To verify the performance of the proposed image labeling model, this embodiment first evaluates the model on the HL, OE, C, RL and AP indexes; then the model performance with and without the autoencoder is analysed and compared with the effect of other models to show the validity of the proposed model; finally, cross-dataset image retrieval and annotation are carried out between two datasets to show the generalization of the proposed model.
A. Model performance from the multi-label classification angle
From the multi-label classification angle, the model is evaluated on the Natural Scenes dataset with the multi-label classification indexes HL, OE, C, RL and AP. The binary coding feature and the non-binary coding feature are verified separately. The compared methods are currently popular multi-label learning methods such as ML-KNN, ML-I2C, InsDif and ML-LI2C; Table 2 shows the results of the models. It can be seen that the proposed model improves on all 5 indexes compared to the previous models, and that the two kinds of features obtain similar effects.
Table 2: comparison of the proposed model with other methods on the Natural Scenes dataset
Method | HL↓ | OE↓ | C↓ | RL↓ | AP↑ |
---|---|---|---|---|---|
ML-KNN | 0.169 | 0.30 | 0.93 | 0.168 | 0.80 |
ML-I2C | 0.159 | 0.311 | 0.889 | 0.156 | 0.803 |
InsDif | 0.152 | 0.259 | 0.833 | 0.14 | 0.834 |
ML-LI2C | 0.129 | 0.19 | 0.624 | 0.091 | 0.88 |
InceptionV3 | 0.101 | 0.15 | 0.554 | 0.076 | 0.901 |
Inception | 0.107 | 0.157 | 0.563 | 0.08 | 0.908 |
B. Model performance on multiple datasets
This part verifies the validity of our method by comparing it experimentally with other image labeling methods on the Corel5K, ESP-Game and IAPRTC-12 datasets, using the indexes P, R, F1 and N+.
Tests were first carried out on the two smaller datasets, Natural Scenes and Corel5K, comparing against the CNN+Softmax method. For an accurate comparison, the features used are all extracted directly from the proposed model. As can be seen from Fig. 4, both of our variants annotate better than CNN+Softmax, and CNN-TagProp performs better than CNN-TagProp (256 bit). Specifically, the F1 value of the CNN-Sigmoid method improves on the CNN-Softmax method by 6% and 8% respectively on the two datasets. This shows that changing the last loss function to sigmoid is effective: compared to softmax, sigmoid is better suited to multi-label annotation. It also shows that the high-level semantic features extracted by the proposed model have good discrimination, which benefits image labeling.
The embodiments described above are only preferred examples given to fully illustrate the present invention, and the protection scope of the present invention is not limited to them. Equivalent substitutions or transformations made by those skilled in the art on the basis of the present invention fall within the protection scope of the present invention, which is defined by the claims.
Claims (6)
1. An image labeling method based on convolutional neural networks and binary coding features, characterized by comprising the following steps:
constructing an Inception V3 base network model;
truncating the Inception V3 base network model at its last pooling layer, removing its Logits and softmax function, and using a sigmoid function as the activation function of the last layer, to obtain a modified first base network model;
adding two fully connected layers to the first base network model, using a sigmoid function as the activation function of the last layer, to obtain a multi-label classification network model;
training the multi-label classification network model on the training set to optimize the weights of the multi-label classification network model;
labeling the feature vector set of a target image with the trained multi-label classification network model to obtain the multi-label probability output of the target image;
combining the multi-label probability output, annotating the target image using the TagProp algorithm.
2. The image labeling method of claim 1, characterized in that "training the multi-label classification network model on the training set to optimize the weights of the multi-label classification network model" specifically includes:
training the multi-label classification network model on the training set to obtain a loss function;
fine-tuning the multi-label classification network model accordingly; wherein fine-tuning specifically includes: fixing the weights of the convolutional layers before the two fully connected layers, and optimizing the two fully connected layers by backpropagation training.
3. The image labeling method of claim 1, characterized in that "combining the multi-label probability output, annotating the target image using the TagProp algorithm" specifically includes:
for a target image x, the probability that it possesses the j-th label, i.e. y_j = +1, is:
p(y_j = +1) = Σ_i π_i p(y_j = +1 | x_i),
where π_i denotes the prediction weight, and p(y_j = +1 | x_i) denotes the probability that the target picture has the j-th label conditioned on x_i, namely:
p(y_j = +1 | x_i) = 1 − ε if y_ij = +1, and ε otherwise,
where ε is a predetermined value;
solving by maximizing the log-likelihood of the labels in the training set, the loss function of the model is:
L = Σ_j c_j log p(y_j),
where the parameter c_j measures the loss of image X belonging to label j.
4. The image labeling method of claim 3, characterized in that the weight π_i is calculated based on distance, i.e.
π_i = exp(−d_h(x, x_i)) / Σ_{i′} exp(−d_h(x, x_{i′})),
where d_h(x, x_i) = h_i d(x, x_i), and d(x, x_i) is the base distance between x and x_i.
5. The image labeling method of claim 3, characterized in that a sigmoid function is used to improve p(y_j = +1) = Σ_i π_i p(y_j = +1 | x_i), i.e. p(y_j = +1) = σ(α_j z_j + β_j), where α_j is a weight, β_j is a bias, and z_j = Σ_i π_i y_ij is the weighted average of label j among the neighbours of the target image X.
6. The image labeling method of claim 3, characterized in that "the parameter c_j measures the loss of image X belonging to label j" specifically includes:
when y_j = +1, c_j = 1/N⁺; when y_j = −1, c_j = 1/N⁻; where N⁺ denotes the number of pictures in the training set that belong to label j, and N⁻ denotes the number of pictures in the training set that do not belong to it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910791484.9A CN110516098A (en) | 2019-08-26 | 2019-08-26 | Image labeling method based on convolutional neural networks and binary coding feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910791484.9A CN110516098A (en) | 2019-08-26 | 2019-08-26 | Image labeling method based on convolutional neural networks and binary coding feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110516098A true CN110516098A (en) | 2019-11-29 |
Family
ID=68627926
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910791484.9A Pending CN110516098A (en) | 2019-08-26 | 2019-08-26 | Image labeling method based on convolutional neural networks and binary coding feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516098A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416384A (en) * | 2018-03-05 | 2018-08-17 | 苏州大学 | Image tag annotation method, system, device and computer-readable storage medium |
CN110163234A (en) * | 2018-10-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Model training method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
XINJIAN WU et al.: "A Novel Model for Multi-label Image Annotation", 2018 24th International Conference on Pattern Recognition (ICPR) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382800A (en) * | 2020-03-11 | 2020-07-07 | 上海爱数信息技术股份有限公司 | Multi-label multi-classification method suitable for sample distribution imbalance |
CN111382800B (en) * | 2020-03-11 | 2022-11-25 | 上海爱数信息技术股份有限公司 | Multi-label multi-classification method suitable for sample distribution imbalance |
CN111639755A (en) * | 2020-06-07 | 2020-09-08 | 电子科技大学中山学院 | Network model training method and device, electronic equipment and storage medium |
CN111639755B (en) * | 2020-06-07 | 2023-04-25 | 电子科技大学中山学院 | Network model training method and device, electronic equipment and storage medium |
CN112766330A (en) * | 2021-01-07 | 2021-05-07 | 济南浪潮高新科技投资发展有限公司 | Image multi-label classification method and device |
CN112732967A (en) * | 2021-01-08 | 2021-04-30 | 武汉工程大学 | Automatic image annotation method and system and electronic equipment |
CN112732967B (en) * | 2021-01-08 | 2022-04-29 | 武汉工程大学 | Automatic image annotation method and system and electronic equipment |
CN113096080A (en) * | 2021-03-30 | 2021-07-09 | 四川大学华西第二医院 | Image analysis method and system |
CN113096080B (en) * | 2021-03-30 | 2024-01-16 | 四川大学华西第二医院 | Image analysis method and system |
CN114139656A (en) * | 2022-01-27 | 2022-03-04 | 成都橙视传媒科技股份公司 | Image classification method based on deep convolution analysis and broadcast control platform |
CN114550916A (en) * | 2022-02-23 | 2022-05-27 | 天津大学 | Device for classifying, identifying and positioning common lung diseases based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516098A (en) | Image labeling method based on convolutional neural networks and binary coding feature | |
Yu et al. | Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition | |
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN110059198B (en) | Discrete hash retrieval method of cross-modal data based on similarity maintenance | |
CN106021364B | Construction of an image-search relevance prediction model, and image search method and apparatus | |
CN1307579C (en) | Methods and apparatus for classifying text and for building a text classifier | |
Zhao et al. | Large-scale category structure aware image categorization | |
Perez-Martin et al. | Improving video captioning with temporal composition of a visual-syntactic embedding | |
CN105393264A (en) | Interactive segment extraction in computer-human interactive learning | |
CN107220373A | Pulmonary nodule CT image hash retrieval method based on medical signs and convolutional neural networks | |
CN108897791B (en) | Image retrieval method based on depth convolution characteristics and semantic similarity measurement | |
CN101561805A (en) | Document classifier generation method and system | |
CN111931505A (en) | Cross-language entity alignment method based on subgraph embedding | |
CN111488917A (en) | Garbage image fine-grained classification method based on incremental learning | |
CN111461175B (en) | Label recommendation model construction method and device of self-attention and cooperative attention mechanism | |
CN105930792A (en) | Human action classification method based on video local feature dictionary | |
CN110598022B (en) | Image retrieval system and method based on robust deep hash network | |
Wang et al. | One-shot learning for long-tail visual relation detection | |
CN110765285A (en) | Multimedia information content control method and system based on visual characteristics | |
CN113032613A (en) | Three-dimensional model retrieval method based on interactive attention convolution neural network | |
CN113806580A (en) | Cross-modal Hash retrieval method based on hierarchical semantic structure | |
Singh et al. | Feature selection based classifier combination approach for handwritten Devanagari numeral recognition | |
CN111144453A (en) | Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data | |
Cheng et al. | Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval | |
CN106033546A (en) | Behavior classification method based on top-down learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20191129 |