CN110598776A - Image classification method based on intra-class visual mode sharing
- Publication number
- CN110598776A CN110598776A CN201910830812.1A CN201910830812A CN110598776A CN 110598776 A CN110598776 A CN 110598776A CN 201910830812 A CN201910830812 A CN 201910830812A CN 110598776 A CN110598776 A CN 110598776A
- Authority
- CN
- China
- Prior art keywords
- image
- visual
- class
- dictionary
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention provides an image classification method based on intra-class visual mode sharing, which comprises the following steps: generating image object windows; extracting depth features of the image windows; visual dictionary learning based on the intra-class sharing characteristic, in which a structured visual dictionary with the intra-class sharing characteristic is obtained from the depth features of the candidate object windows of all semantic-category images by optimizing a visual dictionary learning model; generating object windows for the input image and extracting their features; feature coding of the object windows; integrating the visual features to construct the image global feature; and predicting the semantic label with an SVM classifier. By cooperatively mining the visual dictionary words with shared characteristics within each semantic category, the method analyzes and solves the problem from the practically valuable perspective of introducing the sharing characteristic of visual patterns, enhances the semantic expressiveness of the image features, and improves the accuracy of image category recognition.
Description
Technical Field
The invention relates to the technical field of image recognition, in particular to an image classification method based on intra-class visual mode sharing.
Background
With the continuous development of digital multimedia and internet technology, human society has entered a big-data age in which multimedia data grows rapidly. Among the different forms of multimedia data, image data, being intuitive and easy to acquire, plays an important role in many aspects of social life, so effectively analyzing and understanding the content of image data has become increasingly important. In recent years, many image semantic object classification methods have made progress in visual feature generation, object model construction, and strongly supervised learning. However, owing to the semantic gap between bottom-layer visual features and middle- and high-layer information, existing methods still progress slowly on key problems such as discriminative feature construction, collaborative analysis of associated information, and the semantics of visual features.
For the image classification problem, current research focuses mainly on constructing semantic representations of image features. When the image feature representation fully describes the semantic content of the object, even a simple linear classifier can accurately predict the semantic content of the image. Early image features were based on bottom-layer visual cues such as color, shape and texture, producing histogram representations of visual information through manually defined feature construction. However, such bottom-layer representations are only statistical descriptions of visual information and can hardly depict semantic object content effectively, so in practical classification tasks the category of an image cannot be predicted accurately. To address this, subsequent research turned to machine learning to extract image feature representations with stronger semantic discriminability. Among the many image classification models, methods based on visual dictionary learning decompose the construction of image semantic feature representations into four sub-problems: bottom-layer feature extraction, visual dictionary learning, local feature coding, and image global feature generation.
In current image classification methods based on visual dictionary learning, the learned dictionary words are mutually independent; the correlation among dictionary words is not explored, which weakens the discriminative power of the image feature representations constructed from the dictionary. In fact, cooperatively mining correlated dictionary words during visual dictionary learning can effectively enhance the consistency of feature representations within the same semantic category and the difference between feature representations of different semantic categories, and finally improve the performance of semantic object category prediction.
Disclosure of Invention
Aiming at the above defects in the prior art, the technical problem to be solved by the invention is to provide an image classification method based on intra-class visual mode sharing, so as to address the problem that image feature representations carry insufficient semantic information.
The technical scheme adopted by the invention for realizing the purpose is as follows: an image classification method based on intra-class visual mode sharing comprises the following steps:
image object window generation: given an image training set containing multiple semantic class objects, generating a candidate object window for each image in the image training set;
extracting depth features of an image window: extracting the depth feature of the candidate object window;
visual dictionary learning based on in-class sharing characteristics: according to the depth characteristics of the candidate object windows of all semantic category images, a structured visual dictionary with in-class sharing characteristics is obtained by optimizing a visual dictionary learning model;
generating candidate object windows of the image for the input image with unknown semantic category, and extracting the depth characteristics of the candidate object windows;
calculating the characteristic codes of candidate object windows of the input images according to the structured visual dictionary;
combining object window feature codes based on the feature codes of all object windows of the input image to construct an image global feature representation;
and predicting semantic category labels of the input images by utilizing a linear SVM classifier according to the image global feature representation to realize the classification of the images.
The candidate object windows of each image in the image training set are generated by the EdgeBox algorithm.
The depth features of the candidate object windows are extracted with the VGG19 deep network model.
The visual dictionary learning model is optimized according to the following objective:

$$\min_{D,\{A_i\},\{Z_i\}}\ \sum_{i=1}^{C}\Big(\|X_i-D_{\in i}A_i\|_F^2+\|X_i-DZ_i\|_F^2+\alpha\|A_i-Z_i\|_F^2+\beta\sum_{j\neq i}\|D_i^{\top}D_j\|_F^2+\lambda_1\|A_i\|_{2,1}+\lambda_2\|Z_i\|_{2,1}\Big)$$

In the above formula, $X_i$ is the visual feature matrix of all training samples of the $i$-th semantic object class; $D_{\in i}$ denotes the class-specific visual dictionary in which the dictionary words corresponding to the $i$-th class in the structured visual dictionary $D$ are retained and the dictionary words of the remaining classes are set to zero; $A_i$ is the matrix of representation coefficients of the feature matrix $X_i$ on the class-specific dictionary $D_{\in i}$; $D$ is the structured visual dictionary to be optimized, the collection of the dictionary words of all semantic object classes; $Z_i$ is the matrix of representation coefficients of $X_i$ on the structured visual dictionary $D$; $D_i$ and $D_j$ denote the sub-dictionaries of the $i$-th and $j$-th object classes within $D$; the symbol $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; and the parameters $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$ weight the different cost terms of the objective function.
The objective function for computing the feature code of a candidate object window of the input image is:

$$\min_{y}\ \|x-Dy\|_2^2+\eta\|y\|_1$$

where $x$ is the depth visual feature of the object window, $y$ is the object window feature code to be solved, $D$ is the structured visual dictionary, and the parameter $\eta$ controls the number of nonzero elements in the feature code $y$, i.e. its sparsity.
The invention has the following advantages and beneficial effects:
1. Current image classification methods based on visual dictionary learning neglect the correlation constraints among words when learning the dictionary, which leaves the image feature representation short of semantic information. The method of the invention analyzes and solves this problem from the practically valuable perspective of introducing the sharing characteristic of visual patterns: by cooperatively mining the visual dictionary words with shared characteristics within each semantic category, it enhances the semantic expressiveness of the image feature representation and improves the accuracy of image category recognition.
2. The invention requires no manual participation and achieves high classification accuracy. Unlike current dictionary-learning-based image classification methods, which learn each dictionary word in isolation, the method introduces the sharing characteristic of visual patterns within a semantic object class, cooperatively mines the visual dictionary words with shared characteristics in the same semantic class, and establishes association constraints among dictionary words, thereby alleviating the lack of semantic information in current image visual feature representations.
3. The method of the present invention is practical and effective.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of a set of visual features based on an image object window according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The method is implemented on the Matlab R2016b experimental platform. As shown in FIG. 1, the method comprises two parts. The intra-class shared visual pattern mining part involves image object window generation, window depth feature extraction, and visual dictionary learning based on the intra-class sharing characteristic. The image global feature construction and classification part involves four steps: input image object window generation and feature extraction, object window feature coding, visual feature integration to construct the image global feature, and SVM classifier semantic label prediction. The specific steps are as follows:
in-class shared visual pattern mining:
step one, generating an image object window
Given a training set of images containing multiple semantic class objects, candidate object windows of each image are generated using the EdgeBox algorithm (see C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, 2014.).
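The window-generation step can be sketched as follows. This is a toy grid-based stand-in (the function name, scales and stride are illustrative assumptions, not from the patent): the real EdgeBox instead scores candidate boxes by the edge contours they fully enclose.

```python
def candidate_windows(img_h, img_w, scales=(64, 128), stride=32):
    """Toy stand-in for EdgeBox: enumerate square candidate object
    windows (x, y, w, h) on a regular grid at several scales.
    EdgeBox proper would rank such boxes by enclosed edge contours."""
    windows = []
    for s in scales:
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                windows.append((x, y, s, s))
    return windows

# 128x128 image: 9 windows at scale 64 plus the full-image window at 128
wins = candidate_windows(128, 128)
```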
Step two, extracting depth features of image windows
The depth features of the image candidate object windows are extracted using the VGG19 deep network model (see K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.).
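The shape of this step is: crop each candidate window, resize it to a fixed size, and map it to a fixed-length descriptor. In the sketch below a random projection stands in for the VGG19 forward pass (in practice one would run a pretrained VGG19 up to a fully connected layer); apart from the 4096-d output size, all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random projection standing in for the VGG19 forward pass; the 4096-d
# output mimics a VGG fully connected layer, everything else is illustrative.
W = rng.standard_normal((4096, 32 * 32 * 3))

def window_feature(image, box):
    """Crop one candidate window, resize it to a fixed 32x32 shape by
    nearest-neighbour sampling, and project it to a 4096-d descriptor."""
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    ys = np.linspace(0, h - 1, 32).astype(int)
    xs = np.linspace(0, w - 1, 32).astype(int)
    patch = crop[np.ix_(ys, xs)]            # (32, 32, 3) resized window
    return W @ patch.reshape(-1)

img = rng.random((128, 128, 3))
f = window_feature(img, (16, 16, 64, 64))
```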
Step three, learning a visual dictionary based on in-class sharing characteristics
In order to mine the visual patterns with the sharing characteristic within each semantic category, the embodiment of the invention designs the following structured visual dictionary, whose mathematical form is:

D = [D_1, D_2, ..., D_C]

where D denotes the constructed visual dictionary, formed by concatenating the sub-dictionaries D_i, i = 1, 2, ..., C, of the individual object classes, and C is the number of object classes contained in the image set.
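The structured dictionary D = [D_1, ..., D_C] and the class-restricted dictionary D_{∈i} can be sketched in a few lines; the sizes and helper names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
C, d, k = 3, 64, 10          # classes, feature dimension, words per class

# Per-class sub-dictionaries D_i with unit-norm columns ("words"),
# concatenated into the structured dictionary D = [D_1, D_2, ..., D_C].
subdicts = []
for _ in range(C):
    Di = rng.standard_normal((d, k))
    subdicts.append(Di / np.linalg.norm(Di, axis=0))
D = np.hstack(subdicts)

def class_restricted(D, i, k):
    """D_{in i}: keep the columns (words) of class i, zero out the rest."""
    Dz = np.zeros_like(D)
    Dz[:, i * k:(i + 1) * k] = D[:, i * k:(i + 1) * k]
    return Dz
```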
According to the depth features of the candidate object windows of all semantic image classes, a structured visual dictionary D with the intra-class sharing characteristic is obtained by optimizing the visual dictionary learning model constructed as follows:

$$\min_{D,\{A_i\},\{Z_i\}}\ \sum_{i=1}^{C}\Big(\|X_i-D_{\in i}A_i\|_F^2+\|X_i-DZ_i\|_F^2+\alpha\|A_i-Z_i\|_F^2+\beta\sum_{j\neq i}\|D_i^{\top}D_j\|_F^2+\lambda_1\|A_i\|_{2,1}+\lambda_2\|Z_i\|_{2,1}\Big)$$

In the above formula, $X_i$ is the visual feature matrix of all training samples of the $i$-th semantic object class; $D_{\in i}$ denotes the class-specific visual dictionary in which the dictionary words corresponding to the $i$-th class in the structured dictionary $D$ are retained and the dictionary words of the remaining classes are set to zero; $A_i$ is the matrix of representation coefficients of the feature matrix $X_i$ on the class-specific dictionary $D_{\in i}$; $D$ is the structured visual dictionary to be optimized, the collection of the dictionary words of all semantic object classes; $Z_i$ is the matrix of representation coefficients of $X_i$ on the structured dictionary $D$; $D_i$ and $D_j$ denote the sub-dictionaries of the $i$-th and $j$-th object classes within $D$; and the symbol $\|\cdot\|_F$ denotes the Frobenius norm of a matrix.

In the constructed visual dictionary learning model, the first two cost terms, $\|X_i-D_{\in i}A_i\|_F^2$ and $\|X_i-DZ_i\|_F^2$, are data reconstruction residuals: they make both the learned class-specific visual dictionary $D_{\in i}$ and the structured visual dictionary $D$ reconstruct the visual features of the $i$-th semantic object class effectively. The consistency term $\|A_i-Z_i\|_F^2$ constrains the two sets of representation coefficients to agree, which selects the dictionary words of the $i$-th class within the structured dictionary $D$ to reconstruct the visual features of that class and keeps the reconstruction coefficients of feature data of the same object class consistent. The orthogonality term $\sum_{j\neq i}\|D_i^{\top}D_j\|_F^2$ between the sub-dictionaries $D_i$, $i = 1, \ldots, C$, enhances the difference between the dictionary words of different classes and preserves the discriminative power of the subsequent image feature coding. The last two terms, $\lambda_1\|A_i\|_{2,1}$ and $\lambda_2\|Z_i\|_{2,1}$, are regularization constraints on the representation coefficients $A_i$ and $Z_i$; the embodiment uses $l_{2,1}$-norm group sparsity, which makes the solved coefficient matrices sparse by rows. Jointly optimizing the $l_{2,1}$ group-sparsity regularizers with the coefficient-consistency constraint both lets the $i$-th sub-dictionary $D_i$ reconstruct the visual features of its semantic class and effectively mines the visual patterns with the sharing characteristic inside the $i$-th class, finally improving the consistency of feature representations within a semantic class and the difference between the features of different semantic classes.
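The l2,1 group-sparsity regularizer in the last two cost terms is simply the sum of the row-wise l2 norms; a minimal sketch:

```python
import numpy as np

def norm_21(A):
    """l2,1 norm: the sum of the l2 norms of the rows of A. Penalising it
    drives entire rows of the coefficient matrix to zero (row sparsity), so
    the same dictionary words are selected across the samples of a class."""
    return float(np.linalg.norm(A, axis=1).sum())

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 1.0]])
# row l2 norms are 5, 0 and 1, so norm_21(A) = 6
```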
The parameters α, β, λ1, λ2 in the dictionary learning model, which weight the different cost terms of the objective function, are all set empirically to 0.01 through experiments. The dictionary learning objective is a multi-variable optimization problem and is computed iteratively with an alternating optimization strategy: one variable of the objective function is optimized while the remaining variables are held fixed, so the original problem is converted into a sequence of convex sub-problems that are solved in turn.
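The alternating strategy can be illustrated on a stripped-down version of the model. Assumptions in this sketch: the l2,1 and orthogonality terms are dropped, leaving a plain ridge-regularized least-squares problem for each variable; all names and sizes are illustrative.

```python
import numpy as np

def alt_optimize(X, D0, lam=0.1, iters=5):
    """Alternating-optimization sketch on the simplified objective
    min ||X - D Z||_F^2: fix D and solve a ridge system for the codes Z,
    then fix Z and solve a least-squares system for the dictionary D."""
    D = D0.copy()
    for _ in range(iters):
        Z = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ X)
        D = X @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(Z.shape[0]))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # renormalise words
    return D, Z

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 50))        # 20-d features, 50 samples
D, Z = alt_optimize(X, rng.standard_normal((20, 8)))
```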
Constructing and classifying image global features:
step one, generating an input image object window and extracting characteristics
Given an input image with unknown semantic category, candidate object windows of the image are generated by utilizing an EdgeBox algorithm, and VGG19 deep network visual features of the candidate object windows are further extracted.
Step two, object window characteristic coding
The feature codes of the candidate object windows of the input image are computed with the acquired structured visual dictionary D by minimizing the following objective function:

$$\min_{y}\ \|x-Dy\|_2^2+\eta\|y\|_1$$

where x is the depth visual feature of an object window, y is the object window feature code to be optimized, D is the structured visual dictionary, and the parameter η controls the number of nonzero elements in the feature code y, i.e. its sparsity.
To solve the above objective function, the feature-sign search algorithm (see Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Conference on Neural Information Processing Systems, pages 801-808, 2007.) is adopted to compute the variable y to be optimized, i.e. the feature code of the image window.
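Feature-sign search itself is somewhat involved; the same L1-regularized least-squares problem can be solved in a few lines with ISTA (proximal gradient), shown here as a hedged stand-in rather than the patent's solver.

```python
import numpy as np

def sparse_code(x, D, eta=0.1, iters=300):
    """ISTA for min_y 0.5*||x - D y||_2^2 + eta*||y||_1 -- a simple
    proximal-gradient stand-in for feature-sign search, which solves
    the same L1-regularised least-squares problem."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12     # Lipschitz constant of the gradient
    y = np.zeros(D.shape[1])
    for _ in range(iters):
        z = y - D.T @ (D @ y - x) / L                          # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - eta / L, 0.0)  # soft-threshold
    return y

# With D = I the minimiser is plain soft-thresholding of x.
y = sparse_code(np.array([1.0, 0.05, -0.5]), np.eye(3), eta=0.2)
```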
Step three, integrating visual features and constructing image global features
Based on the feature codes of all object windows of the input image, the method borrows the traditional Max-Pooling feature integration scheme (see Jianchao Yang, Kai Yu, Yihong Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1794-1801, 2009.) to combine the window feature codes into the image global feature representation.
The traditional Max-Pooling integration method operates on image local interest points and, in order to embed spatial distribution information into the image global feature, adds a step that partitions the image at different spatial scales. Unlike the traditional method, the present method operates on image object window regions, which effectively introduces image spatial distribution and object semantic information into the construction of the global feature; finally, the image global feature is obtained by coding all window features of the image and taking the maximum value in each feature dimension.
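The max-pooling integration over window codes reduces to an element-wise maximum:

```python
import numpy as np

def image_global_feature(window_codes):
    """Max-pooling integration: stack the feature codes of all candidate
    windows of one image and keep the maximum along each code dimension."""
    return np.max(np.stack(window_codes), axis=0)

codes = [np.array([0.2, 0.0, 0.7]),
         np.array([0.5, 0.1, 0.0])]
g = image_global_feature(codes)   # element-wise maxima: [0.5, 0.1, 0.7]
```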
Step four, forecasting semantic tags of SVM classifier
According to the image global feature representation constructed in step three, the semantic category label of the input image is predicted with a linear SVM classifier (see R.-E. Fan, K.-W. Chang, C.-J. Hsieh, et al. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008, 9: 1871-1874.), realizing the classification of the image.
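A linear SVM with hinge loss can be sketched with batch subgradient descent; this toy trainer is a stand-in for the LIBLINEAR solver the text uses, and all names and hyper-parameters are illustrative.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=500):
    """Toy linear SVM: batch subgradient descent on the regularised hinge
    loss, standing in for LIBLINEAR; labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                         # margin violations
        gw = lam * w - (X[viol] * y[viol, None]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w, b = w - lr * gw, b - lr * gb
    return w, b

def predict(X, w, b):
    return np.where(X @ w + b >= 0, 1, -1)
```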
Table 1. Accuracy evaluation of the method of the present invention and existing image classification methods on the UIUC8 object recognition database
As shown in the above table, the experiments compare the proposed method with existing methods on the UIUC8 object recognition database. The database contains 1972 images of 8 different sports categories. To compute the classification accuracy of the different methods, 70 images are randomly selected from each class as training data and 60 of the remaining images of that class as test data; the final classification accuracy is the average of the per-class accuracies.
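The averaging protocol above, with the final score as the mean of the per-class accuracies, can be expressed directly:

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, n_classes):
    """Final score as the text describes it: the average of the
    per-class classification accuracies."""
    accs = [(y_pred[y_true == c] == c).mean() for c in range(n_classes)]
    return float(np.mean(accs))

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
# per-class accuracies: 1/2, 2/3, 1  ->  mean = (1/2 + 2/3 + 1) / 3
```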
Note: for the image classification method LLC in the above table, see Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, T. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3360-3367, 2010; for the method LSC, see Lingqiao Liu, Lei Wang, and Xinwang Liu. In defense of soft-assignment coding. In IEEE International Conference on Computer Vision, pages 2486-2493, 2011; for the method CNN, see K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
Claims (5)
1. An image classification method based on intra-class visual mode sharing is characterized by comprising the following steps:
image object window generation: given an image training set containing multiple semantic class objects, generating a candidate object window for each image in the image training set;
extracting depth features of an image window: extracting the depth feature of the candidate object window;
visual dictionary learning based on in-class sharing characteristics: according to the depth characteristics of the candidate object windows of all semantic category images, a structured visual dictionary with in-class sharing characteristics is obtained by optimizing a visual dictionary learning model;
generating candidate object windows of the image for the input image with unknown semantic category, and extracting the depth characteristics of the candidate object windows;
calculating the characteristic codes of candidate object windows of the input images according to the structured visual dictionary;
combining object window feature codes based on the feature codes of all object windows of the input image to construct an image global feature representation;
and predicting semantic category labels of the input images by utilizing a linear SVM classifier according to the image global feature representation to realize the classification of the images.
2. The method of claim 1, wherein the generating of the candidate object window for each image in the image training set is implemented by an EdgeBox algorithm.
3. The method of claim 1, wherein the extracting the depth features of the candidate object window is performed by a VGG19 depth network model.
4. The method according to claim 1, wherein the visual dictionary learning model is optimized according to the following formula:

$$\min_{D,\{A_i\},\{Z_i\}}\ \sum_{i=1}^{C}\Big(\|X_i-D_{\in i}A_i\|_F^2+\|X_i-DZ_i\|_F^2+\alpha\|A_i-Z_i\|_F^2+\beta\sum_{j\neq i}\|D_i^{\top}D_j\|_F^2+\lambda_1\|A_i\|_{2,1}+\lambda_2\|Z_i\|_{2,1}\Big)$$

In the above formula, $X_i$ is the visual feature matrix of all training samples of the $i$-th semantic object class; $D_{\in i}$ denotes the class-specific visual dictionary in which the dictionary words corresponding to the $i$-th class in the structured visual dictionary $D$ are retained and the dictionary words of the remaining classes are set to zero; $A_i$ is the matrix of representation coefficients of the feature matrix $X_i$ on the class-specific dictionary $D_{\in i}$; $D$ is the structured visual dictionary to be optimized, the collection of the dictionary words of all semantic object classes; $Z_i$ is the matrix of representation coefficients of $X_i$ on the structured visual dictionary $D$; $D_i$ and $D_j$ denote the sub-dictionaries of the $i$-th and $j$-th object classes within $D$; the symbol $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; and the parameters $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$ weight the different cost terms of the objective function.
5. The method according to claim 1, wherein the objective function for computing the feature code of a candidate object window of the input image is:

$$\min_{y}\ \|x-Dy\|_2^2+\eta\|y\|_1$$

where $x$ is the depth visual feature of the object window, $y$ is the object window feature code to be solved, $D$ is the structured visual dictionary, and the parameter $\eta$ controls the number of nonzero elements in the feature code $y$, i.e. its sparsity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910830812.1A CN110598776A (en) | 2019-09-03 | 2019-09-03 | Image classification method based on intra-class visual mode sharing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110598776A true CN110598776A (en) | 2019-12-20 |
Family
ID=68857276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910830812.1A Pending CN110598776A (en) | 2019-09-03 | 2019-09-03 | Image classification method based on intra-class visual mode sharing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110598776A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239897A (en) * | 2014-09-04 | 2014-12-24 | 天津大学 | Visual feature representing method based on autoencoder word bag |
CN104331717A (en) * | 2014-11-26 | 2015-02-04 | 南京大学 | Feature dictionary structure and visual feature coding integrating image classifying method |
CN104537392A (en) * | 2014-12-26 | 2015-04-22 | 电子科技大学 | Object detection method based on distinguishing semantic component learning |
CN107704864A (en) * | 2016-07-11 | 2018-02-16 | 大连海事大学 | Well-marked target detection method based on image object Semantic detection |
Non-Patent Citations (2)
Title |
---|
KAREN SIMONYAN ET AL.: "Very deep convolutional networks for large-scale image recognition", International Conference on Learning Representations *
XIE YURUI: "Research on semantic information extraction and classification methods for images", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329884A (en) * | 2020-11-25 | 2021-02-05 | 成都信息工程大学 | Zero sample identification method and system based on discriminant visual attributes |
CN112329884B (en) * | 2020-11-25 | 2022-06-07 | 成都信息工程大学 | Zero sample identification method and system based on discriminant visual attributes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191220 |