CN111191691B

CN111191691B - Fine granularity image classification method based on deep user click characteristics of part-of-speech decomposition

Info

Publication number: CN111191691B
Application number: CN201911296150.0A
Authority: CN
Inventors: 俞俊; 谭敏; 周剑
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2023-09-29
Anticipated expiration: 2039-12-16
Also published as: CN111191691A

Abstract

The invention discloses a fine-grained image classification method based on deep user click features with part-of-speech decomposition. The method first takes user click data obtained from the Internet and applies natural language processing techniques such as word segmentation, stemming, and stop-word removal to obtain words together with their parts of speech. Suitable keywords are then selected from these words according to part of speech, and the keywords and their corresponding word frequencies are used to compute term frequency-inverse document frequency (TF-IDF) features. The feature vectors obtained in this way are integrated into a feature tensor, and a network constructed specifically for this feature is used for classification. While achieving high accuracy, the invention effectively addresses the semantic gap problem that traditional methods cannot overcome. A further benefit is that the small network architecture is easy to deploy, which makes the method better suited to practical production use. The method achieves excellent results on the Clickture-Dog dataset.

Description

Fine granularity image classification method based on deep user click characteristics of part-of-speech decomposition

Technical Field

The invention belongs to the field of fine-grained image classification (Fine-Grained Image Categorization, FGIC). Unlike traditional methods that classify images by their visual features, the method classifies images using text data, a modality different from the image itself, and constructs an end-to-end deep neural network. Without relying on complex traditional visual features, the invention can meet high-accuracy classification requirements using only user click data acquired from the Internet.

Background

Fine-grained image classification is a classical computer vision task. Unlike traditional classification, its purpose is to distinguish different subcategories within the same species. The task is very challenging because the differences between subcategories are subtle, while images within the same subcategory are affected by factors such as lighting, background, and occlusion. In real life there is also a strong need to identify subcategories of different species. For example, in ecological protection, effectively identifying different species of organisms is an important prerequisite for ecological research. Achieving low-cost fine-grained image recognition by means of computer vision would be of great significance to both academia and industry.

In terms of feature construction, fine-grained image classification has progressed from manual feature engineering, to multi-stage classification, and then to end-to-end learning. In terms of research approach, it has progressed from using only image data, to incorporating additional annotation information, to using only other types of data such as text. Because fine-grained classification involves large intra-class differences and subtle inter-class differences, traditional manual feature engineering cannot achieve satisfactory results. The development of deep learning in recent years has brought great opportunities to fine-grained classification, and a large number of deep neural network models have driven rapid progress in the field.

One goal of classifying fine-grained images with user click data is to address the semantic gap problem. Because of the semantic gap, existing algorithms that classify on visual features are deficient: an image can be deliberately forged so that it is meaningless to a human yet is treated as meaningful data by a computer. Furthermore, user click data is text data, which is easier to store than images, and in actual production, classification models based on text data are easier to deploy than image-based ones. Meanwhile, the rapid development of natural language processing (NLP) technology also supports text-based fine-grained image classification. These two points are unique advantages of text data and can be used to address the challenges of conventional fine-grained image classification based on visual features.

In existing methods that construct image features from click data, an image is usually represented as a vector of its click counts over the query text space. Since each text in the click data consists of one or more words, feature construction from click data generally falls into two types: methods based on the query text (the original text space) and methods based on query keywords (the words segmented from the text). The key problem of query-text-based methods is that the number of query texts is huge, so the user click features are too sparse and too high-dimensional, which hinders the extraction of deep features. Query-keyword-based methods solve these problems and also take the inherent relations between words into account. The invention is likewise based on query keywords.

Disclosure of Invention

As stated above, based on the idea of constructing click features from query keywords, the invention provides a fine-grained image classification method based on deep user click features with part-of-speech decomposition (fine-grained image classification with Factorized Deep Click feature, FDC). The method first takes user click data obtained from the Internet and applies natural language processing techniques such as word segmentation, stemming, and stop-word removal to obtain words together with their parts of speech. Suitable keywords are then selected from these words according to part of speech, and the keywords and their corresponding word frequencies are used to compute term frequency-inverse document frequency (TF-IDF) features. The feature vectors obtained in this way are integrated into a feature tensor, and a network constructed specifically for this feature is used for classification. The method mainly comprises the following steps:

Step (1): Constructing the part-of-speech dictionary set

First, natural language processing techniques such as word segmentation, stemming, and part-of-speech tagging are applied to the user query texts of the training images. Several parts of speech are selected, and for the word set of each part of speech, the words with the highest click counts are selected to construct a dictionary. Next, a number of keywords are extracted from each dictionary to construct the corpus used to generate TF-IDF features.

Step (2): Constructing the TF-IDF click tensor of the image

First, for any part of speech, the image is represented as a TF-IDF click vector according to the TF-IDF algorithm, using the corresponding word corpus and the user click data. Second, the TF-IDF click vectors under the different parts of speech are combined into a TF-IDF click tensor using an outer product operation.

Step (3): Training the deep click network

A relatively shallow click neural network classification model is built from convolutional layers and fully connected layers; each convolutional layer is followed by an activation function and a batch normalization (BN) layer. The network is trained with stochastic gradient descent.

Step (4): Fine-grained image classification based on the deep click network

The operations of steps (1)-(3) are performed on the image to extract its deep click feature vector, thereby achieving fine-grained image classification based on deep click features.

Further, the part-of-speech dictionary set in step (1) is constructed as follows:

1-1. A click dataset comprising $n$ images and $m'$ query texts is used. For any text click record $(x_i, q_j, c_{i,j})$, where $x_i$, $q_j$, and $c_{i,j}$ are respectively the image, the query text, and the number of clicks between that image and that text, word segmentation converts the record into the following word click set:

$$\big((x_i, w_{i,j,1}, c_{i,j}),\ (x_i, w_{i,j,2}, c_{i,j}),\ (x_i, w_{i,j,3}, c_{i,j}),\ \dots\big) \tag{1}$$

1-2. The operation of formula (1) is performed on all text click data, yielding a word click set composed of triplets $(x_i, w_{i,j,k}, c_{i,j})$. Word-form reduction (lemmatization), merging of repeated words, and part-of-speech tagging are then performed in sequence on the words in the set, giving a click matrix $C'$ composed of images, words, and the corresponding image-word click counts. The words in $C'$ are divided by part of speech into $M$ mutually disjoint sets. For the $m$-th part-of-speech set, the $\rho_m$ words with the most clicks are selected to form the $m$-th part-of-speech dictionary $D_m$, expressed as:

$$D_m = \{w_1^m, w_2^m, \dots, w_{\rho_m}^m\}$$

The words in $D_m$ constitute the corpus required to generate TF-IDF features under the $m$-th part of speech, and $w_j^m$ denotes the $j$-th word in $D_m$.
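Steps 1-1 and 1-2 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the whitespace tokenizer, and the hand-written part-of-speech lookup `pos_of` are all assumptions standing in for real word segmentation, lemmatization, and tagging tools.

```python
from collections import defaultdict


def expand_to_word_clicks(text_clicks, tokenize):
    """Formula (1): turn (image, query, clicks) records into (image, word, clicks) triplets."""
    return [(image, word, clicks)
            for image, query, clicks in text_clicks
            for word in tokenize(query)]


def build_click_matrix(word_clicks):
    """Merge repeated words: matrix[image][word] is the summed click count."""
    matrix = defaultdict(lambda: defaultdict(int))
    for image, word, clicks in word_clicks:
        matrix[image][word] += clicks
    return matrix


def build_pos_dictionaries(matrix, pos_of, top_k):
    """For each part of speech m in top_k, keep the top_k[m] most-clicked words."""
    totals = defaultdict(lambda: defaultdict(int))
    for word_counts in matrix.values():
        for word, clicks in word_counts.items():
            pos = pos_of.get(word)
            if pos in top_k:
                totals[pos][word] += clicks
    return {
        # Sort by descending clicks; ties are broken alphabetically.
        pos: [w for w, _ in sorted(totals[pos].items(),
                                   key=lambda kv: (-kv[1], kv[0]))[:k]]
        for pos, k in top_k.items()
    }


# Toy data (hypothetical): three text click records over two images.
text_clicks = [
    ("img1", "black poodle running", 5),
    ("img1", "small poodle", 3),
    ("img2", "white husky running", 4),
]
# Assumed part-of-speech lookup; a real system would use a tagger.
pos_of = {"black": "ADJ", "small": "ADJ", "white": "ADJ",
          "poodle": "NOUN", "husky": "NOUN", "running": "VERB"}

word_clicks = expand_to_word_clicks(text_clicks, str.split)
click_matrix = build_click_matrix(word_clicks)
dictionaries = build_pos_dictionaries(click_matrix, pos_of, {"NOUN": 2, "ADJ": 2})
print(dictionaries)
```

Here `top_k` plays the role of the per-part-of-speech sizes $\rho_m$; parts of speech that are not listed in it (the verb in this toy data) are simply left out of the dictionary set.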

Further, the specific construction steps of the TF-IDF click tensor of the image in the step (2) are as follows:

2-1. Using the click matrix $C'$ and the part-of-speech dictionary $D_m$ from step 1-2, construct the TF-IDF click vector of image $x_i$ under the $m$-th part of speech. For a word $w_j^m \in D_m$, the corresponding click count is denoted $c_{i,j}^m$, where $c_{i,j}^m$ is the total number of clicks on image $x_i$ under the $m$-th part-of-speech word $w_j^m$. Using $C'$ and the TF-IDF algorithm, construct the TF-IDF click vector $v_i^m$ of image $x_i$ under part of speech $m$; its $j$-th element $v_{i,j}^m$ is the TF-IDF weight of word $w_j^m$ for image $x_i$.

In this definition, $\mathbb{1}(\cdot)$ is the indicator function, $n$ is the total number of images, and $\rho'_j$ is the frequency of the $j$-th element over all $n$ images, which is used to compute the inverse document frequency in the TF-IDF algorithm.
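As a rough illustration of step 2-1, the sketch below computes a TF-IDF click vector for one image and one part-of-speech dictionary. The exact weighting used by the invention is not given in this text, so the code assumes a common TF-IDF form: the click count multiplied by $\log(n / \rho'_j)$, with $\rho'_j$ counted through an indicator on positive click counts. The function name and the toy click matrix are hypothetical.

```python
import math


def tfidf_click_vector(click_matrix, dictionary, image):
    """TF-IDF click vector of `image` over the words of one part-of-speech dictionary.

    ASSUMPTION: weight = click count * log(n / document frequency).
    """
    n = len(click_matrix)  # total number of images
    vector = []
    for word in dictionary:
        clicks = click_matrix[image].get(word, 0)
        # Indicator sum: number of images with at least one click on this word.
        doc_freq = sum(1 for counts in click_matrix.values()
                       if counts.get(word, 0) > 0)
        idf = math.log(n / doc_freq) if doc_freq else 0.0
        vector.append(clicks * idf)
    return vector


# Toy click matrix (hypothetical): image -> {word: click count}.
click_matrix = {
    "img1": {"poodle": 8, "running": 5},
    "img2": {"husky": 4, "running": 4},
    "img3": {"poodle": 1},
}
dictionary = ["poodle", "husky", "running"]

vector = tfidf_click_vector(click_matrix, dictionary, "img1")
print(vector)
```

Other variants (normalized term frequency, smoothed inverse document frequency) would fit the same interface; only the line that combines `clicks` and `idf` would change.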

2-2. After step 2-1 has been performed for all $M$ parts of speech, TF-IDF click vectors under the $M$ different parts of speech are obtained. The set of these vectors is denoted $V_i$ and is defined as follows, where $v_i^m$ is the TF-IDF click vector of image $x_i$ under the $m$-th part of speech:

$$V_i = \{v_i^1, v_i^2, \dots, v_i^M\}$$

2-3. The TF-IDF click tensor $t$ of the image is constructed from the part-of-speech-decomposed TF-IDF click vector set $V_i$. Its elements are constructed as follows, where $v_{i,j_m}^m$ is the $j_m$-th element of the TF-IDF click vector under the $m$-th part of speech:

$$t_{j_1, j_2, \dots, j_M} = f\big(v_{i,j_1}^1, v_{i,j_2}^2, \dots, v_{i,j_M}^M\big)$$

Here $f$ is the part-of-speech fusion function, which can be defined as any reasonable fusion operation (e.g., product, sum, average, or maximum). Note in particular that the TF-IDF click tensor $t$ is an $M$-mode tensor.
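The fusion in step 2-3 can be sketched as follows. The representation is an assumption made for illustration: the tensor is stored as a dictionary keyed by index tuples together with its shape, and the fusion function is any callable that reduces an iterable of numbers (product by default, which gives the outer product mentioned in step (2)). The function name `click_tensor` is hypothetical.

```python
from itertools import product
from math import prod


def click_tensor(vectors, fuse=prod):
    """Fuse M per-part-of-speech vectors into an M-mode tensor.

    Returns (shape, tensor), where tensor maps each index tuple
    (j_1, ..., j_M) to fuse(v^1[j_1], ..., v^M[j_M]).
    """
    shape = tuple(len(v) for v in vectors)
    tensor = {
        index: fuse(v[j] for v, j in zip(vectors, index))
        for index in product(*(range(size) for size in shape))
    }
    return shape, tensor


# Two toy TF-IDF click vectors (hypothetical values), so M = 2.
vectors = [[1.0, 2.0], [3.0, 0.0, 5.0]]

shape, t_product = click_tensor(vectors)        # outer product
_, t_sum = click_tensor(vectors, fuse=sum)      # sum fusion
_, t_max = click_tensor(vectors, fuse=max)      # maximum fusion
print(shape, t_product[(1, 2)], t_sum[(1, 2)], t_max[(1, 2)])
```

With product fusion, a zero in any one vector zeroes the whole corresponding slice of the tensor, whereas sum or maximum fusion keeps those entries non-zero; which behaviour is preferable depends on how sparse the click vectors are.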

Further, the specific structure of the network in the step (3) is as follows:

3-1. Network overall structure:

A structure of 4 convolutional layers plus 2 fully connected layers is employed. The first half of the network consists of the convolutional layers and the second half of the fully connected layers. Each convolutional layer is followed by a pooling layer, a BN layer, and a ReLU layer. A Dropout layer with a dropout rate of 0.8 is placed between the two fully connected layers.

In each of the four convolutional layers described in step 3-1, the convolution is an M-mode 1-dimensional convolution: a convolution module composed of $M$ consecutive one-dimensional convolution kernels, where the $m$-th kernel operates on the mode-$m$ unfolding of the click tensor. This lets the network adapt better to the data and achieve excellent recognition performance, and it is one of the core innovations of the invention.
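A simplified sketch of the M-mode 1-dimensional convolution idea is given below. It is only an illustration under several assumptions: a single channel, fixed rather than learned kernels, odd kernel lengths with zero padding, and a tensor stored as a dictionary keyed by index tuples plus a shape. Sliding a kernel along mode $m$ of every fibre is equivalent to applying it to the rows of the mode-$m$ unfolding. The function names are hypothetical.

```python
from itertools import product


def conv1d_along_mode(shape, tensor, kernel, mode):
    """Zero-padded 1-D cross-correlation along one mode; output keeps the input shape.

    ASSUMPTION: the kernel has odd length, so the padding is symmetric.
    """
    half = len(kernel) // 2
    out = {}
    for index in product(*(range(size) for size in shape)):
        total = 0.0
        for offset, weight in enumerate(kernel):
            pos = index[mode] + offset - half
            if 0 <= pos < shape[mode]:  # positions outside the tensor count as zero
                neighbour = index[:mode] + (pos,) + index[mode + 1:]
                total += weight * tensor[neighbour]
        out[index] = total
    return out


def m_mode_conv(shape, tensor, kernels):
    """Apply M one-dimensional kernels consecutively, the m-th along mode m."""
    for mode, kernel in enumerate(kernels):
        tensor = conv1d_along_mode(shape, tensor, kernel, mode)
    return tensor


# Toy 2-mode tensor (hypothetical values): [[1, 2, 3], [4, 5, 6]].
shape = (2, 3)
tensor = {(i, j): float(i * 3 + j + 1) for i in range(2) for j in range(3)}

# Sum along mode 0, identity along mode 1.
result = m_mode_conv(shape, tensor, [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]])
print(result[(0, 0)], result[(1, 2)])
```

Factoring the convolution into $M$ one-dimensional kernels keeps the parameter count proportional to the sum of the kernel lengths rather than their product, which is consistent with the small network size described in the text.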

The beneficial effects of the invention are as follows:

Unlike traditional fine-grained image classification methods, the invention uses click data, which carries richer semantics, to construct the training data, and provides a deep neural network designed specifically for this data. On this basis it provides a fine-grained image classification method based on deep user click features with part-of-speech decomposition (fine-grained image classification with Factorized Deep Click feature, FDC). While achieving high accuracy, the method effectively addresses the semantic gap problem that traditional methods cannot overcome. A further benefit is that the small network architecture is easy to deploy, which makes the method better suited to practical production use. The method achieves excellent results on the Clickture-Dog dataset.

Drawings

Fig. 1 is a schematic diagram of the overall framework and network architecture of the present invention.

FIG. 2 is a comparison of the accuracy achieved by the present invention on the Clickture-Dog dataset with other advanced methods.

Detailed description of the preferred embodiments

The present invention will be described in further detail with reference to the accompanying drawings.

As shown in the framework diagram of FIG. 1, the fine-grained image classification method based on deep user click features with part-of-speech decomposition is implemented as follows:

First, the triplet (i, q, c) is taken as the input of step (1). After step (1), the result shown in the first part of the framework diagram, labeled "Click Counts", is obtained; this is the click vector of an image.

Next, the output of step (1) is taken as the input of step (2). After step (2), the part-of-speech TF-IDF features of an image are obtained, labeled "Factorized Features" in the second part of the diagram.

Finally, the result of step (2) is input into step (3), which yields the required click tensor, as shown in the third part of the diagram. This is also a highlight of the invention.

FIG. 2 compares the accuracy of the invention on the Clickture-Dog dataset with that of other advanced methods. The network structure is shown in the solid black box in FIG. 1, and the specific structure of the convolution module is shown in the dashed box. Only the click tensor obtained in step (3) needs to be input into the network for training. In the training stage, the network parameters are trained with stochastic gradient descent, and the gradients are computed with the backpropagation algorithm; in the test stage, the data is input into the network and the classification result is obtained through a Softmax layer.
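The test-stage Softmax step can be illustrated with a short sketch. It assumes the network has already produced one raw score per subcategory; the score values and function names below are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical network outputs for three subcategories.
scores = [0.1, 2.0, -1.0]
probabilities = softmax(scores)
predicted_class = predict(scores)
print(probabilities, predicted_class)
```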

To fully demonstrate the effectiveness of the invention, it was compared with advanced methods in the current fine-grained image classification field, with convincing results. Among them, LSTM (Long Short-Term Memory) is a representative method for processing text data; VGG (Visual Geometry Group network) is a classical model in the vision field and serves as an important reference; HBP (Hierarchical Bilinear Pooling) and NTS (Navigator-Teacher-Scrutinizer network) are recently proposed models in the field with excellent performance. The results indicate that the invention makes a substantial contribution to fine-grained image classification.

Claims

1. A fine-grained image classification method based on deep user click features with part-of-speech decomposition, comprising the following steps:

Step (1): Constructing the part-of-speech dictionary set

First, word segmentation, stemming, and part-of-speech tagging are performed on the user query texts of the training images; several parts of speech are selected, and for the word set of each part of speech, the words with the highest click counts are selected to construct a dictionary;

second, a set number of keywords are extracted from each dictionary to construct the corpus used to generate TF-IDF features;

Step (2): Constructing the TF-IDF click tensor of the image

First, for any part of speech, the image is represented as a TF-IDF click vector according to the TF-IDF algorithm, using the corresponding word corpus and the user click data; second, the TF-IDF click vectors under the different parts of speech are combined into a TF-IDF click tensor using an outer product operation;

Step (3): Training the deep click network

A relatively shallow click neural network classification model is built from convolutional layers and fully connected layers, each convolutional layer being followed by an activation function and a normalization layer, and the network is trained with stochastic gradient descent;

Step (4): Fine-grained image classification based on the deep click network

The operations of steps (1)-(3) are performed on the image to extract its deep click feature vector, thereby achieving fine-grained image classification based on deep click features;

the part-of-speech dictionary set in step (1) is constructed as follows:

1-1. A click dataset containing $n$ images and $m'$ query texts is used; for any text click record $(x_i, q_j, c_{i,j})$, where $x_i$, $q_j$, and $c_{i,j}$ are respectively the image, the query text, and the number of clicks between that image and that text, word segmentation converts the record into the following word click set:

$$\big((x_i, w_{i,j,1}, c_{i,j}),\ (x_i, w_{i,j,2}, c_{i,j}),\ (x_i, w_{i,j,3}, c_{i,j}),\ \dots\big) \tag{1}$$

1-2. The operation of formula (1) is performed on all text click data, yielding a word click set composed of triplets $(x_i, w_{i,j,k}, c_{i,j})$; word-form reduction (lemmatization), merging of repeated words, and part-of-speech tagging are performed in sequence on the words in the set to obtain a click matrix $C'$ composed of images, words, and the corresponding image-word click counts; the words in $C'$ are divided by part of speech into $M$ mutually disjoint sets; for the $m$-th part-of-speech set, the $\rho_m$ words with the most clicks are selected to form the $m$-th part-of-speech dictionary $D_m$, expressed as:

$$D_m = \{w_1^m, w_2^m, \dots, w_{\rho_m}^m\}$$

the words in $D_m$ constitute the corpus required for generating TF-IDF features under the $m$-th part of speech; $w_j^m$ denotes the $j$-th word in $D_m$.

2. The fine-grained image classification method based on deep user click features of part-of-speech decomposition according to claim 1, wherein the specific construction steps of TF-IDF click tensor of the image in the step (2) are as follows:

2-1. Using the click matrix $C'$ and the part-of-speech dictionary $D_m$ from step 1-2, construct the TF-IDF click vector of image $x_i$ under the $m$-th part of speech; for a word $w_j^m \in D_m$, the corresponding click count is denoted $c_{i,j}^m$, where $c_{i,j}^m$ is the total number of clicks on image $x_i$ under the $m$-th part-of-speech word $w_j^m$; using $C'$ and the TF-IDF algorithm, construct the TF-IDF click vector $v_i^m$ of image $x_i$ under part of speech $m$, whose $j$-th element $v_{i,j}^m$ is the TF-IDF weight of word $w_j^m$ for image $x_i$;

in this definition, $\mathbb{1}(\cdot)$ is the indicator function; $n$ is the total number of images; $\rho'_j$ is the frequency of the $j$-th element over all images, used to compute the inverse document frequency in the TF-IDF algorithm;

2-2. After step 2-1 has been performed for all $M$ parts of speech, TF-IDF click vectors under the $M$ different parts of speech are obtained; the set of these vectors is denoted $V_i$ and is defined as follows:

$$V_i = \{v_i^1, v_i^2, \dots, v_i^M\}$$

the elements of the TF-IDF click tensor $t$ of the image are constructed from $V_i$ as follows, where $v_{i,j_m}^m$ is the $j_m$-th element of $v_i^m$:

$$t_{j_1, j_2, \dots, j_M} = f\big(v_{i,j_1}^1, v_{i,j_2}^2, \dots, v_{i,j_M}^M\big)$$

wherein $f$ is the part-of-speech fusion function, which can be defined as any reasonable fusion operation including product, sum, average, and maximum; the TF-IDF click tensor $t$ is an $M$-mode tensor.

3. The fine-grained image classification method based on deep user click features of part-of-speech decomposition according to claim 2, wherein the specific structure of the network in the step (3) is as follows:

3-1. Network overall structure:

A structure of 4 convolutional layers plus 2 fully connected layers is adopted; the first half of the network consists of the convolutional layers and the second half of the fully connected layers; each convolutional layer is followed by a pooling layer, a BN layer, and a ReLU layer; a Dropout layer with a dropout rate of 0.8 is placed between the two fully connected layers;

in each of the four convolutional layers described in step 3-1, the convolution is an M-mode 1-dimensional convolution, namely a convolution module consisting of $M$ consecutive one-dimensional convolution kernels, where the $m$-th kernel operates on the mode-$m$ unfolding of the click tensor, thereby enabling the network to better adapt to the data.