CN114299343A - Multi-granularity information fusion fine-granularity image classification method and system - Google Patents
- Publication number
- CN114299343A (application CN202111664965.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- granularity
- training
- information fusion
- fine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a multi-granularity information fusion fine-granularity image classification method and system. Input images form an image data set; a triplet is constructed through a global interference module; the triplet is input into a CNN backbone network and trained with a progressive multi-granularity information fusion training strategy to obtain an optimized classification model; and the optimized classification model classifies the input images to obtain the final classification result. The method needs no manual labeling and is realized through two mutually cooperating processes, global interference and progressive multi-granularity information fusion, which enable the network to fuse information of different granularities and thereby find local features with higher discriminability; the method greatly improves recognition accuracy over the baseline network on all data sets.
Description
Technical Field
The invention belongs to the technical field of machine learning and image classification, and particularly relates to a multi-granularity information fusion fine-granularity image classification method and system.
Background
Fine-grained image recognition is a very challenging task in computer vision: it aims to distinguish sub-categories under the same super-category. Unlike general image recognition, different sub-categories share a similar structure and differ only slightly, which results in low inter-class variance among fine-grained categories. In addition, because of uncertain factors such as illumination and occlusion, images within the same category exhibit high intra-class variance. A fine-grained recognition method must therefore accurately locate the subtle differences between sub-categories of the same super-category, which is a challenging task.
Most existing fine-grained recognition methods can be divided into two categories according to the supervision information they use: strongly supervised and weakly supervised recognition. Because the two use different supervision information during training, their algorithms differ greatly. Strongly supervised methods may use additional manual annotations such as bounding boxes beyond the category labels; this extensive labeling is time-consuming and labor-intensive, and distinguishing fine-grained categories requires substantial expert knowledge, which is a serious bottleneck. With the rise of deep learning and transfer learning, weakly supervised methods that require only image-level labels have become mainstream, but most of them depend on weights pre-trained on large-scale annotated data (such as the ImageNet data set).
Self-supervised learning has made great breakthroughs in recent years: it requires no labeled information during pre-training and can achieve results comparable to or even better than supervised learning, so it is gradually becoming a trend. However, self-supervision remains a new paradigm in the field of fine-grained image classification. This patent studies the fine-grained image classification problem under self-supervised learning.
Existing fine-grained image classification methods can be roughly divided into strongly supervised and weakly supervised recognition according to the amount of supervision information used in training. Strongly supervised methods are time-consuming and labor-intensive because they need additional manual labeling, and judging fine-grained categories requires strong expert knowledge. Current mainstream weakly supervised methods extract discriminative information from the training data using only image-level labels; although they obtain good results, most of them rely heavily on weights pre-trained on a large-scale data set (such as ImageNet). However, the ImageNet pre-training objective does not consider the characteristics of the downstream classification task, so the resulting model is suboptimal for fine-grained classification. It is therefore necessary to design a learning method that can successfully learn visual representations of images without manual labeling. In summary, the common problems of fine-grained image methods are as follows:
First, fine-grained image classification requires more expert knowledge than general image classification, and manually labeling such data is cost-prohibitive.
Secondly, most existing fine-grained image classification methods rely on a model pre-trained on a large-scale data set (ImageNet); this model does not consider the characteristics of the downstream task and is therefore suboptimal for fine-grained classification tasks.
Disclosure of Invention
The invention aims to provide a technical scheme for a multi-granularity information fusion fine-granularity image classification method and system that solves one or more technical problems in the prior art, or at least provides a beneficial alternative.
To solve these problems, and based on the characteristics of fine-grained images, namely large intra-class variance and small inter-class variance, the invention provides a multi-granularity information fusion algorithm based on self-supervised contrastive learning for the fine-grained image classification task. The algorithm improves the accuracy of self-supervised fine-grained image classification without using any artificial labels, and greatly narrows the gap between self-supervised and supervised learning in fine-grained image classification without using ImageNet pre-trained weights. The self-supervised algorithm mainly comprises two stages: pre-training and fine-tuning.
In order to achieve the above object, according to an aspect of the present invention, there is provided a multi-granularity information fusion fine-granularity image classification method, including:
s100, inputting N images to form an image data set;
s200, randomly extracting an image from the image data set, generating two randomly cropped pictures from it, and selecting any other image from the data set;
s300, constructing a triple through a global interference module;
s400, inputting the triplet [a, p, n] into the CNN backbone network for training through the progressive multi-granularity information fusion training strategy to obtain an optimized classification model;
and S500, classifying the input images through the optimized classification model to obtain a final classification result.
Given an unlabeled input image data set batch = {x1, x2, …, xN} containing N images, for a certain picture c in each batch during training, two randomly cropped pictures of c are first generated through random data augmentation and defined as c1, c2, and any other picture in the batch is defined as c3;
The triplet [a, p, n] is constructed by means of the global interference module, defined as f(·), where the anchor sample is recorded as a = f(c1), the positive sample as p = f(c2), and the negative sample as n = f(c3). For the triplet [a, p, n], the anchor sample a and the positive sample p form a positive pair, and the anchor sample a and the negative sample n form a negative pair. Because the positive pair comes from the same image and the negative pair comes from different images, the positive pair carries similar semantic content or visual features;
the triplet [a, p, n] is input into the CNN backbone network for training; the features obtained are projected into a D-dimensional embedding space through a multilayer perceptron (MLP) and then L2-normalized, and finally the contrastive loss is computed for back-propagation. The pre-training process does not require any manually labeled information;
wherein the training follows the progressive multi-granularity information fusion training strategy;
the methods used in the pre-training process and the computation of the contrastive loss are described in detail below.
(1) Global interference module
Since the inter-class differences of fine-grained images are small, different fine-grained classes in most cases share similar global information and differ only in local details. We therefore design a global interference module to generate the positive and negative sample pairs; by destroying the global structure, it makes the neural network focus better on local discriminative features.
Specifically, the global interference module works as follows: the image is divided into M × M sub-regions, each sub-region is then randomly rearranged with equal probability, and the sub-regions are merged into a new image. M is a hyper-parameter that controls the granularity of the information generated; setting different values of M yields images carrying different granularity information.
Through this global interference, the global semantic information of the image is destroyed, forcing the neural network to rely more on local semantic information for discrimination; this makes the network more sensitive to local features without requiring an accurate bounding box.
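As an illustration of the partition-and-shuffle operation described above, the following stand-alone sketch implements it on a toy square "image" stored as nested Python lists; the function name, the toy data, and the seeding are assumptions for demonstration, not part of the patent.

```python
import random

def global_interference(image, m, seed=None):
    """Divide a square image (a list of rows) into m x m sub-regions,
    randomly permute the sub-regions with equal probability, and merge
    them into a new image. Assumes the side length is divisible by m."""
    size = len(image)
    block = size // m
    # Collect the m*m sub-regions in row-major order.
    regions = [
        [row[c * block:(c + 1) * block] for row in image[r * block:(r + 1) * block]]
        for r in range(m) for c in range(m)
    ]
    random.Random(seed).shuffle(regions)  # every permutation equally likely
    # Stitch the shuffled sub-regions back into one image.
    out = [[0] * size for _ in range(size)]
    for idx, region in enumerate(regions):
        r, c = divmod(idx, m)
        for i in range(block):
            for j in range(block):
                out[r * block + i][c * block + j] = region[i][j]
    return out

# Toy 4x4 "image" with distinct pixel values; m controls the granularity.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled = global_interference(img, m=2, seed=0)
```

With m = 1 the image is returned unchanged (a single region), while larger m destroys more of the global structure, matching the role of the hyper-parameter M above.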
(2) Progressive multi-granularity information fusion training strategy
Since the intra-class variance of fine-grained images is large, the discriminative information of a fine-grained class naturally exists at different visual granularities, and local information at a single granularity cannot sufficiently reveal the discriminative regions. To make the network fully use the information shared among different granularities and thus better discover discriminative information, a simple progressive multi-granularity information fusion training strategy is designed. This strategy cooperates with the global interference module: the module encourages the network to learn information at one specific granularity, while the progressive strategy encourages the network to fuse information across granularities, so that information at different granularities cooperates and the influence of large intra-class variance is avoided.
Specifically, the progressive multi-granularity information fusion training strategy divides the pre-training process uniformly into S stages, and in each stage the images carrying one kind of granularity information, generated automatically by the global interference module, are fed to the CNN backbone network. The emphasis of each stage is to let the network learn information at a certain granularity. The training method resembles reinforcement learning: at the end of each training stage, the parameters trained in the current stage are transferred to the next stage as its initialization. This transfer essentially lets the network mine information at a further granularity on the basis of the regions learned in the previous stage, fully exploring the complementary relation between different granularity information.
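The staged training just described can be sketched as follows; the stage function, the particular granularity schedule [1, 2, 4, 8], and all names are illustrative assumptions, with a toy stand-in replacing the real contrastive-training step.

```python
def progressive_pretrain(granularities, train_stage, init_params):
    """Divide pre-training into S uniform stages; each stage trains on
    images of one granularity and passes its parameters on as the
    initialization of the next stage."""
    params = init_params
    history = []
    for m in granularities:
        params = train_stage(params, m)  # inherits the previous stage's weights
        history.append((m, params))
    return params, history

# Toy stand-in for one contrastive-training stage: it merely records the
# granularities the parameters have been exposed to so far.
def toy_stage(params, m):
    return params + [m]

final, hist = progressive_pretrain([1, 2, 4, 8], toy_stage, [])
```

The schedule [1, 2, 4, 8] is only one possible ordering; the patent leaves the choice of S and of the per-stage M values open.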
(4) Multi-layer perceptron (MLP)
Before the target loss function is computed, an MLP-based nonlinear projection is applied so that the invariant features of each input image can be identified, maximizing the network's ability to recognize different transformations of the same image. The MLP uses two fully connected layers; it learns the nonlinear information of the data, strengthens the features learned by the backbone network, and through this step obtains the feature information common to data of the same class.
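A minimal sketch of such a projection head follows, assuming toy hand-written weights in place of learned ones (the real head would be trained end to end with the backbone):

```python
import math

def mlp_project(x, w1, w2):
    """Two fully connected layers with a ReLU in between, followed by L2
    normalization, projecting a backbone feature into the embedding space."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    z = [sum(w * hi for w, hi in zip(row, h)) for row in w2]
    norm = math.sqrt(sum(v * v for v in z)) or 1.0  # guard against zero vectors
    return [v / norm for v in z]

# Hypothetical weights: 3-d backbone feature -> 4-d hidden -> 2-d embedding.
w1 = [[0.5, -0.2, 0.1]] * 4
w2 = [[0.3] * 4, [-0.1] * 4]
z = mlp_project([1.0, 2.0, 0.5], w1, w2)
```

The L2 normalization at the end places every embedding on the unit sphere, so the dot products used by the contrastive loss below behave like cosine similarities.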
(5) Loss calculation
First, NCE-Loss is selected. The purpose of NCE-Loss is to pull together, in the latent space, positive pairs with strong similarity and to push apart negative pairs. The NCE-Loss formula is as follows:

L_q = -log[ exp(q·k+/τ) / ( exp(q·k+/τ) + Σ_{k-} exp(q·k-/τ) ) ]

where q denotes the query sample, k+ the corresponding positive sample, k- the other negative samples, and τ is a hyper-parameter used to control the distance distribution;

The feature triplet (z_a, z_p, z_n) obtained by encoding the triplet with the CNN backbone is inserted into NCE-Loss, and with the hyper-parameter τ set to 1, NCE-Loss simplifies to the following form:

L = -log[ exp(z_a·z_p) / ( exp(z_a·z_p) + exp(z_a·z_n) ) ]
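The simplified triplet form of the loss can be checked numerically with a small stand-alone sketch (the embeddings below are made-up unit-style vectors, not outputs of any real network):

```python
import math

def nce_loss(za, zp, zn, tau=1.0):
    """Triplet form of the contrastive loss: pull the anchor/positive pair
    together and push the anchor/negative pair apart in embedding space."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pos = math.exp(dot(za, zp) / tau)
    neg = math.exp(dot(za, zn) / tau)
    return -math.log(pos / (pos + neg))

# An anchor near the positive and far from the negative gives a small loss;
# swapping the roles of positive and negative gives a larger one.
za, zp, zn = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
low = nce_loss(za, zp, zn)
high = nce_loss(za, zn, zp)
```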
2. Fine-tuning process
After the pre-training phase is completed, we migrate the pre-trained model weights to the downstream task for fine-tuning. The fine-tuning process follows common image classification practice: the pre-trained weights serve as the initialization, the model is optimized with a cross-entropy loss to obtain the optimized classification model, and the input images are then classified by the optimized classification model to obtain the final classification result.
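The cross-entropy objective used during fine-tuning can be sketched for a single sample as follows (the logits are invented values for illustration):

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, the objective used when
    fine-tuning the pre-trained model on the labeled data set."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # numerically stable softmax
    return -math.log(exps[label] / sum(exps))

# A confident correct prediction costs little; a wrong one costs much more.
good = cross_entropy([4.0, 0.5, 0.2], label=0)
bad = cross_entropy([4.0, 0.5, 0.2], label=2)
```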
The invention also provides a multi-granularity information fusion fine-granularity image classification system, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it realizes the steps of the multi-granularity information fusion fine-granularity image classification method above. The system can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers; the runnable system may include, but is not limited to, a processor, a memory, and a server cluster. When the processor executes the computer program, it runs in the following units of the system:
an image input unit for inputting N images to constitute an image data set;
the image selecting unit is used for randomly extracting an image from the image data set, generating two randomly cropped pictures from it, and selecting any other image from the data set;
the global interference unit is used for constructing a triple through a global interference module;
the progressive training unit is used for inputting the triples [ a, p, n ] into the CNNs backbone network for training to obtain an optimized classification model through a progressive multi-granularity information fusion training strategy;
and the image classification unit is used for classifying the input images through the optimized classification model so as to obtain a final classification result.
The invention has the following beneficial effects: the multi-granularity information fusion fine-granularity image classification method and system need no manual labeling and adopt two mutually cooperating processes, global interference and progressive multi-granularity information fusion, which enable the network to fuse information of different granularities and thereby find local features with better discriminability. Extensive experiments on the classical fine-grained classification data sets CUB, Stanford Cars, and Aircraft show that the method greatly improves recognition accuracy over the baseline network on all data sets, and the results on Stanford Cars and Aircraft even exceed those of ImageNet supervised learning.
Drawings
The above and other features of the present invention will become more apparent from the detailed description of its embodiments with reference to the attached drawings, in which like reference numerals designate the same or similar elements. The drawings in the following description are merely exemplary of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort, wherein:
fig. 1 is a structural diagram of a multi-granularity information fusion fine-granularity image classification system.
Detailed Description
The conception, specific structure, and technical effects of the present invention will be described clearly and completely in conjunction with the embodiments and the accompanying drawings so that its objects, schemes, and effects are fully understood. It should be noted that, without conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
As shown in fig. 1, positive and negative samples must be constructed for each input image. The algorithm constructs positive samples carrying different granularity information for each image through the global interference module, while treating the other images as its negative samples. Compared with the original input image, a positive sample is a transformation of the original image whose global information is destroyed but whose local information is retained, so the original image and the positive sample form a positive pair. Negative samples are transformations of other images, so the original image forms a series of negative pairs with them. The positive and negative samples of different granularities generated by the global interference module are then fed into an encoder and pre-trained in equally divided stages using the multi-granularity information fusion training mode; the weights trained in one stage serve as the initialization of the next stage, and this incremental learning process lets the network fuse information of different granularities, effectively alleviating the problem of large intra-class variance. The whole process uses no manual labeling information. Finally, the trained model is transferred to the fine-grained data set for fine-tuning to obtain the final classification result.
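The construction of one triplet described above can be sketched end to end; here images are toy 1-d lists, the "crop" simply drops one element, and the global-interference transform f(·) is replaced by a placeholder, all of which are illustrative assumptions:

```python
import random

def random_crop(image, rng):
    """Toy 'random crop': drop one element of a 1-d 'image' at random."""
    k = rng.randrange(len(image))
    return image[:k] + image[k + 1:]

def build_triplet(batch, idx, perturb, seed=0):
    """Build one [anchor, positive, negative] triplet: two random crops of
    image idx give the anchor and positive, a crop of any other image in
    the batch gives the negative; all three pass through the
    global-interference transform (here the 'perturb' argument)."""
    rng = random.Random(seed)
    c = batch[idx]
    c1, c2 = random_crop(c, rng), random_crop(c, rng)
    other = rng.choice([i for i in range(len(batch)) if i != idx])
    c3 = random_crop(batch[other], rng)
    return perturb(c1), perturb(c2), perturb(c3)

# Placeholder perturbation standing in for the block-shuffle module f(.).
batch = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
a, p, n = build_triplet(batch, 0, perturb=lambda x: sorted(x, reverse=True))
```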
Given an unlabeled input image data set batch = {x1, x2, …, xN} containing N images, for a certain picture c in each batch during training, two randomly cropped pictures of c are first generated through random data augmentation and defined as c1, c2, and any other picture in the batch is defined as c3;
The triplet [a, p, n] is constructed by means of the global interference module, defined as f(·), where the anchor sample is recorded as a = f(c1), the positive sample as p = f(c2), and the negative sample as n = f(c3). For the triplet [a, p, n], the anchor sample a and the positive sample p form a positive pair, and the anchor sample a and the negative sample n form a negative pair. Because the positive pair comes from the same image and the negative pair comes from different images, the positive pair carries similar semantic content or visual features;
the triplet [a, p, n] is input into the CNN backbone network for training; the features obtained are projected into a D-dimensional embedding space through a multilayer perceptron (MLP) and then L2-normalized, and finally the contrastive loss is computed for back-propagation. The pre-training process does not require any manually labeled information;
wherein the training follows the progressive multi-granularity information fusion training strategy;
the methods used in the pre-training process and the computation of the contrastive loss are described in detail below.
(1) Global interference module
Since the inter-class differences of fine-grained images are small, different fine-grained classes in most cases share similar global information and differ only in local details. We therefore design a global interference module to generate the positive and negative sample pairs; by destroying the global structure, it makes the neural network focus better on local discriminative features.
Specifically, the global interference module works as follows: the image is divided into M × M sub-regions, each sub-region is then randomly rearranged with equal probability, and the sub-regions are merged into a new image. M is a hyper-parameter that controls the granularity of the information generated; setting different values of M yields images carrying different granularity information. Preferably, M takes values in [1, 8].
Through this global interference, the global semantic information of the image is destroyed, forcing the neural network to rely more on local semantic information for discrimination; this makes the network more sensitive to local features without requiring an accurate bounding box.
(2) Progressive multi-granularity information fusion training strategy
Since the intra-class variance of fine-grained images is large, the discriminative information of a fine-grained class naturally exists at different visual granularities, and local information at a single granularity cannot sufficiently reveal the discriminative regions. To make the network fully use the information shared among different granularities and thus better discover discriminative information, a simple progressive multi-granularity information fusion training strategy is designed. This strategy cooperates with the global interference module: the module encourages the network to learn information at one specific granularity, while the progressive strategy encourages the network to fuse information across granularities, so that information at different granularities cooperates and the influence of large intra-class variance is avoided.
Specifically, the progressive multi-granularity information fusion training strategy divides the pre-training process uniformly into S stages, and in each stage the images carrying one kind of granularity information, generated automatically by the global interference module, are fed to the CNN backbone network. The emphasis of each stage is to let the network learn information at a certain granularity. The training method resembles reinforcement learning: at the end of each training stage, the parameters trained in the current stage are transferred to the next stage as its initialization. This transfer essentially lets the network mine information at a further granularity on the basis of the regions learned in the previous stage, fully exploring the complementary relation between different granularity information.
(4) Multi-layer perceptron (MLP)
Before the target loss function is computed, an MLP-based nonlinear projection is applied so that the invariant features of each input image can be identified, maximizing the network's ability to recognize different transformations of the same image. The MLP uses two fully connected layers; it learns the nonlinear information of the data, strengthens the features learned by the backbone network, and through this step obtains the feature information common to data of the same class.
(5) Loss calculation
First, NCE-Loss is selected. The purpose of NCE-Loss is to pull together, in the latent space, positive pairs with strong similarity and to push apart negative pairs. The NCE-Loss formula is as follows:

L_q = -log[ exp(q·k+/τ) / ( exp(q·k+/τ) + Σ_{k-} exp(q·k-/τ) ) ]

where q denotes the query sample, k+ the corresponding positive sample, k- the other negative samples, and τ is a hyper-parameter used to control the distance distribution;

The feature triplet (z_a, z_p, z_n) obtained by encoding the triplet with the CNN backbone is inserted into NCE-Loss, and with the hyper-parameter τ set to 1, NCE-Loss simplifies to the following form:

L = -log[ exp(z_a·z_p) / ( exp(z_a·z_p) + exp(z_a·z_n) ) ]
2. Fine-tuning process
After the pre-training phase is completed, we migrate the pre-trained model weights to the downstream task for fine-tuning. The fine-tuning process follows common image classification practice: the pre-trained weights serve as the initialization, the model is optimized with a cross-entropy loss to obtain the optimized classification model, and the input images are then classified by the optimized classification model to obtain the final classification result.
An embodiment of the present invention provides a multi-granularity information fusion fine-granularity image classification system; fig. 1 shows its structural diagram. The system of this embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-described multi-granularity information fusion fine-granularity image classification method embodiment when executing the computer program.
The system comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run in the units of the following system:
an image input unit for inputting N images to constitute an image data set;
the image selecting unit is used for randomly extracting an image from the image data set, generating two randomly cropped pictures from it, and selecting any other image from the data set;
the global interference unit is used for constructing a triple through a global interference module;
the progressive training unit is used for inputting the triples [ a, p, n ] into the CNNs backbone network for training to obtain an optimized classification model through a progressive multi-granularity information fusion training strategy;
and the image classification unit is used for classifying the input images through the optimized classification model so as to obtain a final classification result.
The multi-granularity information fusion fine-granularity image classification system can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud servers. The runnable system may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that this example is merely an example of the system and does not constitute a limitation of it; the system may include more or fewer components than shown, or combine certain components, or use different components. For example, it may further include input/output devices, network access devices, a bus, etc.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the multi-granularity information fusion fine-granularity image classification system and uses various interfaces and lines to connect the parts of the whole runnable system.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the multi-granularity information fusion fine-granularity image classification system by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.
While the present invention has been described in considerable detail with reference to a few illustrative embodiments, it is not intended to be limited to any such details or embodiments or to any particular embodiment; the appended claims are to be construed, in view of the prior art, as effectively covering the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the invention, not presently foreseen, may nonetheless represent equivalents thereto.
Claims (8)
1. A multi-granularity information fusion fine-granularity image classification method is characterized by comprising the following steps:
s100, inputting N images to form an image data set;
S200, randomly extracting an image from the image data set, randomly cropping 2 pictures from the extracted image, and taking any other image from the image data set;
S300, constructing a triplet [a, p, n] through a global interference module;
S400, inputting the triplet [a, p, n] into a CNNs backbone network for training through a progressive multi-granularity information fusion training strategy to obtain an optimized classification model;
and S500, classifying the input images through the optimized classification model to obtain a final classification result.
2. The method for classifying the multi-granularity information fusion fine-granularity image according to claim 1, wherein in S200, the method for randomly extracting an image from the image data set and randomly cropping 2 pictures from it is as follows: given an unlabeled input image data set batch containing N images {x1, x2, …, xN}, for a certain picture c in each batch during training, the two pictures randomly cropped from c after random data augmentation are defined as c1 and c2, and any other picture in the batch is defined as c3.
3. The method for classifying the multi-granularity information fusion fine-granularity image according to claim 2, wherein in S300, the method for constructing the triplet through the global interference module is as follows: the triplet [a, p, n] is constructed through the global interference module, wherein the anchor sample is denoted a = f(c1), the positive sample is denoted p = f(c2), and the negative sample is denoted n = f(c3). In the triplet [a, p, n], the anchor sample a and the positive sample p form a positive sample pair, and the anchor sample a and the negative sample n form a negative sample pair; since the positive sample pair comes from the same image and the negative sample pair comes from different images, the positive sample pair carries similar semantic content or visual features. The global interference module is denoted f(·).
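As a sketch of the batch-level construction in claims 2 and 3, the following assumes an identity-like disturbance module can be passed in as `f`; `random_crop` is a hypothetical stand-in for the unspecified random crop augmentation:

```python
import numpy as np

def random_crop(image, size, rng):
    # Hypothetical stand-in for the unspecified random crop augmentation.
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

def make_triplet(batch, idx, f, size, rng=None):
    """Build one triplet [a, p, n]: a = f(c1) and p = f(c2) are two random
    crops of batch[idx]; n = f(c3) is a crop of any other image in the batch."""
    rng = rng or np.random.default_rng()
    c1 = random_crop(batch[idx], size, rng)
    c2 = random_crop(batch[idx], size, rng)
    other = int(rng.choice([i for i in range(len(batch)) if i != idx]))
    c3 = random_crop(batch[other], size, rng)
    return f(c1), f(c2), f(c3)
```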
4. The method for classifying the multi-granularity information fusion fine-granularity image according to claim 3, wherein in S400, the method for inputting the triplet [a, p, n] into the CNNs backbone network for training through the progressive multi-granularity information fusion training strategy to obtain the optimized classification model is as follows: the triplet [a, p, n] is input into the CNNs backbone network for training; the trained features are projected to a D-dimensional embedding space through a multi-layer perceptron (MLP) and then L2-normalized; finally, the contrastive loss is calculated for back-propagation. The pre-training process does not require any manually labeled information such as labels, and the training follows the progressive multi-granularity information fusion training strategy.
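The projection-and-normalization step of claim 4 can be sketched in numpy as below. The two-layer shape and ReLU activation are assumptions; the claim only specifies an MLP projection to D dimensions followed by L2 normalization:

```python
import numpy as np

def project_and_normalize(features, w1, w2):
    """Project backbone features to a D-dimensional embedding with a
    two-layer MLP, then L2-normalize so embeddings lie on the unit sphere."""
    hidden = np.maximum(features @ w1, 0.0)   # ReLU hidden layer
    z = hidden @ w2                           # D-dimensional embedding
    return z / np.linalg.norm(z, axis=-1, keepdims=True)
```

After normalization, the inner product of two embeddings equals their cosine similarity, which is what the contrastive loss compares.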
5. The method for classifying the multi-granularity information fusion fine-granularity image according to claim 3, wherein in S300, the specific method of the global interference module is as follows: the image is divided into M × M sub-regions, and the sub-regions are then randomly permuted with equal probability and recombined into a new image; M is a hyper-parameter that controls the generation of images with different granularity information, and images with different granularity information can be obtained by setting different values of M.
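The M × M block shuffle of claim 5 can be written as the following sketch; it assumes, for simplicity, that the image height and width are divisible by M:

```python
import numpy as np

def jigsaw_shuffle(image, m, rng=None):
    """Split an H x W x C image into an m x m grid of sub-regions, permute
    the sub-regions uniformly at random, and reassemble them into a new
    image of the same size (the global interference of claim 5)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[0] // m, image.shape[1] // m
    # Collect the m*m sub-regions in row-major order.
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(m) for j in range(m)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[k] for k in order[i * m:(i + 1) * m]], axis=1)
            for i in range(m)]
    return np.concatenate(rows, axis=0)
```

A small M keeps large coherent regions (coarse granularity), while a large M destroys global layout and forces the network to rely on local detail (fine granularity).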
6. The method of claim 4, wherein in S400, the progressive multi-granularity information fusion training strategy divides the pre-training process evenly into a plurality of stages, and in each stage the images with different granularity information automatically generated by the global interference module are fed to the CNNs backbone network; at the end of each training stage, the parameters learned in the current stage are passed to the next training stage as its initialization.
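Claim 6 reduces to a simple loop over granularities; `train_stage` below is a hypothetical callback standing in for one stage of CNN pre-training on globally disturbed images, and the example granularity schedule is an assumption:

```python
def progressive_pretrain(granularities, params, train_stage):
    """Progressive multi-granularity pre-training: one stage per
    granularity M, with each stage's learned parameters initializing
    the next stage."""
    for m in granularities:        # e.g. coarse-to-fine, such as [8, 4, 2, 1]
        params = train_stage(params, m)
    return params
```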
7. The method of claim 4, wherein in S400, the loss function of CNNs backbone network is lc:
Substituting the CNNs-encoded feature triplet (za, zp, zn) of the triplet [a, p, n] into the loss function, the loss function takes the following form:
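The formula itself does not survive in this text. A standard contrastive form consistent with the surrounding description (a feature triplet, L2-normalized embeddings, a loss computed over positive and negative pairs) would be the following; this reconstruction is an assumption with an assumed temperature τ, not the patent's published equation:

```latex
% Assumed contrastive form over the encoded triplet (z_a, z_p, z_n),
% with similarity s(u, v) = u^{\top} v / (\|u\| \, \|v\|) and temperature \tau:
l_c = -\log \frac{\exp\big(s(z_a, z_p)/\tau\big)}
                 {\exp\big(s(z_a, z_p)/\tau\big) + \exp\big(s(z_a, z_n)/\tau\big)}
```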
8. A multi-granularity information fusion fine-granularity image classification system, comprising: a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the processor, when executing the computer program, implements the steps of the multi-granularity information fusion fine-granularity image classification method of claim 1, and wherein the multi-granularity information fusion fine-granularity image classification system can run on computing devices such as desktop computers, notebooks, palmtop computers and cloud data centers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111664965.7A CN114299343A (en) | 2021-12-31 | 2021-12-31 | Multi-granularity information fusion fine-granularity image classification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111664965.7A CN114299343A (en) | 2021-12-31 | 2021-12-31 | Multi-granularity information fusion fine-granularity image classification method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114299343A true CN114299343A (en) | 2022-04-08 |
Family
ID=80972653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111664965.7A Pending CN114299343A (en) | 2021-12-31 | 2021-12-31 | Multi-granularity information fusion fine-granularity image classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114299343A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035389A (en) * | 2022-08-10 | 2022-09-09 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115035389B (en) * | 2022-08-10 | 2022-10-25 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN116188916A (en) * | 2023-04-17 | 2023-05-30 | 杰创智能科技股份有限公司 | Fine granularity image recognition method, device, equipment and storage medium |
CN116452896A (en) * | 2023-06-16 | 2023-07-18 | 中国科学技术大学 | Method, system, device and medium for improving fine-grained image classification performance |
CN116452896B (en) * | 2023-06-16 | 2023-10-20 | 中国科学技术大学 | Method, system, device and medium for improving fine-grained image classification performance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bansal et al. | An efficient technique for object recognition using Shi-Tomasi corner detection algorithm | |
Taherkhani et al. | Deep-FS: A feature selection algorithm for Deep Boltzmann Machines | |
Maji et al. | Efficient classification for additive kernel SVMs | |
He et al. | Supercnn: A superpixelwise convolutional neural network for salient object detection | |
Kishore et al. | Indian classical dance action identification and classification with convolutional neural networks | |
CN114299343A (en) | Multi-granularity information fusion fine-granularity image classification method and system | |
US8606022B2 (en) | Information processing apparatus, method and program | |
CN111680678B (en) | Target area identification method, device, equipment and readable storage medium | |
CN110781856A (en) | Heterogeneous face recognition model training method, face recognition method and related device | |
Al-Garaawi et al. | BRIEF-based face descriptor: an application to automatic facial expression recognition (AFER) | |
Raj et al. | Optimal feature selection and classification of Indian classical dance hand gesture dataset | |
Chakraborty et al. | Application of daisy descriptor for language identification in the wild | |
Tang et al. | Learning extremely shared middle-level image representation for scene classification | |
Hameed et al. | Content based image retrieval based on feature fusion and support vector machine | |
Qi et al. | Supervised deep semantics-preserving hashing for real-time pulmonary nodule image retrieval | |
Du et al. | MIL-SKDE: Multiple-instance learning with supervised kernel density estimation | |
Che et al. | Image retrieval by information fusion based on scalable vocabulary tree and robust Hausdorff distance | |
Do et al. | ImageNet challenging classification with the Raspberry Pis: a federated learning algorithm of local stochastic gradient descent models | |
US10534980B2 (en) | Method and apparatus for recognizing object based on vocabulary tree | |
Huang et al. | Image retrieval based on ASIFT features in a Hadoop clustered system | |
US20210209473A1 (en) | Generalized Activations Function for Machine Learning | |
CN111767710B (en) | Indonesia emotion classification method, device, equipment and medium | |
Dalara et al. | Entity Recognition in Indian Sculpture using CLAHE and machine learning | |
Kejriwal et al. | Multi instance multi label classification of restaurant images | |
Ali et al. | Context awareness based Sketch-DeepNet architecture for hand-drawn sketches classification and recognition in AIoT |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||