CN111860672B

CN111860672B - Fine-grained image classification method based on block convolutional neural network

Info

Publication number: CN111860672B
Application number: CN202010738474.1A
Authority: CN
Inventors: 马占宇; 谢吉洋; 杜若一; 司中威
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2021-03-16
Anticipated expiration: 2040-07-28
Also published as: CN111860672A

Abstract

A fine-grained image classification method based on a block convolutional neural network relates to the technical field of fine-grained image identification, solves the problem that the existing method inputs an original image into the convolutional neural network after being averagely blocked for fine-grained image classification, and has weak reception field limitation. The invention limits the convolution receptive field according to the requirement, so that the network focuses more on the characteristics of the local area and is more suitable for the fine-grained image classification task. The fine-grained image classification method limits the receptive field range of the convolutional layer on the premise of not introducing more parameters, so that a convolutional neural network can search a smaller discriminative local area.

Description

Fine-grained image classification method based on block convolutional neural network

Technical Field

The invention relates to the technical field of fine-grained image recognition, in particular to a fine-grained image classification method based on a block convolutional neural network.

Background

In the technical Field of fine-grained image recognition, most of the existing methods based on artificial intelligence and deep learning directly input images into a Convolutional Neural Network (CNN), a Feature Map is extracted from an output Feature Map (Feature Map) of a previous layer through multilayer convolution and pooling layer operation, a Feature Map with a larger Receptive Field (RF), namely a range in which each Feature point on the Feature Map is mapped onto an input image, is extracted layer by layer, and finally a Feature Map with the Receptive Field being the size of the whole image (the theoretical Receptive Field may be larger than the size of the whole image) is obtained and used for fine-grained image classification. However, most existing methods are mainly used to identify the type of object in the image, such as different colored wings and different shaped beaks in birds, different shaped lights and tires in automobiles, by finding discriminative local areas on the image. In this case, the smaller field of view enables the model to better extract local features on the image, and thus to search for smaller discriminative local regions. However, the existing convolutional neural network framework mainly introduces operations with higher complexity and larger parameter amount, but still has difficulty in limiting the receptive field size of the convolutional layer.

Fine-Grained Visual Classification (FGVC) is a sub-task of the traditional image Classification task, which refers to more refined Classification of objects of a certain class, for example: distinguish different kinds of birds or dogs, different models of automobiles or airplanes, and the like. The fine-grained degree is more challenging than the traditional classification task because the difference between the target object and the objects of the same category may be larger than the difference between the target object and the objects of different categories, for example, two birds of the same category may have a great difference due to different postures; however, two birds of different species may have differences in structure and texture only in local areas such as the beak and the tail of a bird due to their close physical statures.

With the development of deep learning, CNN has become a mainstream solution for the task of image classification. CNN is mainly composed of the following parts: (1) a convolutional layer for feature extraction; (2) the pooling layer is used for feature selection and information filtering; (3) and the full connection layer is used for carrying out nonlinear combination on the extracted features to obtain the final output. In CNN, the concept of RF refers to a range in which a feature point on an output feature map of a specified layer is mapped onto an input picture, and both a convolutional layer and a pooling layer have the effect of increasing the receptive field, and the receptive field relationship between adjacent layers of a network is calculated in the following manner:

wherein r is^(l)The reception field, k, of the first layer of the convolutional or pooling layer^(l)Refers to the kernel size, s, of the first convolutional or pooling layer^(l′)Refers to the step size of the i' th convolutional or pooling layer.

The existing fine-grained classification methods are mainly divided into two types: (1) based on the local positioning method, the convolutional neural network is required to be used for extracting features, a plurality of discriminative areas are found, the areas are cut from an original image, and feature extraction and classification operations are respectively executed, so that the prediction time is long; in addition, the number of regions for classification is mostly set in advance in the method, and the flexibility of the model is greatly limited. (2) Most of methods based on end-to-end feature coding generate a high-dimensional vector before a full connection layer to improve the model expression capability to adapt to a fine-grained classification task. The extra computation amount brought by the excessively high dimensionality greatly limits the model efficiency.

For a traditional convolutional neural network, the general receptive field is very large, and for a general image classification task, the model can be judged according to information in a larger range; however, for fine-grained tasks, too large a reception field increases the influence of intra-class differences on the network, making it difficult to focus on local details.

The existing document, "fine-grained visual classification based on jigsaw and progressive multi-grained learning" is a method that an original image is averagely blocked, disturbed and blocked and then directly input into a convolutional neural network for fine-grained image classification, and is different in that (1) the method only blocks the original image; (2) the method limits the receptive field by a method of disturbing the blocks, and the limitation is weaker.

Disclosure of Invention

The invention provides a fine-grained image classification method based on a block convolutional neural network, aiming at solving the problem that the existing method has weak receptive field limitation when an original image is input into the convolutional neural network after being averagely blocked for fine-grained image classification.

A fine-grained image classification method based on a block convolutional neural network is characterized in that the block convolutional neural network is set to have L block convolutional layers, wherein L is the number of layers of the current block convolutional layers, L is more than or equal to 1 and is less than or equal to L, and the initialization is that L is equal to 1; the method is realized by the following steps:

step one, for the first block convolution layer f (·; omega)^(l)) Obtaining its input characteristic diagram as x^(l)(ii) a The above-mentioned

For the convolution kernel parameters, R represents a real number, c^(l)For inputting the number of channels of the feature map, c^(l+1)For the number of channels of the output profile,. the input of the representation function,

and

for each convolution kernel width and height;

the dimension of expression is

A real matrix of (d);

the input feature map

Is the output characteristic diagram, x, of the (l-1) th block convolutional layer⁽¹⁾As model input, W^(l)And H^(l)Width and height of the input feature map;

step two, when l is equal to 1, setting m₁＝n ₁1 is ═ 1; when l > 1, the number of blocks m per line on the input feature map is calculated by the following formula_lAnd the number of blocks per column n_l：

In the formula (I), the compound is shown in the specification,

and

are respectively an input feature map x^(l)Has a width and a height of a theoretical receptive field, and

and

is the contraction factor of the theoretical receptive field in the width and height dimensions,

and

step sizes of convolution kernels of the l' th layer of block convolution layer in the width and height dimensions of the feature map respectively,

the operation of rounding up is carried out;

step three, according to the number m of blocks in each row and each column on the input feature map obtained in the step two_lAnd n_lRandomly sampling to obtain the width of the feature map block

And height

And is

i＝1，…，m_lAnd

j＝1，…，n_lare all positive integers, and are not limited to the integer,

step four, dividing the block width according to the characteristic diagram obtained in the step three

And height

Feature map x to be input^(l)Is divided into m_l×n_lBlock, obtaining a set of block feature maps

Step five, adopting the convolution kernel parameter omega in the step one^(l)Respectively comparing all obtained in step four

Performing convolution to obtain corresponding convolution output characteristic diagram

Step six, the convolution output characteristic diagram obtained in the step five is used

Splicing according to the original position to obtain the output characteristic diagram of the first convolution layer in the block convolution neural network

Step seven, for the L block convolution layers, the operation is carried out according to the steps from one step to six until the output characteristic diagram x of the last L block convolution layer is obtained^(L+1)X is to be^(L+1)Inputting the data into a full connection layer to obtain the output probability p ∈ R of fine-grained image classificationⁿAnd n is the number of categories, so that the classification of fine-grained images is realized.

The invention has the beneficial effects that: the fine-grained image classification method can limit the experience field of convolution according to requirements, enables the network to pay more attention to the characteristics of local areas, and is more suitable for being applied to fine-grained image classification tasks. Meanwhile, additional parameters and operation are not introduced, and the high efficiency of the general convolutional neural network can be reserved in the prediction process.

The fine-grained image classification method does not need the characteristic of overlarge receptive field to divide the input characteristic graph into blocks, and each block is spliced again after being respectively subjected to convolution operation, so that the method has strong limitation.

The fine-grained image classification method limits the receptive field range of the convolutional layer on the premise of not introducing more parameters, so that a convolutional neural network can search a smaller discriminative local area.

Drawings

Fig. 1 is a flowchart of a fine-grained image classification method based on a block convolutional neural network according to the present invention.

FIG. 2 is a schematic diagram of a fine-grained image classification method based on a block convolutional neural network according to the present invention, which is expressed in m_l＝n_lTake 4 as an example.

FIG. 3 is a diagram of a second embodiment of a fine-grained image classification method based on a block convolutional neural network according to the present invention, where m is the number_l＝n_lTake 4 as an example.

Detailed Description

In a first specific embodiment, the first embodiment is described with reference to fig. 1 and fig. 2, a fine-grained image classification method based on a block convolutional neural network is provided, where the block convolutional neural network has L block convolutional layers, L is the number of layers of the current block convolutional layer, L is greater than or equal to 1 and less than or equal to L, and is initialized to L is 1; the method is realized by the following steps:

step one, for the first block convolution layer f (·; omega)^(l)) ObtainingThe input characteristic diagram is x^(l)(ii) a "·" denotes an input of a function, and may be denoted by "·" when the input is uncertain. The above-mentioned

For the convolution kernel parameters, R is a real number,

the dimension of expression is

Real matrix of (d) for representing Ω^(l)The size of (d); c. C^(l)For inputting the number of channels of the feature map, c^(l+1)In order to output the number of channels of the feature map,

and

for each convolution kernel width and height;

the input feature map

In the formula (I), the compound is shown in the specification,

and

and

and

the operation of rounding up is carried out; the range of the shrinkage factor on the width dimension and the height dimension of the theoretical receptive field is respectively as follows:

And height

And is

i＝1，…，m_lAnd

j＝1，…，n_lare all positive integers, and are not limited to the integer,

And height

i＝1，…，m_l，j＝1，…，n_lSplicing according to the original position to obtain the output characteristic diagram of the first convolution layer in the block convolution neural network

Step seven, for the L partitioned convolution layers, operating according to the steps one to six until the L partitioned convolution layers are all operatedObtaining the output characteristic diagram x of the last L-th block convolution layer^(L+1)X is to be^(L+1)Inputting the data into a full connection layer to obtain the output probability p ∈ R of fine-grained image classificationⁿAnd n is the number of categories, so that the classification of fine-grained images is realized.

Step eight, in the model training process, cross entropy L is used_CE(t, p) and the real category t optimize the output probability p of the fine-grained image classification:

L_CE(t，p)＝-ln p_t。

in a second embodiment, the present embodiment is described with reference to fig. 3, and the present embodiment is an example of a fine-grained image classification method based on a block convolutional neural network according to the first embodiment: the embodiment can simplify the operation and improve the block convolution efficiency while finishing the block convolution operation. Setting a block convolutional neural network to have L block convolutional layers, wherein L is the number of layers of the current block convolutional layer, L is more than or equal to 1 and is less than or equal to L, and initializing to L is 1;

step 1, for the first block convolution layer f (·; omega) in the block convolution neural network^(l)) And "·" represents the input to the function.

Is the parameter of its convolution kernel, R is a real number,

the dimension of expression is

Real matrix of (d) for representing Ω^(l)The size of (d); c. C^(l)Is the number of channels of the input characteristic diagram,

and

is the width and height of each convolution kernel, and obtains its input characteristic diagram

Is the output characteristic diagram, W, of the first-1 blocked convolutional layer^(l)And H^(l)Is the width and height of the input feature map;

step 2, according to the number m of blocks in each row and each column on the preset feature diagram_lAnd n_lRandomly sampling to obtain the width of the feature map block

And height

And is

i＝1，…，m_lAnd

j＝1，…，n_lare all positive integers, and are not limited to the integer,

step 3, inputting the characteristic diagram x^(l)Every other row above

Insert into

Column all zero column vector, every other row

Insert into

The rows are all zero row vectors and the row vectors,

and

is the step size of the convolution kernel in the width and height dimensions of the feature map,

obtaining a processed feature map for a round-down operation

Step 4, adopting a convolution kernel parameter omega^(l)To pair

Performing convolution to obtain a convolution output characteristic diagram

Step 5, according to the positions of the all-zero column vectors and the all-zero row vectors inserted in the step 3, outputting the feature graph by convolution

The inserted vector is removed, and the removed column is marked with

The removed row numbers are

Obtaining an output feature map of the first convolutional layer in a partitioned convolutional neural network

Step 6, for all the block convolution layers, operating according to the steps 1 to 5 until obtaining the output characteristic diagram x of the last block convolution layer (L layer)^(L+1)X is to be^(L+1)Inputting the fine-grained image classification data into a full connection layer to obtain an output probability p of fine-grained image classification;

in this embodiment modeUsing Cross Entropy (CE) L_CE(t, p) and the true class t optimize the output probability p of the fine-grained image classification.

Claims

1. A fine-grained image classification method based on a block convolutional neural network is characterized in that the block convolutional neural network is set to have L block convolutional layers, wherein L is the number of layers of the current block convolutional layers, L is more than or equal to 1 and is less than or equal to L, and the initialization is that L is equal to 1; the method is characterized in that:

the method is realized by the following steps:

and

for each convolution kernel width and height;

the dimension of expression is

A real matrix of (d);

the input feature map

step two, when l is equal to 1, setting m₁＝n₁1 is ═ 1; when in useWhen l is more than 1, calculating the number m of blocks per line on the input feature map by the following formula_lAnd the number of blocks per column n_l：

In the formula (I), the compound is shown in the specification,

and

and

and

the operation of rounding up is carried out;

step three, according to the stepThe number m of blocks in each row and column on the input feature map obtained in the second step_lAnd n_lRandomly sampling to obtain the width of the feature map block

And height

And is

And

are all positive integers, and are not limited to the integer,

And height

2. The fine-grained image classification method based on the block convolutional neural network as claimed in claim 1, wherein: step eight, adopting cross entropy L_CE(t, p) and the real category t optimize the output probability p of the fine-grained image classification:

L_CE(t，p)＝-lnp_t。

3. the fine-grained image classification method based on the block convolutional neural network as claimed in claim 1, wherein: in the second step, the range of the shrinkage factor on the width dimension and the height dimension of the theoretical receptive field is respectively as follows:

4. the fine-grained image classification method based on the block convolutional neural network as claimed in claim 1, wherein: replacing the second step with the sixth step by the following steps:

step A, setting the number m of blocks in each row and column on the output characteristic diagram_lAnd n_lRandomly sampling to obtain the width of the feature map block

And height

And is

And

are all positive integers, and are not limited to the integer,

step B, inputting a characteristic diagram x^(l)Every other row above

Insert into

Column all zero column vector, every other row

Insert into

The rows are all zero row vectors and the row vectors,

and

step sizes of the convolution kernel in the feature width and height dimensions respectively,

obtaining a processed feature map for a round-down operation

Step C, adopting the convolution kernel parameter omega obtained in the step one^(l)To pair

Directly performing convolution to obtain convolution output characteristic diagram

Step D: c, according to the positions of all zero column vectors and all zero row vectors inserted in the step C, outputting the feature graph by convolution

The inserted vector is removed, and the removed column is marked with

The removed row numbers are