CN115830402B

CN115830402B - Fine-granularity image recognition classification model training method, device and equipment

Info

Publication number: CN115830402B
Application number: CN202310140142.7A
Authority: CN
Inventors: 余鹰; 王景辉
Original assignee: East China Jiaotong University
Current assignee: East China Jiaotong University
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2023-09-12
Anticipated expiration: 2043-02-21
Also published as: CN115830402A

Abstract

The invention provides a training method, a device and equipment for a fine-grained image recognition classification model, wherein the method comprises the following steps: inputting the fine-grained images into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers; obtaining classification vectors obtained by learning fine-grained images by a preset number of target self-attention layers; inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and respectively carrying out loss calculation on the classification label of each target self-attention layer and a preset real label; and updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer. The invention is beneficial to mining complementary information in classification vectors of different layers and is used for classification by introducing a progressive training mechanism; the multi-scale module is also provided, the complementary communication of global information and local information is realized, and the fine-grained image classification effect is improved.

Description

Fine-granularity image recognition classification model training method, device and equipment

Technical Field

The invention relates to the technical field of model training, in particular to a fine-grained image recognition classification model training method, device and equipment.

Background

Fine-grained image classification aims at identifying sub-categories within the same parent category. For example, benz, audi, lan Songya, parrot, labrador, gold hair, etc. under the same car category, under the same bird category. The fine-granularity image classification technology is paid attention to in the aspects of face recognition, traffic vehicle recognition, intelligent retail goods, agricultural disease recognition research, endangered animal protection and the like.

However, unlike conventional image classification problems, training dataset pictures for fine-grained image classification often have a large discriminative significance only in local fine areas. Existing fine-grained image classification models fall into two general categories: strong supervision model and weak supervision model. The strong supervision model relies on fine image annotation (e.g. manual annotation frames, key point information, etc.), which is mostly obtained by expert annotation in different aspects, and this annotation work requires a lot of time and effort due to the large sample data set. In addition, the annotation information may be subjectively affected and subject to error. Recent work based on weak supervision has been attracting attention from researchers, and this approach does not require additional image labeling, i.e., image-level labels as supervisory signals. For example, the Vision transducer (visual self-attention model, viT) recently proposed by Google is used for enlarging the wonderful color in the field of computer images, and only ViT can achieve good effect on fine-grained image classification, but there is still a shortage of achieving fine-grained image classification.

Thus, many researchers are also currently proposing various ViT-based variants, all with certain success. However, the existing ViT-based work mostly migrates the ideas existing in convolutional neural networks, lacking the thinking of a unique multi-headed attention mechanism in the ViT architecture. And most of the recent ViT work is solely on picture vector (patch token) and multi-head attention mechanisms, but ignores the importance of classification vectors (class token) in classifying. Existing ViT and some ViT variants only consider the advantageous information learned by the last attention layer for classification, while ignoring the complementary information learned by other layers, which would result in some loss of information, resulting in a lack of accuracy in fine-grained image classification of the model.

Disclosure of Invention

Based on the above, the invention aims to provide a fine-grained image recognition classification model training method, a device and equipment, so as to solve at least one technical problem in the prior art.

According to the embodiment of the invention, the fine-grained image recognition classification model training method comprises the following steps:

acquiring fine-grained images for model training, inputting the fine-grained images into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained images sequentially pass through each self-attention layer so as to learn classification vectors of the fine-grained images through the self-attention layers;

Obtaining classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, wherein the target self-attention layers are positioned at the rear ends of the multi-layer self-attention layers;

inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and respectively carrying out loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;

and updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.

In addition, the training method for the fine-grained image recognition classification model according to the embodiment of the invention may further have the following additional technical features:

further, the method further comprises the following steps:

calculating a final attention weight matrix of the self-attention layer after classifying vector learning on the fine-grained image according to a preset calculation rule;

determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image in the fine-granularity image according to the position of the classification target;

And scaling the classified target area image to the same size as the fine-granularity image, and inputting the classified target area image into the preset network model for training so as to intensively train the fine-granularity image recognition classification model.

Further, the preset network model further includes a linear projection layer and a position coding layer, and the step of inputting the fine-grained image into the preset network model for training includes:

dividing the fine-grained image into preset sub-images according to preset dividing rules, and mapping each sub-image to a high-dimensional feature space through the linear projection layer to obtain picture vectors of each sub-image;

coding the picture vectors of each sub-image through the position coding layer to add position coding information for each picture vector, and adding an empty classification vector before the first picture vector to obtain a vector sequence;

and inputting the vector sequence into the multi-layer self-attention layer to perform classification vector learning, wherein the classification characteristic learned by each layer of self-attention layer is updated in the classification vector of the vector sequence, so as to obtain the classification vector of each layer of self-attention layer.

Further, the self-attention layer includes a plurality of attention heads, and the step of calculating a final attention weight matrix of the self-attention layer after classifying the fine-grained image according to a preset calculation rule includes:

after the fine-grained images are subjected to classification vector learning, in each attention head, attention weights of classification vectors and each picture vector in the layer are calculated respectively, and an attention weight matrix corresponding to each attention head is obtained;

and performing point multiplication calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix.

Further, the calculation formula of the attention weight is:

in the formula ,is the firstlThe first of the attention headsiAttention weights of the slice picture vector and the classification vector,Qto checkThe vector of the poll is a function of the polling vector,Kin the form of a key vector,Vas a vector of values,d _k for the mapped spatial dimension of the attention header,Ttranspose the matrix; wherein the attention weight matrixAExpressed as:

wherein ,l∈1,2,…,L，i∈1,2,…,K，Lrepresenting the number of attention deficit points,Krepresenting the number of picture vectors.

Further, the step of determining the location of the classification target according to the final self-attention weight matrix includes:

Calculating the average value of all the attention weights in the final attention weight matrix;

comparing each attention weight in the final attention weight matrix with the average value, marking the attention weight larger than the average value as a first threshold value, and marking the attention weight larger than the average value as a second threshold value;

and determining the position of the classification target according to the position coding information of the target picture vector with the attention weight of the classification vector being a first threshold value.

Further, the step of respectively performing loss calculation on the classification label and the preset real label of each target self-attention layer to obtain a loss value of each target self-attention layer includes:

respectively carrying out cross entropy loss calculation on the classification labels of each target self-attention layer and preset real labels to obtain a loss value of each target self-attention layer;

the formula of the cross entropy loss calculation is as follows:

in the formula ,y ^r is the firstrThe class labels of the individual target self-attention layers,yin order to preset the actual label to be displayed,LOSS _CE (y ^r ,y) Is the firstrCross entropy loss values of the classified labels of the individual target self-attention layers and preset real labels, the preset number is 3,r∈1,2,3。

according to an embodiment of the invention, a fine-grained image recognition classification model training system comprises:

The image acquisition module is used for acquiring fine-grained images for model training, inputting the fine-grained images into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained images sequentially pass through each self-attention layer so as to learn classification vectors of the fine-grained images through the self-attention layers;

the vector acquisition module is used for acquiring classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, and the target self-attention layers are positioned at the rear ends of the multiple layers of self-attention layers;

the loss calculation module inputs the classification vector of each target self-attention layer into a preset classifier, outputs the classification label of each target self-attention layer, and respectively carries out loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;

and the progressive training module is used for updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-granularity image recognition classification model.

The invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the fine-grained image recognition classification model training method described above.

The invention also provides fine-granularity image recognition classification model training equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the fine-granularity image recognition classification model training method is realized when the processor executes the program.

The beneficial effects of the invention are as follows: through improving the traditional ViT structure, a progressive training mechanism is introduced, classification vectors of different layers in the ViT structure are selected, the attention layer learning beneficial information of the last layer is not simply paid attention to, the importance of the classification vectors in classification is paid attention to, the learned information can be well transmitted upwards, complementary information in the classification vectors of different layers is conveniently mined and used for classification, and the precision effect of fine-grained image classification is improved.

Drawings

FIG. 1 is a picture of a gull of California provided in an embodiment of the invention;

FIG. 2 is a view of a North gull provided in an embodiment of the present invention;

FIG. 3 is a flowchart of a fine-grained image recognition classification model training method in a first embodiment of the invention;

FIG. 4 is a schematic diagram of a modified ViT model structure provided in an embodiment of the invention;

fig. 5 is a block diagram of a fine-grained image recognition classification model training system in accordance with a third embodiment of the invention.

The following detailed description will further illustrate the invention with reference to the above-described drawings.

Detailed Description

In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are presented in the figures. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

It will be understood that when an element is referred to as being "mounted" on another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Fine-grained image classification aims at identifying sub-categories within the same parent category. For example, benz, audi, lan Songya, parrot, labrador, gold hair, etc. under the same car category, under the same bird category. The fine-granularity image classification technology is paid attention to in the aspects of face recognition, traffic vehicle recognition, intelligent retail goods, agricultural disease recognition research, endangered animal protection and the like. However, unlike conventional image classification problems, training dataset pictures for fine-grained image classification often have a large discriminative significance only in local fine areas. As shown in fig. 1-2, fig. 1 is a gull of california and fig. 2 is a gull of arctic. Although the sea gulls are of different types, the two types are quite similar, and are difficult for common people to distinguish by naked eyes. And the seagulls belonging to the same category are difficult to judge whether the seagulls are of the same category or not due to the problems of shooting angles, illumination, flight postures and the like. Because of the large intra-class differences and small inter-class differences of the special images, the fine-granularity image recognition is more difficult and challenging than the traditional image classification.

Recently, the Vision transducer (ViT, visual self-attention model) proposed by Google is used for enlarging the wonderful color in the field of computer images, and only ViT can achieve good effect on fine-granularity image classification, but the classification of fine-granularity images still has the defect of being realized. Thus, many researchers are also currently proposing various ViT-based variants, all with certain success. However, the existing ViT-based work mostly migrates the ideas existing in convolutional neural networks, lacking the thinking of a unique multi-headed attention mechanism in the ViT architecture. And most of the recent ViT work is solely on picture vector (latch token) and multi-head attention mechanisms, but ignores the importance of classification vector (classification token) in classification. And existing ViT and some ViT variations only consider the advantageous information learned by the last attention layer for classification, and ignore the complementary information learned by other layers, which would result in some loss of information, resulting in a lack of accuracy in fine-grained image classification of the model.

Based on the above, the invention aims to improve the traditional ViT structure and provide a set of brand-new training method for the fine-granularity image classification model, so that the fine-granularity image classification model obtained by training has better classification precision, and the classification effect of the model is improved. The embodiments will be described in detail with reference to the following specific examples.

Example 1

Referring to fig. 3, a fine-grained image recognition classification model training method according to a first embodiment of the invention is shown, wherein the fine-grained image recognition classification model training method can be implemented by software and/or hardware, and the method includes steps S01-S04.

Step S01, fine-grained images for model training are obtained and input into a preset network model for training, the preset network model comprises multiple self-attention layers, and the fine-grained images sequentially pass through each self-attention layer so as to learn classification vectors of the fine-grained images through the self-attention layers.

In this embodiment, the preset network model is specifically an improved ViT model, please refer to fig. 4, the improved ViT model includes multiple layers of self-attention layers (convermerLayer), wherein the last three layers of self-attention layers are respectively connected with MLPHead classification heads, and the three MLPHead classification heads are respectively marked as MLP1, MLP2 and MLP3 in the figure, so that the classification vector (classification) learned by the self-attention layers can output the corresponding classification result through the corresponding MLPHead classification heads.

In specific implementation, a large number of fine-grained pictures of different types can be collected, the pictures of the same type are classified into one type, and the same real label is preset for each type of picture, for example, a large number of North gull pictures and a large number of California seagull pictures are collected, the North gull pictures are classified into one type and are endowed with the real label capable of representing the North gull characteristics, and the California seagull pictures are classified into one type and are endowed with the real label representing the California seagull characteristics. And then training the improved ViT model by using different kinds of fine-grained pictures as training sets, wherein the fine-grained pictures can be sequentially input into each self-attention layer of the ViT model during training so as to learn classification vectors of the fine-grained pictures through the self-attention layers. Preferably, in the implementation, the real tag may be a label of each category, for example, the gull real tag is assigned a value of 1, the gull real tag in california is assigned a value of 2, or the real tag may be other identifying information such as a name, fine granularity feature, etc. of each category.

Step S02, obtaining classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, wherein the target self-attention layers are positioned at the rear ends of the multiple self-attention layers.

Step S03, inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and respectively carrying out loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer.

In this embodiment, the preset classifier is an MLPHead classification header.

And step S04, updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-granularity image recognition classification model.

In specific implementation, the embodiment specifically selects the last three self-attention layers as the target self-attention layers, that is, the preset number is three, then respectively carries out classification output on the classification vectors learned by the last three self-attention layers through the corresponding MLPHead classification heads to obtain classification labels of each target self-attention layer, then respectively calculates the loss values of the classification labels and the real labels of each target self-attention layer, thereby obtaining the loss values of the last three self-attention layers, and then respectively carries out iterative modification on the network parameters of the previous self-attention layers by using the loss values of each self-attention layer and adopting a back propagation mechanism to finally train a fine-granularity image recognition classification model capable of accurately carrying out fine-granularity image recognition classification. Of course, in other embodiments, other numbers and/or other locations of self-attention layers may be used for classification, such as selecting the last four self-attention layers as the target self-attention layer.

Specifically, as a preferred embodiment, the step of respectively performing loss calculation on the classification label and the preset real label of each target self-attention layer to obtain a loss value of each target self-attention layer includes:

the formula of the cross entropy loss calculation is as follows:

that is, the embodiment selects the last three self-attentive layers to be respectively connected with the MLPHead classification heads based on the structure of the traditional ViT, so as to select the beneficial information learned by the last three self-attentive layers for classification, and accordingly proposes a progressive step-by-step training mode, namely, updating network parameters by the loss values of the last three self-attentive layers through a back propagation mechanism respectively, so as to guide the model to learn multiple layers of complementary information. It should be noted that the training method in this embodiment is not simple to superimpose and counter-propagate the losses of the different layers. But rather back-propagates and updates parameters for each loss separately, which helps better coordinate the different layers of the model. In addition, the information learned by the bottom layer is transferred to the upper layer in a progressive mode, which is beneficial to model learning and convergence.

In summary, according to the training method for the fine-grained image recognition classification model in the above embodiment of the invention, by improving the traditional ViT structure and introducing a progressive training mechanism, the classification vectors of different levels in the ViT structure are selected, so that only the favorable information learned by the last attention layer is not simply paid attention to, and meanwhile, the importance of the classification vectors in classification is paid attention to, so that the learned information can be well transmitted upwards, complementary information in the classification vectors of different levels is conveniently mined and used for classification, and the precision effect of fine-grained image classification is improved.

Example two

In a second embodiment of the present invention, a training method for a fine-grained image recognition classification model is also provided, where the training method for the fine-grained image recognition classification model may be implemented by software and/or hardware, please refer to fig. 4, the modified ViT model further includes a linear projection layer (LinearProjection of Flattened Patches), a position coding layer (positioning layer), and a multi-scale module, the fine-grained image is first mapped in a high-dimensional feature space through the linear projection layer to obtain a corresponding picture vector, and then the picture vector is position coded by the position coding layer and then input into a subsequent multi-layer self-attention layer (TransformerLayer), and meanwhile, the embodiment further adds the multi-scale module based on the conventional ViT structure, as shown in fig. 4, the multi-scale module is located on the right side of the multi-layer self-attention layer, and each self-attention layer is connected with the multi-scale module. It can be understood that the multi-head attention mechanism of ViT makes the local attention mechanism of each self-attention layer pay attention to global information, and in fine-grained image recognition, some discriminative areas are often tiny local areas, so in order to better promote the model to learn some significant area information, the embodiment proposes a multi-scale module, the main function of the module is to multiply attention weights of each layer by superposition of multi-head attention of the self-attention layer and map the attention weights to original images, so as to find the local discriminative area of each self-attention layer, which corresponds to attention, from the original images, then intercept the corresponding local discriminative area, and then re-input the local discriminative area picture intercepted by each layer into the model for training, so as to help the model find the discriminative local area on the basis of learning global information, thereby realizing complementary communication between global information and local information, and improving fine-grained image classification effect. Specifically, in the present embodiment, the fine-grained image recognition classification model training method specifically includes steps S11-S16.

And S11, dividing the fine-grained image into preset sub-images according to preset dividing rules, and mapping each sub-image to a high-dimensional feature space through the linear projection layer to obtain picture vectors of each sub-image.

In a specific implementation, each fine-grained image may be equally divided into preset sub-images according to a preset division size, for example, the fine-grained image may be equally divided into 9 sub-images according to a division size of 3×3, where each sub-image corresponds to a picture vector.

And step S12, coding the picture vectors of each sub-image through the position coding layer so as to add position coding information for each picture vector, and adding an empty classification vector before the first picture vector to obtain a vector sequence.

In some alternative embodiments, the position-coding information may be position coordinate information of the sub-images over the whole fine-grained image, and since the picture segmentation rules are known, position coordinate information of each sub-image over the whole fine-grained image is also known. Or in other alternative embodiments, the position coding information may be the number of the sub-images, and specifically, each sub-image may be numbered according to the sequence of segmentation.

Step S13, inputting the vector sequence into the multi-layer self-attention layer for classifying vector learning, wherein the classifying features learned by each layer of self-attention layer are updated in the classifying vectors of the vector sequence, so as to obtain the classifying vectors of each layer of self-attention layer.

Step S14, selecting the last three self-attention layers as target self-attention layers, and obtaining classification vectors obtained by each target self-attention layer for learning the fine-grained images.

Step S15, inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and respectively carrying out cross entropy loss calculation on the classification label of each target self-attention layer and a preset real label to obtain the loss value of each target self-attention layer.

And S16, updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-granularity image recognition classification model.

In addition, while the model training is performed, the fine-grained image recognition classification model training method according to the embodiment further includes:

It should be appreciated that the reinforcement training is part of the model training corresponding to the multi-scale module, and may be performed simultaneously during the progressive training process.

The step of calculating a final attention weight matrix of the self-attention layer after classifying the fine-grained image according to a preset calculation rule specifically includes:

And performing point multiplication (i.e. matrix multiplication) on the attention weight matrix of all the attention heads to obtain the final attention weight matrix. The calculation formula of the attention weight is as follows:

in the formula ,is the firstlThe first of the attention headsiAttention weights of the slice picture vector and the classification vector,Qin order to query the vector of the vector,Kin the form of a key vector,Vas a vector of values,d _k for the mapped spatial dimension of the attention header,Ttranspose the matrix; in the specific implementation, the classification vector and the picture vector are respectively divided into three parts, one part is the query vectorQOne part is a key vectorKOne part is a value vectorVThen according to the query vector corresponding to the classification vector and the picture vectorQKey vectorKSum vectorVAnd calculating the relation degree of the classification vector and the picture vector to obtain the attention weight.

Wherein the attention weight matrixAExpressed as:

wherein ,l∈1,2,…,L，i∈1,2,…,K，Lrepresenting the number of attention deficit points,Krepresenting picture vectorsNumber of parts.

Based on the above, the step of determining the location of the classification target according to the final self-attention weight matrix specifically includes:

comparing each attention weight in the final attention weight matrix with the average value, wherein the attention weight larger than the average value is marked as a first threshold value, otherwise, is marked as a second threshold value, and when the method is implemented, the first threshold value can be set to be 1, and the second threshold value can be set to be 0;

It should be noted that, because the classification vector is a classification vector obtained by classifying and learning the whole fine-grained image from the attention layer, the location area of the classification target (such as the gull) in the whole fine-grained image is more noticed, so that the target picture vector with the attention weight of the classification vector being the first threshold value is necessarily a picture which is closer to or belongs to the location of the classification target, therefore, the location of the classification target can be determined according to the location coding information of the target picture vector with the attention weight of the classification vector being the first threshold value, and then the classification target area image where the classification target is located can be mapped into the original image to intercept the classification target, and further the model is subjected to reinforcement training through the classification target area image, so that the discrimination local area is found on the basis of learning global information, thereby realizing the complementary communication of global information and local information, and further improving the classification effect of the fine-grained image.

Compared with the traditional ViT structure and other ViT varieties, the model provided by the embodiment has at least the following advantages, and can effectively improve the performance and accuracy of fine-grained image classification tasks. The method comprises the following steps:

1) The embodiment provides a training method of a fine-grained image recognition classification model, which can perform end-to-end training and can perform training only by using picture-level labels;

2) The embodiment improves the traditional ViT structure, and by introducing progressive training and selecting classification vectors of different levels in the ViT structure, learned information can be well transferred upwards, so that complementary information in the classification vectors of different levels can be conveniently mined and used for classification;

3) The embodiment provides a multi-scale module which helps a model to learn global information and discover a discriminant local area, so that complementary communication between the global information and the local information is realized, and the fine-grained image classification effect is improved.

Example III

In another aspect, please refer to fig. 5, which shows a fine-grained image recognition classification model training system according to a third embodiment of the invention, the fine-grained image recognition classification model training system includes:

an image acquisition module 11, configured to acquire fine-grained images for model training, and input the fine-grained images into a preset network model for training, where the preset network model includes multiple self-attention layers, and the fine-grained images sequentially pass through each self-attention layer, so as to learn classification vectors of the fine-grained images through the self-attention layers;

The vector acquisition module 12 is configured to acquire classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, where the target self-attention layers are located at the rear ends of the multiple self-attention layers;

the loss calculation module 13 inputs the classification vector of each target self-attention layer into a preset classifier, outputs the classification label of each target self-attention layer, and respectively calculates the loss of the classification label of each target self-attention layer and a preset real label to obtain the loss value of each target self-attention layer;

the progressive training module 14 is configured to update network parameters according to the loss value of each target self-attention layer through a back propagation mechanism, so as to train the fine-grained image recognition classification model.

Further, in some alternative embodiments of the invention, the system further comprises:

the multi-scale training module is used for calculating a final attention weight matrix of the self-attention layer after classifying vector learning is carried out on the fine-grained images according to a preset calculation rule; determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image in the fine-granularity image according to the position of the classification target; and scaling the classified target area image to the same size as the fine-granularity image, and inputting the classified target area image into the preset network model for training so as to intensively train the fine-granularity image recognition classification model.

Further, in some optional embodiments of the present invention, the preset network model further includes a linear projection layer and a position coding layer, and the image obtaining module 11 is further configured to divide the fine-grained image into preset sub-images according to a preset dividing rule, and map each sub-image to a high-dimensional feature space through the linear projection layer, so as to obtain a picture vector of each sub-image; coding the picture vectors of each sub-image through the position coding layer to add position coding information for each picture vector, and adding an empty classification vector before the first picture vector to obtain a vector sequence; and inputting the vector sequence into the multi-layer self-attention layer to perform classification vector learning, wherein the classification characteristic learned by each layer of self-attention layer is updated in the classification vector of the vector sequence, so as to obtain the classification vector of each layer of self-attention layer.

Further, in some optional embodiments of the present invention, the self-attention layer includes a plurality of attention heads, and the multi-scale training module is further configured to, after performing classification vector learning on the fine-grained image, respectively calculate, in each of the attention heads, an attention weight of a classification vector and each picture vector in the layer, so as to obtain an attention weight matrix corresponding to each of the attention heads; and performing point multiplication calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix.

The calculation formula of the attention weight is as follows:

in the formula ,is the firstlThe first of the attention headsiAttention weights of the slice picture vector and the classification vector,Qin order to query the vector of the vector,Kin the form of a key vector,Vas a vector of values,d _k for the mapping space dimension of the attention head, T is the matrix transposition; wherein the attention weight matrixAExpressed as:

Further, in some optional embodiments of the present invention, the multi-scale training module is further configured to calculate an average value of all attention weights in the final attention weight matrix; comparing each attention weight in the final attention weight matrix with the average value, marking the attention weight larger than the average value as a first threshold value, and marking the attention weight larger than the average value as a second threshold value; and determining the position of the classification target according to the position coding information of the target picture vector with the attention weight of the classification vector being a first threshold value.

Further, in some optional embodiments of the present invention, the loss calculation module 13 is further configured to perform cross entropy loss calculation on the classification label and the preset real label of each target self-attention layer, so as to obtain a loss value of each target self-attention layer;

The formula of the cross entropy loss calculation is as follows:

the present invention also proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a fine-grained image recognition classification model training method as described above.

The invention also provides fine-granularity image recognition classification model training equipment, which comprises a processor, a memory and a computer program stored on the memory and capable of running on the processor, wherein the fine-granularity image recognition classification model training method is realized when the processor executes the computer program.

The fine-granularity image recognition and classification model training device can be a computer, a server, a camera device and the like. The processor may in some embodiments be a central processing unit (CentralProcessing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip for running program code or processing data stored in a memory, e.g. executing an access restriction program or the like.

Wherein the memory comprises at least one type of readable storage medium including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory may in some embodiments be an internal storage unit of a fine-grained image recognition classification model training device, such as a hard disk of the fine-grained image recognition classification model training device. The memory may also be an external storage device of the fine-grain image recognition classification model training apparatus in other embodiments, for example, a plug-in hard disk, a Smart Media Card (SMC), a secure digital (SecureDigital, SD) Card, a flash Card (FlashCard), etc. that are provided on the fine-grain image recognition classification model training apparatus. Further, the memory may also include both internal storage units and external storage devices of the fine-grained image recognition classification model training apparatus. The memory can be used for storing application software and various data installed in the fine-grained image recognition classification model training equipment, and can be used for temporarily storing data which are output or are to be output.

Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. A fine-grained image recognition classification model training method, the method comprising:

acquiring fine-grained images for model training, inputting the fine-grained images into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers and a multi-scale module, each self-attention layer is respectively connected with the multi-scale module, and the fine-grained images sequentially pass through each self-attention layer so as to learn classification vectors of the fine-grained images through the self-attention layers;

according to the loss value of each target self-attention layer, updating network parameters through a back propagation mechanism respectively so as to train the fine-grained image recognition classification model;

furthermore, the method comprises the following steps:

scaling the classified target area image to the same size as the fine-granularity image, and inputting the classified target area image into the preset network model for training so as to intensively train the fine-granularity image recognition classification model;

the step of inputting the fine-grained image into the preset network model for training comprises the following steps of:

the picture vector of each sub-image is encoded through the position encoding layer, so that position encoding information is added for each picture vector, an empty classification vector is added in front of a first picture vector, and a vector sequence is obtained, wherein the position encoding information is position coordinate information of the sub-image in the whole fine-granularity image or the number of the sub-image;

inputting the vector sequence into the multi-layer self-attention layer to perform classification vector learning, wherein the classification characteristic learned by each layer of self-attention layer is updated in the classification vector of the vector sequence to obtain the classification vector of each layer of self-attention layer;

the step of calculating a final attention weight matrix of the self-attention layer after classifying the fine-grained image according to a preset calculation rule comprises the following steps:

Performing point multiplication calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix;

the step of determining the position of the classification target according to the final self-attention weight matrix comprises the following steps:

2. The fine-grained image recognition classification model training method according to claim 1, wherein the attention weight is calculated according to the following formula:

in the formula ,is the firstlThe first of the attention headsiAttention weights of the slice picture vector and the classification vector,Qin order to query the vector of the vector,Kin the form of a key vector,Vas a vector of values,d _k for the mapped spatial dimension of the attention header,Ttranspose the matrix; wherein the attention weight matrixAExpressed as:

wherein ,l∈1,2,…,L，i∈1,2,…,K，Lrepresenting the number of attention deficit points, KRepresenting the number of picture vectors.

3. The training method of fine-grained image recognition classification model according to claim 1, wherein the step of respectively performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer comprises the following steps:

the formula of the cross entropy loss calculation is as follows:

4. a fine-grained image recognition classification model training system, the system comprising:

the system comprises an image acquisition module, a model training module and a classification vector learning module, wherein the image acquisition module is used for acquiring fine-grained images for model training and inputting the fine-grained images into a preset network model for training, the preset network model comprises a plurality of self-attention layers and a multi-scale module, each self-attention layer is respectively connected with the multi-scale module, and the fine-grained images sequentially pass through each self-attention layer so as to learn classification vectors of the fine-grained images through the self-attention layers;

the progressive training module is used for updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-granularity image recognition classification model;

furthermore, the system further comprises:

the multi-scale training module is used for calculating a final attention weight matrix of the self-attention layer after classifying vector learning is carried out on the fine-grained images according to a preset calculation rule; determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image in the fine-granularity image according to the position of the classification target; scaling the classified target area image to the same size as the fine-granularity image, and inputting the classified target area image into the preset network model for training so as to intensively train the fine-granularity image recognition classification model;

The image acquisition module is further used for dividing the fine-grained image into preset sub-images according to preset dividing rules, and mapping each sub-image to a high-dimensional feature space through the linear projection layer to obtain picture vectors of each sub-image; coding the picture vectors of each sub-image through the position coding layer to add position coding information for each picture vector, and adding an empty classification vector before the first picture vector to obtain a vector sequence; inputting the vector sequence into the multi-layer self-attention layer to perform classification vector learning, wherein the classification characteristic learned by each layer of self-attention layer is updated in the classification vector of the vector sequence to obtain the classification vector of each layer of self-attention layer;

the self-attention layer comprises a plurality of attention heads, and the multi-scale training module is further used for respectively calculating the attention weights of the classification vectors and each picture vector in the self-attention layer in each attention head after classifying vector learning is carried out on the fine-grained images to obtain an attention weight matrix corresponding to each attention head; performing point multiplication calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix;

The multi-scale training module is further used for calculating an average value of all attention weights in the final attention weight matrix; comparing each attention weight in the final attention weight matrix with the average value, marking the attention weight larger than the average value as a first threshold value, and marking the attention weight larger than the average value as a second threshold value; and determining the position of the classification target according to the position coding information of the target picture vector with the attention weight of the classification vector being a first threshold value.

5. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the fine-grained image recognition classification model training method of any of claims 1-3.

6. A fine-grained image recognition classification model training apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the fine-grained image recognition classification model training method of any of claims 1-3 when the program is executed by the processor.