CN115131607A - Image classification method and device

Info

Publication number
CN115131607A
Authority
CN
China
Prior art keywords: vector sequence, image, attention, self, layer
Legal status: Pending
Application number
CN202210681224.8A
Other languages
Chinese (zh)
Inventor
祖宝开
李建强
王宏远
李亚芳
白建川
Current Assignee: Beijing University of Technology; CETC 15 Research Institute
Original Assignee: Beijing University of Technology; CETC 15 Research Institute
Application filed by Beijing University of Technology and CETC 15 Research Institute
Priority to CN202210681224.8A
Publication of CN115131607A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an image classification method and device. The image classification method comprises the following steps: acquiring an image to be classified, and preprocessing it to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector into a classifier of the visual Transformer model to obtain a classification result of the image to be classified. By the image classification method, the accuracy of image classification can be improved.

Description

Image classification method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an image classification method and device.
Background
With the development of artificial intelligence technology, the Vision Transformer (ViT) model is widely applied in the field of image processing. The ViT model can achieve superior performance on many visual tasks, such as image classification.
Because using the ViT model for image classification requires a large-scale dataset to train it, research has focused on designing more complex model architectures with more layers to improve the ViT model's efficiency in processing data.
However, as the number of layers in the ViT model increases, the accuracy of image classification drops sharply. How to improve the accuracy of image classification based on the ViT model is therefore an important problem to be solved in the industry.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides an image classification method and device.
The invention provides an image classification method, which comprises the following steps:
acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence;
inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L;
determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Optionally, the coding block further includes a first normalization layer, a second normalization layer, and a feed-forward layer;
the inputting the embedded input vector sequence into the encoder and outputting the coding vector sequence corresponding to the image to be classified includes:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Optionally, the residual multi-head self-attention layer includes a first linear layer, a weighted residual scaling dot product attention layer, and a second linear layer, where the weighted residual scaling dot product attention layer includes H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected with the self-attention map of the (M-1)-th weighted residual scaling dot product attention layer in a residual manner, and H is a positive integer;
the inputting the processed embedded input vector sequence into the residual multi-headed self-attention layer to generate a first vector sequence, including:
inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified;
inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate H self-attention results corresponding to the self-attention heads;
and splicing self-attention results corresponding to the H attention heads, and inputting the splicing results into the second linear layer to generate the first vector sequence.
Optionally, the inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads includes:
for each self-attention head, generating an Mth-layer self-attention layer output map based on the index vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map;
and generating a self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
Optionally, the preprocessing the image to be classified to obtain an embedded input vector sequence includes:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate the embedding input vector sequence.
Optionally, the determining, based on the coding vector sequence, a feature vector corresponding to the image to be classified includes:
and determining the category embedding vector as a feature vector corresponding to the image to be classified.
Optionally, the classifier is obtained by training with a cross entropy loss function.
The present invention also provides an image classification apparatus, comprising:
the system comprises a preprocessing module, a classification module and a classification module, wherein the preprocessing module is used for acquiring an image to be classified and preprocessing the image to be classified to obtain an embedded input vector sequence;
an encoding module, configured to input the embedded input vector sequence into an encoder of a visual Transformer model and output a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L;
and the classification module is used for determining a feature vector corresponding to the image to be classified based on the coding vector sequence, inputting the feature vector to a classifier of a visual Transformer model, and obtaining a classification result of the image to be classified.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the image classification method as described in any of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image classification method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image classification method as described in any one of the above.
The image classification method and device provided by the invention input the embedded input vector sequence into an encoder of a visual Transformer model comprising residual multi-head self-attention layers, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner. Based on the self-attention map and self-attention layer output map connected in this residual manner, information exchange among coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; inputting the embedded input vector sequence into the visual Transformer model thus yields a more comprehensive coding vector sequence corresponding to the image to be classified, and classifying the image based on this coding vector sequence can effectively improve the accuracy of image classification.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of an image classification method according to the present invention;
FIG. 2 is a schematic structural diagram of the visual Transformer model provided in the present invention;
FIG. 3 is a second flowchart of the image classification method according to the present invention;
FIG. 4 is a schematic structural diagram of an image classification apparatus provided in the present invention;
fig. 5 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the development of artificial intelligence technology, the ViT model is widely applied in the field of image processing. Unlike Convolutional Neural Networks (CNNs), which rely on convolution to process local features, the ViT model uses a self-attention mechanism to establish relationships between image block embeddings (Tokens), and this ability to aggregate global information greatly improves the performance of the ViT model. Transformers can achieve good performance on many visual tasks, including image classification, image enhancement, object detection, and video processing.
However, while the ViT model is able to model global information and is more flexible than CNN models in learning image representations, its larger capacity also means that the ViT model requires larger-scale datasets for pre-training. Researchers have therefore improved the data efficiency of the ViT model by designing more complex network architectures or training methods.
Notably, the self-attention (SA) mechanism of the ViT model is a key factor in the ability of the ViT model to aggregate global information. In the self-attention SA module of the ViT model, each Token is updated according to the self-attention map aggregating the features of all Tokens, and in this way, information can be sufficiently exchanged between Tokens, thereby providing a strong expression capability.
However, as the number of layers of the ViT model increases, the ViT model has a network degradation problem, and the accuracy of image classification based on the ViT model is greatly reduced.
Based on the above problems, and in order to enhance information exchange between the layers of the ViT model, alleviate the network degradation problem, and thereby improve the accuracy of ViT-based image classification, the invention provides an image classification method that can effectively enhance information exchange between ViT model layers, alleviate the network degradation problem of the visual Transformer model, and thus effectively improve the accuracy of image classification.
The image classification method provided by the present invention is described in detail below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a schematic flow chart of an image classification method provided by the present invention, and specifically includes steps 101 to 103.
Step 101, obtaining an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence.
Specifically, in this embodiment, the image to be classified is an image that needs to be classified, and it may take any of a variety of formats, for example jpg, png, tif, or pdf.
After the image to be classified is acquired, preprocessing is carried out on the image to be classified to obtain an embedded input vector sequence, wherein the embedded input vector sequence is a sequence input to a visual Transformer model.
Optionally, the image to be classified is preprocessed to obtain an embedded input vector sequence, which may be specifically implemented in the following manner:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate the embedding input vector sequence.
Specifically, in this embodiment, after the image to be classified is acquired, the image to be classified needs to be split into a plurality of image blocks, and an image block embedded vector sequence is generated based on the image blocks.
In practice, given an image to be classified $x \in \mathbb{R}^{H \times W \times C}$, the image is first split into N square image patches of size P × P, and the patches are reshaped into a sequence, i.e. the patch sequence can be expressed as $x_p \in \mathbb{R}^{N \times (P^2 C)}$.
Here H and W are the height and width of the image to be classified, i.e. (H, W) is the resolution of the image; (P, P) is the resolution of each patch; C is the number of image channels (e.g., C = 3); and N is the number of patches, given by the following formula (1):

$N = HW / P^2$    (1)

For example, a 224 × 224 RGB image with P = 16 yields N = 196 patches, each of flattened dimension $P^2 C = 768$. After reshaping the N patches into a sequence, the patch sequence $x_p \in \mathbb{R}^{N \times (P^2 C)}$ is mapped to a lower dimension using a linear projection layer $E$, i.e. each $P^2 C$-dimensional patch is mapped to D dimensions, generating the image block embedded vector sequence $x_p E \in \mathbb{R}^{N \times D}$.
After an image block embedded vector sequence is generated, a category embedded vector and a position embedded vector need to be added to the image block embedded vector sequence respectively, and then an embedded input vector sequence is generated, wherein the category embedded vector is used for classifying images to be classified, and the position embedded vector is used for representing spatial position information between image blocks.
For example, for a given length-N image block embedded vector sequence, a one-dimensional learnable embedding vector $x_{class}$ of dimension D is added at the head position of the sequence as the class embedding vector used for classification, generating an image block embedded vector sequence in $\mathbb{R}^{(N+1) \times D}$ whose total length is N + 1; this learnable embedding vector is randomly initialized at training.
It should be noted that the output feature corresponding to this class embedding vector $x_{class}$ after the sequence is input to the visual Transformer model, denoted $z_L^0$, is used as the image representation; that is to say, inputting $z_L^0$ into a classifier (also called the classification head) of the visual Transformer model classifies the image.
After the class embedding vector has been added to the image block embedding vector sequence, a position embedding vector is further added to the sequence to generate the embedded input vector sequence.
Specifically, a position embedding vector $E_{pos}$ may be added to the image block embedding vector sequence to preserve the spatial position information between the input image blocks. Here, the image block embedding vectors and the position embedding vectors are added element-wise, using standard learnable 1-D position encodings.
That is, the embedded input vector sequence $z_0$ is constructed from the image block embedded vector sequence $x_p E$, the class embedding vector $x_{class}$, and the position embedding vector $E_{pos}$; specifically, $z_0$ can be expressed as follows:

$z_0 = [x_{class};\ x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E] + E_{pos}$

where $E \in \mathbb{R}^{(P^2 C) \times D}$ and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$.
in the above embodiment, a class embedding vector is added to the image block embedding vector sequence, so that the classification of the image to be classified can be realized; the position embedded vectors are added according to the image block embedded vector sequence, and the spatial position information among the image blocks can be represented, so that the accuracy of image classification is improved.
Step 102, inputting the embedded input vector sequence into an encoder of the visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L.
Specifically, in this embodiment, after the image to be classified is preprocessed to obtain the embedded input vector sequence, the embedded input vector sequence needs to be input to an encoder of the visual Transformer model, so as to output the encoding vector sequence corresponding to the image to be classified.
It should be noted that the encoder of the visual Transformer model includes L Transformer coding blocks, each coding block includes a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner.
In practical application, residual connection is an effective strategy in a deep network, and the connection mode can enhance information exchange between model layers and reduce the problem of network degradation in the process of deep network learning.
Specifically, referring to fig. 2, fig. 2 is a schematic structural diagram of a visual transform model provided in the present invention.
In fig. 2, (a) shows a schematic structural diagram of the visual Transformer model. In (a), the visual Transformer model is divided into a Transformer encoder and a classifier, where the encoder includes L Transformer coding blocks and each coding block includes a residual multi-head self-attention layer (MSA) and a feed-forward neural network layer (i.e. the feed-forward layer, FFN), each preceded by a normalization layer and wrapped in a residual connection. Note that the FFN layer includes two fully-connected layers (FCs): the first FC transforms the feature dimension and the second FC restores it to the dimension before the change, with a Gaussian Error Linear Unit (GELU) as the non-linear activation function in between.
Optionally, in a possible implementation manner of the embodiment of the present invention, the encoder of the visual Transformer model further includes a first normalization layer, a second normalization layer, and a feedforward layer, and particularly, as shown in fig. 2 (a), the first normalization layer is connected to the residual multi-headed self-attention layer, and the second normalization layer is connected to the feedforward layer.
The inputting of the embedded input vector sequence into the encoder and the outputting of the encoded vector sequence corresponding to the image to be classified may be specifically implemented by:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Specifically, in this embodiment, as shown in (a) in fig. 2, after a category embedding vector and a position embedding vector are added to an image block embedding vector sequence respectively to generate an embedding input vector sequence, the embedding input vector sequence is input to a first normalization layer in a visual Transformer model encoding block to be normalized, so as to generate a processed embedding input vector sequence.
After the processed embedded input vector sequence is generated, it is input to the residual multi-head self-attention layer to generate a first vector sequence; the first vector sequence is then input to the second normalization layer and the feed-forward layer to generate a second vector sequence; finally, the second vector sequence is fed back through the first normalization layer for L loop iterations, obtaining the coding vector sequence corresponding to the image to be classified output by the feed-forward layer.
That is, when an embedded input vector sequence is input into an encoder of a visual Transformer model for encoding, the embedded input vector sequence is input into a first normalization layer of a first encoding block of the encoder, and a processed embedded input vector sequence is generated; then inputting the processed embedded input vector sequence into a residual multi-head self-attention layer to generate a first vector sequence; then inputting the first vector sequence into a second normalization layer and a feedforward layer;
then, the output of the feed-forward layer of the first coding block is used as the input of the first normalization layer of the second coding block, the output of the feed-forward layer of the second coding block is used as the input of the first normalization layer of the third coding block, and so on through all L coding blocks, until the output of the feed-forward layer of the (L-1)-th coding block is used as the input of the first normalization layer of the L-th Transformer coding block; the coding vector sequence corresponding to the image to be classified can then be obtained from the output of the feed-forward layer of the L-th Transformer coding block.
In the above embodiment, the embedded input vector sequence is input into the coding blocks, which are iterated in a loop; this effectively enhances information exchange between the coding blocks in the encoder and alleviates the network degradation problem of the visual Transformer model, so that inputting the embedded input vector sequence into the visual Transformer model yields a more comprehensive coding vector sequence corresponding to the image to be classified.
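The structure of one coding block can be sketched as follows. This is a hedged PyTorch sketch rather than the patent's implementation; it assumes the `ResidualMSA` module sketched later (after the residual multi-head self-attention discussion), and the dimensions and `mlp_ratio` expansion factor are illustrative.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer coding block: first normalization layer feeding the
    residual multi-head self-attention layer, then second normalization layer
    feeding the feed-forward layer, each wrapped in a skip connection as in
    FIG. 2 (a)."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)             # first normalization layer
        self.attn = ResidualMSA(dim, num_heads)    # sketched in a later block
        self.norm2 = nn.LayerNorm(dim)             # second normalization layer
        self.ffn = nn.Sequential(                  # two FCs with GELU in between
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z, att_prev):
        # the attention layer returns both the first vector sequence and the
        # self-attention map Att^(M) handed to the next coding block
        a, att = self.attn(self.norm1(z), att_prev)
        z = z + a                                  # residual connection
        z = z + self.ffn(self.norm2(z))            # second vector sequence
        return z, att
```

Chaining L such blocks in a loop, with `att_prev = None` for the first block, reproduces the L loop iterations described above.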
Optionally, in a possible implementation manner of the embodiment of the present invention, the residual multi-head self-attention layer includes a first linear layer, a weighted residual scaling dot product attention layer, and a second linear layer, where the weighted residual scaling dot product attention layer includes H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected to the self-attention map of the (M-1)-th weighted residual scaling dot product attention layer in a residual manner, and H is a positive integer;
the inputting of the processed embedded input vector sequence into the residual multi-headed self-attention layer to generate a first vector sequence may specifically be implemented in the following manner:
inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified;
inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate H self-attention results corresponding to the self-attention heads;
and splicing self-attention results corresponding to the H attention heads, and inputting the splicing results into the second linear layer to generate the first vector sequence.
Specifically, in this embodiment, the process of inputting the processed embedded input vector sequence into the residual multi-headed self-attention layer to generate the first vector sequence may specifically refer to (b) in fig. 2:
when the processed embedded input vector sequence is input into a residual multi-header self-attention layer, the processed embedded input vector sequence is input into a first linear layer for linear transformation, and an index vector sequence (Query, Q), a Key vector sequence (Key, K) and a Value vector sequence (Value, V) corresponding to an image to be classified are generated, wherein weight coefficients corresponding to Q, K, V are respectively W i Q ,W i K ,W i V
Q, K, V are then input into the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads; the weighted residual scaling dot product attention layer comprises H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected with that of the (M-1)-th in a residual manner, and H is a positive integer.
specifically, the output of the ith self-attention head can be expressed by the following formula (2):
head i =WRSA(QWi i Q ,KW i K ,VW i V ) (2)
wherein, i is 1,2,. and H; WRSA characterizes weighted residual self-attention outcomes (also known as weighted residual self-attention diagrams); the parameter W of each Q, K, V linear transformation is different.
After Q, K, V is input to the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads, the self-attention results corresponding to the H self-attention heads need to be spliced, and the spliced result is input to the second linear layer to generate the first vector sequence.
Specifically, the splicing of the self-attention results corresponding to the H attention heads can be expressed by the following formula (3):

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O$    (3)

where $W^O$ is the weight parameter of the second linear layer.
In the above embodiment, based on the self-attention map and the self-attention layer output map connected in a residual manner within the weighted residual scaling dot product attention layer, information exchange between coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; meanwhile, since the weighted residual scaling dot product attention layer comprises H self-attention heads, the visual Transformer model can use the H self-attention heads to learn information from different subspaces of the image to be classified, thereby improving the accuracy of image classification.
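Putting formulas (2) and (3) together with the residual attention connection, a hedged PyTorch sketch of the residual multi-head self-attention layer might look as follows; the fused QKV projection, the scalar `alpha` parameter, and all default sizes are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

class ResidualMSA(nn.Module):
    """Residual multi-head self-attention: a first linear layer producing
    Q, K, V, a weighted residual scaling dot product attention with H heads
    (formulas (2) and (4)), and a second linear layer W^O (formula (3))."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.h = num_heads
        self.d_k = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)            # first linear layer: W^Q, W^K, W^V
        self.out = nn.Linear(dim, dim)                # second linear layer: W^O
        self.alpha = nn.Parameter(torch.tensor(1.0))  # alpha_M, initialized to 1

    def forward(self, z, att_prev=None):              # z: (B, N+1, D)
        B, N, D = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        # split into H self-attention heads: (B, H, N, d_k)
        q, k, v = (t.view(B, N, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        # self-attention layer output map Att'^(M): scaled, softmax-normalized
        att_out = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        # weighted residual connection with the previous layer's attention map
        att = att_out if att_prev is None else att_out + self.alpha * att_prev
        heads = att @ v                               # per-head WRSA result, formula (4)
        out = heads.transpose(1, 2).reshape(B, N, D)  # splice (concat) the H heads
        return self.out(out), att                     # first vector sequence, Att^(M)
```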
Optionally, in a possible implementation manner of the embodiment of the present invention, the index vector sequence, the key vector sequence, and the value vector sequence after linear transformation are input into the weighted residual scaling dot product attention layer, so as to generate H self-attention results corresponding to the self-attention heads, which may specifically be implemented in the following manner:
for each self-attention head, generating an Mth-layer self-attention layer output map based on the index vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map;
and generating a self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
Specifically, in this embodiment, the process of inputting Q, K, V into the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads may specifically refer to (c) in fig. 2:
for each self-attention head, the Mth-layer self-attention layer output map is first generated based on Q and K.
Specifically, tensor multiplication is first carried out on Q and K, and the product is scaled; the scaled result is then normalized with the softmax (normalized exponential) function to generate the Mth-layer self-attention layer output map $\mathrm{Att}'^{(M)}$.
After the Mth-layer self-attention layer output map is generated, a target self-attention map needs to be generated based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map.
Specifically, after the Mth-layer self-attention layer output map $\mathrm{Att}'^{(M)}$ is generated, the (M-1)-th layer self-attention map $\mathrm{Att}^{(M-1)}$ is connected with $\mathrm{Att}'^{(M)}$ in a residual manner and the two are summed with a weight, obtaining the final target self-attention map of the Mth layer, $\mathrm{Att}^{(M)}$.
And finally, generating a self-attention result corresponding to each self-attention head based on the target self-attention diagram and the value vector sequence.
Specifically, the target self-attention map $\mathrm{Att}^{(M)}$ is multiplied (as a tensor product) with the value vector sequence V, generating the self-attention result corresponding to each self-attention head, which can be expressed by the following formula (4):

$\mathrm{WRSA}(Q, K, V) = \left(\mathrm{Att}'^{(M)} + \alpha_M\, \mathrm{Att}^{(M-1)}\right) V$    (4)

where $\mathrm{Att}'^{(M)} = \mathrm{softmax}\!\left(Q K^{\top} / \sqrt{d_k}\right)$; $d_k$ is the dimension of K and plays a scaling role so that the inner product of Q and K does not become too large, giving more stable gradients during training; $\alpha_M$ is a learnable constant, initialized as $\alpha_M = 1$.
In the above embodiment, based on the self-attention map $\mathrm{Att}^{(M-1)}$ and the self-attention layer output map $\mathrm{Att}'^{(M)}$ connected in a residual manner within the weighted residual scaling dot product attention layer, information exchange between the coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated, so that inputting the embedded input vector sequence into the visual Transformer model yields a more comprehensive coding vector sequence corresponding to the image to be classified.
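As a toy check of formula (4) on random tensors (the shapes and seed are arbitrary, not from the patent):

```python
import math
import torch

torch.manual_seed(0)
N, d_k = 4, 8                                         # toy sequence length and key dim
q, k, v = (torch.randn(N, d_k) for _ in range(3))
att_prev = torch.softmax(torch.randn(N, N), dim=-1)   # Att^(M-1) from the previous layer
alpha = torch.tensor(1.0)                             # alpha_M, initialized to 1

att_out = torch.softmax(q @ k.T / math.sqrt(d_k), dim=-1)  # Att'^(M)
att_target = att_out + alpha * att_prev                    # target self-attention map
result = att_target @ v                                    # WRSA(Q, K, V), formula (4)
print(result.shape)                                        # torch.Size([4, 8])
```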
Step 103, determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Specifically, in this embodiment, after the embedded input vector sequence is input to the encoder of the visual Transformer model to obtain the coded vector sequence corresponding to the image to be classified, the feature vector corresponding to the image to be classified needs to be determined based on the coded vector sequence, where the feature vector is a vector used for classifying the image to be classified.
Optionally, in a possible implementation manner of the embodiment of the present invention, the determining, based on the coded vector sequence, the feature vector corresponding to the image to be classified may specifically be implemented in the following manner:
and determining the category embedding vector as a feature vector corresponding to the image to be classified.
Specifically, in the present embodiment, the class embedding vector is determined as the feature vector corresponding to the image to be classified.
It can be understood that the class embedding vector is determined as the feature vector corresponding to the image to be classified because the class embedding vector in the coding vector sequence is learnable and, after encoding, contains the feature information of the whole image to be classified.
Therefore, in practical application, the class embedding vector can be determined as the feature vector corresponding to the image to be classified.
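In code, this amounts to taking the class-token position of the encoder output (variable names hypothetical):

```python
feature = encoded_sequence[:, 0]   # z_L^0: the class embedding as the image feature, shape (B, D)
```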
After determining the feature vector corresponding to the image to be classified, inputting the feature vector into a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Specifically, the classifier of the visual Transformer model consists of a normalization layer followed by two fully-connected layers. After the feature vector corresponding to the image to be classified is determined, the feature vector is input into the trained classifier, which then outputs the classification result of the image to be classified.
Optionally, in a possible implementation manner of the embodiment of the present invention, the classifier is obtained by training using a cross entropy loss function.
Specifically, in this embodiment, the feature vectors corresponding to the images to be classified are input to the classifier, and the classifier is trained with the cross entropy loss function until the loss value reaches a preset threshold, which indicates that the training of the classifier is complete.
In practical applications, the cross entropy loss function can be expressed by the following equation (5):

$L = -\sum_i \left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right]$    (5)

where L denotes the cross entropy loss value; $y_i$ is the label of sample i, with 1 for the positive class and 0 for the negative class; and $p_i$ is the probability that sample i is predicted as the positive class.
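A minimal training sketch for the classification head under these definitions follows; the hidden width, the GELU between the two fully-connected layers, the optimizer, and the binary (two-class) setting of equation (5) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Normalization layer followed by two fully-connected layers, as described
    above; the activation between the FCs is an assumption."""
    def __init__(self, dim=768, hidden=256, num_classes=1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, feat):                 # feat: class-token feature z_L^0, (B, D)
        return self.fc2(F.gelu(self.fc1(self.norm(feat))))

head = ClassificationHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

def train_step(features, labels):
    """features: (B, D) class-token vectors from the encoder; labels: (B,) in {0, 1}."""
    p = torch.sigmoid(head(features).squeeze(-1))       # p_i: probability of positive class
    loss = F.binary_cross_entropy(p, labels.float())    # equation (5), averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```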
In the above embodiment, the feature vector corresponding to the image to be classified is determined based on the coding vector sequence, and the feature vector is input to the classifier of the visual Transformer model, so that the accuracy of image classification can be effectively improved.
The image classification method provided by the invention inputs the embedded input vector sequence into an encoder of a visual Transformer model comprising residual multi-head self-attention layers, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner. Based on the self-attention map and self-attention layer output map connected in this residual manner, information exchange among coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; inputting the embedded input vector sequence into the visual Transformer model thus yields a more comprehensive coding vector sequence corresponding to the image to be classified, and classifying the image based on this coding vector sequence can effectively improve the accuracy of image classification.
Referring to fig. 3, fig. 3 is a second schematic flow chart of the image classification method provided by the present invention, which specifically includes steps 301 to 310:
step 301, obtaining an image to be classified.
Step 302, splitting an image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks; and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate an embedding input vector sequence.
And step 303, inputting the embedded input vector sequence into a first normalization layer in a visual Transformer model coding block for normalization processing, and generating a processed embedded input vector sequence.
In the residual multi-head self-attention layer, steps 304 to 308 are performed:
and step 304, inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified.
And 305, generating an M-th layer self-attention layer output graph for each self-attention head in the weighted residual scaling dot product attention layer based on the index vector sequence and the key vector sequence, wherein M is greater than or equal to 2 and less than or equal to L.
And step 306, generating a target self-attention diagram based on the M-1 layer self-attention diagram and the M layer self-attention layer output diagram.
And 307, generating a self-attention result corresponding to each self-attention head based on the target self-attention diagram and the value vector sequence.
And 308, splicing self-attention results corresponding to the H attention heads, inputting the spliced results into a second linear layer, and generating a first vector sequence.
Step 309, inputting the first vector sequence into a second normalization layer and a feedforward layer to generate a second vector sequence; and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Step 310, determining the category embedded vector as a feature vector corresponding to the image to be classified; and inputting the characteristic vector to a classifier of a visual transform model to obtain a classification result of the image to be classified.
The following describes the image classification apparatus provided by the present invention, and the image classification apparatus described below and the image classification method described above may be referred to in correspondence with each other.
Fig. 4 is a schematic structural diagram of an image classification apparatus 400 provided in the present invention; the apparatus includes:
The pre-processing module 401 is configured to obtain an image to be classified, and pre-process the image to be classified to obtain an embedded input vector sequence;
an encoding module 402, configured to input the embedded input vector sequence to an encoder of a visual Transformer model and output a coding vector sequence corresponding to the image to be classified, where the encoder includes L Transformer coding blocks, each coding block includes a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected to the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L;
a classifying module 403, configured to determine a feature vector corresponding to the image to be classified based on the coding vector sequence, and input the feature vector to a classifier of a visual Transformer model to obtain a classification result of the image to be classified.
The image classification apparatus provided by the invention inputs the embedded input vector sequence into an encoder of a visual Transformer model comprising residual multi-head self-attention layers, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner. Based on the self-attention map and self-attention layer output map connected in this residual manner, information exchange among coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; inputting the embedded input vector sequence into the visual Transformer model thus yields a more comprehensive coding vector sequence corresponding to the image to be classified, and classifying the image based on this coding vector sequence can effectively improve the accuracy of image classification.
Optionally, the coding block further includes a first normalization layer, a second normalization layer, and a feed-forward layer;
an encoding module 402, further configured to:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Optionally, the residual multi-head self-attention layer includes a first linear layer, a weighted residual scaling dot product attention layer, and a second linear layer, where the weighted residual scaling dot product attention layer includes H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected with the self-attention map of the (M-1)-th weighted residual scaling dot product attention layer in a residual manner, and H is a positive integer;
an encoding module 402, further configured to:
inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified;
inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate H self-attention results corresponding to the self-attention heads;
and splicing self-attention results corresponding to the H attention heads, and inputting the splicing results into the second linear layer to generate the first vector sequence.
Optionally, the encoding module 402 is further configured to:
for each self-attention head, generating an Mth-layer self-attention layer output map based on the index vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map;
and generating a self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
Optionally, the preprocessing module 401 is further configured to:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate the embedding input vector sequence.
Optionally, the classification module 403 is further configured to:
and determining the category embedding vector as a feature vector corresponding to the image to be classified.
Optionally, the classifier is obtained by training with a cross entropy loss function.
Fig. 5 is a schematic physical structure diagram of an electronic device 500 provided in the present invention. As shown in fig. 5, the electronic device may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform an image classification method comprising: acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the image classification method provided by the above methods, the method comprising: acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image classification method provided by the above methods, the method comprising: acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. An image classification method, comprising:
acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence;
inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the M-th coding block is residually connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block, L is a positive integer, and 2 ≤ M ≤ L;
determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector into a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
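For orientation, the claimed pipeline can be read as a three-step forward pass. The Python sketch below is an illustrative reading of claim 1, not the patent's implementation; `patch_embed`, `encoder`, and `classifier` are hypothetical components, sketched under claims 5, 2 to 4, and 7 respectively.

```python
def classify(image, patch_embed, encoder, classifier):
    """Illustrative forward pass for claim 1 (all component names assumed)."""
    x = patch_embed(image)      # embedded input vector sequence (claim 5)
    z = encoder(x)              # coding vector sequence from L coding blocks (claims 2-4)
    feature = z[:, 0]           # class-embedding position as the feature vector (claim 6)
    return classifier(feature)  # classification result of the image
```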
2. The image classification method according to claim 1, wherein the coding block further includes a first normalization layer, a second normalization layer, and a feed-forward layer;
the inputting the embedded input vector sequence into the encoder and outputting the coding vector sequence corresponding to the image to be classified includes:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
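Read concretely, claim 2 describes a pre-normalization Transformer block iterated L times. The PyTorch sketch below is one plausible wiring under that reading; it reuses the `ResidualMultiHeadSelfAttention` module sketched after claim 3 below (the two snippets are meant to be run together), and the skip connections and hidden sizes are assumptions borrowed from common ViT practice rather than fixed by the claim.

```python
import torch.nn as nn

class CodingBlock(nn.Module):
    """One Transformer coding block per claim 2: first normalization layer,
    residual multi-head self-attention layer, second normalization layer,
    feed-forward layer (pre-norm wiring assumed)."""

    def __init__(self, dim, num_heads, mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # first normalization layer
        self.attn = ResidualMultiHeadSelfAttention(dim, num_heads)  # claim 3 sketch
        self.norm2 = nn.LayerNorm(dim)   # second normalization layer
        self.ffn = nn.Sequential(        # feed-forward layer
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x, prev_scores=None):
        # First vector sequence: residual MHSA over the normalized input.
        attn_out, scores = self.attn(self.norm1(x), prev_scores)
        x = x + attn_out
        # Second vector sequence: feed-forward over the normalized result.
        x = x + self.ffn(self.norm2(x))
        return x, scores

class Encoder(nn.Module):
    """L coding blocks looped in sequence; block M receives the
    self-attention map of block M-1 (the claimed residual connection)."""

    def __init__(self, dim, num_heads, mlp_dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            [CodingBlock(dim, num_heads, mlp_dim) for _ in range(depth)])

    def forward(self, x):
        scores = None                # block 1 has no predecessor map
        for block in self.blocks:    # iterate L times
            x, scores = block(x, scores)
        return x                     # coding vector sequence
```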
3. The image classification method according to claim 2, wherein the residual multi-head self-attention layer comprises a first linear layer, a weighted residual scaled dot-product attention layer, and a second linear layer, the weighted residual scaled dot-product attention layer comprises H self-attention heads, the self-attention layer output map of the M-th weighted residual scaled dot-product attention layer is residually connected with the self-attention map of the (M-1)-th weighted residual scaled dot-product attention layer, and H is a positive integer;
wherein the inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence comprises:
inputting the processed embedded input vector sequence into the first linear layer to generate a query vector sequence, a key vector sequence, and a value vector sequence corresponding to the image to be classified;
inputting the query vector sequence, the key vector sequence, and the value vector sequence into the weighted residual scaled dot-product attention layer to generate H self-attention results corresponding to the self-attention heads;
and concatenating the self-attention results corresponding to the H self-attention heads, and inputting the concatenated result into the second linear layer to generate the first vector sequence.
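A minimal PyTorch sketch of the residual multi-head self-attention layer of claim 3. Fusing the query/key/value projections into a single first linear layer and making the per-head residual weight `alpha` a learnable parameter are implementation assumptions; the claim only requires the two linear layers, H heads, and the residually connected attention maps.

```python
import torch
import torch.nn as nn

class ResidualMultiHeadSelfAttention(nn.Module):
    """Residual multi-head self-attention layer per claim 3 (sketch)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # first linear layer: Q, K, V
        self.proj = nn.Linear(dim, dim)      # second linear layer
        self.alpha = nn.Parameter(torch.ones(num_heads))  # residual weights (assumed)

    def forward(self, x, prev_scores=None):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, H, n, head_dim)
        # M-th-layer self-attention output map for every head.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if prev_scores is not None:
            # Weighted residual connection to the (M-1)-th-layer map.
            scores = scores + self.alpha.view(1, -1, 1, 1) * prev_scores
        attn = scores.softmax(dim=-1)
        out = attn @ v                              # H self-attention results
        out = out.transpose(1, 2).reshape(b, n, d)  # concatenate the heads
        return self.proj(out), scores               # scores feed block M+1
```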
4. The image classification method according to claim 3, wherein the inputting the query vector sequence, the key vector sequence, and the value vector sequence into the weighted residual scaled dot-product attention layer to generate H self-attention results corresponding to the self-attention heads comprises:
for each self-attention head, generating an M-th-layer self-attention layer output map based on the query vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th-layer self-attention map and the M-th-layer self-attention layer output map;
and generating the self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
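The three generating steps of claim 4 map directly onto a few tensor operations for a single head. In the sketch below, the scalar residual weight `alpha` is an assumption (the claim says the combination is weighted but does not fix its form), and the pre-softmax combination of score maps follows the same convention as the module above.

```python
import math
import torch
import torch.nn.functional as F

def weighted_residual_attention(q, k, v, prev_map=None, alpha=1.0):
    """Claim 4 for one self-attention head. q, k, v: (n, d_head) tensors;
    prev_map: the (M-1)-th-layer self-attention map, None in block 1."""
    # Step 1: M-th-layer self-attention layer output map from Q and K.
    out_map = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Step 2: target self-attention map as a weighted residual combination.
    target_map = out_map if prev_map is None else out_map + alpha * prev_map
    # Step 3: self-attention result from the target map and the values.
    result = F.softmax(target_map, dim=-1) @ v
    return result, target_map  # the target map is passed on to layer M+1
```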
5. The image classification method according to claim 1, wherein the preprocessing the image to be classified to obtain an embedded input vector sequence comprises:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedding vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector, respectively, to the image block embedding vector sequence to generate the embedded input vector sequence.
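One common way to realize claim 5 is a strided convolution that simultaneously splits the image into blocks and embeds them; that choice, and the concrete image, patch, and embedding sizes below, are assumptions, not requirements of the claim.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Preprocessing per claim 5 (sketch with assumed default sizes)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # category embedding vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # position embeddings

    def forward(self, img):                            # img: (b, c, h, w)
        x = self.proj(img).flatten(2).transpose(1, 2)  # image block embedding vector sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend the category embedding
        return x + self.pos_embed                      # embedded input vector sequence
```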
6. The image classification method according to claim 5, wherein the determining the feature vector corresponding to the image to be classified based on the coding vector sequence comprises:
and determining the category embedding vector as the feature vector corresponding to the image to be classified.
7. The image classification method according to claim 1, wherein the classifier is trained using a cross-entropy loss function.
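Claim 7 fixes only the loss function. Below is a minimal training step under the assumptions that `model` maps an image batch to class logits (as in the sketch after claim 1) and that `loader` yields `(images, labels)` batches.

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of training with cross-entropy loss (claim 7, sketch)."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(model(images), labels)  # cross-entropy on logits
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```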
8. An image classification apparatus, comprising:
the image classification device comprises a preprocessing module, a classification module and a classification module, wherein the preprocessing module is used for acquiring an image to be classified and preprocessing the image to be classified to obtain an embedded input vector sequence;
the encoding module is used for inputting the embedded input vector sequence into an encoder of a visual transform model and outputting an encoded vector sequence corresponding to the image to be classified, wherein the encoder comprises L transform encoding blocks, each encoding block comprises a residual multi-head self-attention layer, a self-attention layer output graph corresponding to the residual multi-head self-attention layer of the Mth encoding block is connected with a self-attention graph corresponding to the residual multi-head self-attention layer of the M-1 th encoding block in a residual mode, L is a positive integer, M is more than or equal to 2 and is less than or equal to L;
and the classification module is used for determining a feature vector corresponding to the image to be classified based on the coding vector sequence, inputting the feature vector to a classifier of a visual Transformer model, and obtaining a classification result of the image to be classified.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image classification method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the image classification method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the image classification method according to any one of claims 1 to 7.
CN202210681224.8A 2022-06-15 2022-06-15 Image classification method and device Pending CN115131607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210681224.8A CN115131607A (en) 2022-06-15 2022-06-15 Image classification method and device

Publications (1)

Publication Number Publication Date
CN115131607A true CN115131607A (en) 2022-09-30

Family

ID=83377601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210681224.8A Pending CN115131607A (en) 2022-06-15 2022-06-15 Image classification method and device

Country Status (1)

Country Link
CN (1) CN115131607A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN113469283A (en) * 2021-07-23 2021-10-01 山东力聚机器人科技股份有限公司 Image classification method, and training method and device of image classification model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645566A (en) * 2023-07-21 2023-08-25 中国科学院自动化研究所 Classification method based on full-addition pulse type transducer
CN116645566B (en) * 2023-07-21 2023-10-31 中国科学院自动化研究所 Classification method based on full-addition pulse type transducer
CN117036832A (en) * 2023-10-09 2023-11-10 之江实验室 Image classification method, device and medium based on random multi-scale blocking
CN117036832B (en) * 2023-10-09 2024-01-05 之江实验室 Image classification method, device and medium based on random multi-scale blocking

Similar Documents

Publication Publication Date Title
Parmar et al. Image transformer
US11488009B2 (en) Deep learning-based splice site classification
Dinh et al. Density estimation using real nvp
CN111079532B (en) Video content description method based on text self-encoder
WO2022017245A1 (en) Text recognition network, neural network training method, and related device
US10579923B2 (en) Learning of classification model
CN115131607A (en) Image classification method and device
CN112529150A (en) Model structure, model training method, image enhancement method and device
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN110728297B (en) Low-cost antagonistic network attack sample generation method based on GAN
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
Flenner et al. A deep non-negative matrix factorization neural network
Gallant et al. Positional binding with distributed representations
CN114418030A (en) Image classification method, and training method and device of image classification model
CN114782291B (en) Training method and device of image generator, electronic equipment and readable storage medium
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
CN114565789B (en) Text detection method, system, device and medium based on set prediction
Li et al. Image operation chain detection with machine translation framework
CN114581918A (en) Text recognition model training method and device
WO2023068953A1 (en) Attention-based method for deep point cloud compression
CN113869005A (en) Pre-training model method and system based on sentence similarity
CN116168394A (en) Image text recognition method and device
CN115862015A (en) Training method and device of character recognition system, and character recognition method and device
CN115471576A (en) Point cloud lossless compression method and device based on deep learning
CN115273110A (en) Text recognition model deployment method, device, equipment and storage medium based on TensorRT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination