CN115131607A - Image classification method and device

Info

Publication number
CN115131607A
Authority
CN
China
Prior art keywords: vector sequence, image, attention, self, layer
Legal status: Pending
Application number
CN202210681224.8A
Other languages
Chinese (zh)
Inventor
祖宝开
李建强
王宏远
李亚芳
白建川
Current Assignee: Beijing University of Technology; CETC 15 Research Institute
Original Assignee: Beijing University of Technology; CETC 15 Research Institute
Application filed by Beijing University of Technology and CETC 15 Research Institute
Priority to CN202210681224.8A
Publication of CN115131607A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an image classification method and device. The image classification method comprises the following steps: acquiring an image to be classified, and preprocessing it to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector into a classifier of the visual Transformer model to obtain a classification result of the image to be classified. By the image classification method, the accuracy of image classification can be improved.

Description

Image classification method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an image classification method and device.
Background
With the development of artificial intelligence technology, the Vision Transformer (ViT) model is widely applied in the field of image processing. The ViT model can achieve superior performance on many visual tasks, such as image classification.
Because using the ViT model for image classification requires a large-scale dataset to train it, research has focused on designing more complex model architectures with more layers to improve the ViT model's efficiency in processing data.
However, as the number of layers in the ViT model increases, the accuracy of image classification drops sharply. How to improve the accuracy of image classification based on the ViT model is therefore an important problem to be solved in the industry.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides an image classification method and device.
The invention provides an image classification method, which comprises the following steps:
acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence;
inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L;
determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Optionally, the coding block further includes a first normalization layer, a second normalization layer, and a feed-forward layer;
the inputting the embedded input vector sequence into the encoder and outputting the coding vector sequence corresponding to the image to be classified includes:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Optionally, the residual multi-head self-attention layer includes a first linear layer, a weighted residual scaling dot product attention layer, and a second linear layer, where the weighted residual scaling dot product attention layer includes H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected with the self-attention map of the (M-1)-th weighted residual scaling dot product attention layer in a residual manner, and H is a positive integer;
the inputting the processed embedded input vector sequence into the residual multi-headed self-attention layer to generate a first vector sequence, including:
inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified;
inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate H self-attention results corresponding to the self-attention heads;
and splicing self-attention results corresponding to the H attention heads, and inputting the splicing results into the second linear layer to generate the first vector sequence.
Optionally, the inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads includes:
for each self-attention head, generating an Mth-layer self-attention layer output map based on the index vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map;
and generating a self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
Optionally, the preprocessing the image to be classified to obtain an embedded input vector sequence includes:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate the embedding input vector sequence.
Optionally, the determining, based on the coding vector sequence, a feature vector corresponding to the image to be classified includes:
and determining the category embedding vector as a feature vector corresponding to the image to be classified.
Optionally, the classifier is obtained by training with a cross entropy loss function.
The present invention also provides an image classification apparatus, comprising:
the system comprises a preprocessing module, a classification module and a classification module, wherein the preprocessing module is used for acquiring an image to be classified and preprocessing the image to be classified to obtain an embedded input vector sequence;
an encoding module, configured to input the embedded input vector sequence into an encoder of a visual Transformer model and output a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L;
and the classification module is used for determining a feature vector corresponding to the image to be classified based on the coding vector sequence, inputting the feature vector to a classifier of a visual Transformer model, and obtaining a classification result of the image to be classified.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the image classification method as described in any of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image classification method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image classification method as described in any one of the above.
The image classification method and device provided by the invention input the embedded input vector sequence into an encoder of a visual Transformer model comprising residual multi-head self-attention layers, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner. Based on the self-attention map and self-attention layer output map connected in this residual manner, information exchange among coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; inputting the embedded input vector sequence into the visual Transformer model thus yields a more comprehensive coding vector sequence corresponding to the image to be classified, and classifying the image based on this coding vector sequence can effectively improve the accuracy of image classification.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of an image classification method according to the present invention;
FIG. 2 is a schematic structural diagram of the visual Transformer model provided in the present invention;
FIG. 3 is a second flowchart of the image classification method according to the present invention;
FIG. 4 is a schematic structural diagram of an image classification apparatus provided in the present invention;
fig. 5 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the development of artificial intelligence technology, the ViT model is widely applied in the field of image processing. Unlike Convolutional Neural Networks (CNNs), which rely on convolution to process local features, the ViT model uses a self-attention mechanism to establish relationships between image block embeddings (Tokens), and this ability to aggregate global information greatly improves the performance of the ViT model. Transformers can achieve good performance on many visual tasks, including image classification, image enhancement, object detection, and video processing.
However, while the ViT model is able to model global information and is more flexible than CNN models in learning image representations, its larger capacity also means that the ViT model requires larger-scale datasets for pre-training. Researchers have therefore improved the data efficiency of the ViT model by designing more complex network architectures or training methods.
Notably, the self-attention (SA) mechanism of the ViT model is a key factor in the ability of the ViT model to aggregate global information. In the self-attention SA module of the ViT model, each Token is updated according to the self-attention map aggregating the features of all Tokens, and in this way, information can be sufficiently exchanged between Tokens, thereby providing a strong expression capability.
However, as the number of layers of the ViT model increases, the ViT model has a network degradation problem, and the accuracy of image classification based on the ViT model is greatly reduced.
Based on the above problems, and in order to enhance information exchange between the layers of the ViT model, alleviate the network degradation problem, and thereby improve the accuracy of ViT-based image classification, the invention provides an image classification method that can effectively enhance information exchange between ViT model layers, alleviate the network degradation problem of the visual Transformer model, and thus effectively improve the accuracy of image classification.
The image classification method provided by the present invention is described in detail below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a schematic flow chart of an image classification method provided by the present invention, and specifically includes steps 101 to 103.
Step 101, obtaining an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence.
Specifically, in this embodiment, the image to be classified is an image that needs to be classified, and it may take any of a variety of formats, for example jpg, png, tif, or pdf.
After the image to be classified is acquired, preprocessing is carried out on the image to be classified to obtain an embedded input vector sequence, wherein the embedded input vector sequence is a sequence input to a visual Transformer model.
Optionally, the image to be classified is preprocessed to obtain an embedded input vector sequence, which may be specifically implemented in the following manner:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate the embedding input vector sequence.
Specifically, in this embodiment, after the image to be classified is acquired, the image to be classified needs to be split into a plurality of image blocks, and an image block embedded vector sequence is generated based on the image blocks.
In practice, given an image to be classified $x \in \mathbb{R}^{H \times W \times C}$, the image is first split into N square image patches of size P × P, and the patches are reshaped into a sequence, i.e. the patch sequence can be expressed as $x_p \in \mathbb{R}^{N \times (P^2 C)}$.
Here H and W are the height and width of the image to be classified, i.e. (H, W) is the resolution of the image; (P, P) is the resolution of each patch; C is the number of image channels (e.g., C = 3); and N is the number of patches, given by the following formula (1):

$N = HW / P^2$    (1)

For example, a 224 × 224 RGB image with P = 16 yields N = 196 patches, each of flattened dimension $P^2 C = 768$. After reshaping the N patches into a sequence, the patch sequence $x_p \in \mathbb{R}^{N \times (P^2 C)}$ is mapped to a lower dimension using a linear projection layer $E$, i.e. each $P^2 C$-dimensional patch is mapped to D dimensions, generating the image block embedded vector sequence $x_p E \in \mathbb{R}^{N \times D}$.
After an image block embedded vector sequence is generated, a category embedded vector and a position embedded vector need to be added to the image block embedded vector sequence respectively, and then an embedded input vector sequence is generated, wherein the category embedded vector is used for classifying images to be classified, and the position embedded vector is used for representing spatial position information between image blocks.
For example, for a given length-N image block embedded vector sequence, a one-dimensional learnable embedding vector $x_{class}$ of dimension D is added at the head position of the sequence as the class embedding vector used for classification, generating an image block embedded vector sequence in $\mathbb{R}^{(N+1) \times D}$ whose total length is N + 1; this learnable embedding vector is randomly initialized at training.
It should be noted that the output feature corresponding to this class embedding vector $x_{class}$ after the sequence is input to the visual Transformer model, denoted $z_L^0$, is used as the image representation; that is to say, inputting $z_L^0$ into a classifier (also called the classification head) of the visual Transformer model classifies the image.
After the class embedding vector has been added to the image block embedding vector sequence, a position embedding vector is further added to the sequence to generate the embedded input vector sequence.
Specifically, a position embedding vector $E_{pos}$ may be added to the image block embedding vector sequence to preserve the spatial position information between the input image blocks. Here, the image block embedding vectors and the position embedding vectors are added element-wise, using standard learnable 1-D position encodings.
That is, the embedded input vector sequence $z_0$ is constructed from the image block embedded vector sequence $x_p E$, the class embedding vector $x_{class}$, and the position embedding vector $E_{pos}$; specifically, $z_0$ can be expressed as follows:

$z_0 = [x_{class};\ x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E] + E_{pos}$

where $E \in \mathbb{R}^{(P^2 C) \times D}$ and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$.
in the above embodiment, a class embedding vector is added to the image block embedding vector sequence, so that the classification of the image to be classified can be realized; the position embedded vectors are added according to the image block embedded vector sequence, and the spatial position information among the image blocks can be represented, so that the accuracy of image classification is improved.
Step 102, inputting the embedded input vector sequence into an encoder of the visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L.
Specifically, in this embodiment, after the image to be classified is preprocessed to obtain the embedded input vector sequence, the embedded input vector sequence needs to be input to an encoder of the visual Transformer model, so as to output the encoding vector sequence corresponding to the image to be classified.
It should be noted that the encoder of the visual Transformer model includes L Transformer coding blocks, each coding block includes a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner.
In practical application, residual connection is an effective strategy in a deep network, and the connection mode can enhance information exchange between model layers and reduce the problem of network degradation in the process of deep network learning.
Specifically, referring to fig. 2, fig. 2 is a schematic structural diagram of a visual transform model provided in the present invention.
In fig. 2, (a) shows a schematic structural diagram of the visual Transformer model. In (a), the visual Transformer model is divided into a Transformer encoder and a classifier, where the encoder includes L Transformer coding blocks and each coding block includes a residual multi-head self-attention layer (MSA) and a feed-forward neural network layer (i.e. the feed-forward layer, FFN), each preceded by a normalization layer and wrapped in a residual connection. Note that the FFN layer includes two fully-connected layers (FCs): the first FC transforms the feature dimension and the second FC restores it to the dimension before the change, with a Gaussian Error Linear Unit (GELU) as the non-linear activation function in between.
Optionally, in a possible implementation manner of the embodiment of the present invention, the encoder of the visual Transformer model further includes a first normalization layer, a second normalization layer, and a feedforward layer, and particularly, as shown in fig. 2 (a), the first normalization layer is connected to the residual multi-headed self-attention layer, and the second normalization layer is connected to the feedforward layer.
The inputting of the embedded input vector sequence into the encoder and the outputting of the encoded vector sequence corresponding to the image to be classified may be specifically implemented by:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Specifically, in this embodiment, as shown in (a) in fig. 2, after a category embedding vector and a position embedding vector are added to an image block embedding vector sequence respectively to generate an embedding input vector sequence, the embedding input vector sequence is input to a first normalization layer in a visual Transformer model encoding block to be normalized, so as to generate a processed embedding input vector sequence.
After the processed embedded input vector sequence is generated, it is input to the residual multi-head self-attention layer to generate a first vector sequence; the first vector sequence is then input to the second normalization layer and the feed-forward layer to generate a second vector sequence; finally, the second vector sequence is fed back through the first normalization layer for L loop iterations, obtaining the coding vector sequence corresponding to the image to be classified output by the feed-forward layer.
That is, when an embedded input vector sequence is input into an encoder of a visual Transformer model for encoding, the embedded input vector sequence is input into a first normalization layer of a first encoding block of the encoder, and a processed embedded input vector sequence is generated; then inputting the processed embedded input vector sequence into a residual multi-head self-attention layer to generate a first vector sequence; then inputting the first vector sequence into a second normalization layer and a feedforward layer;
then, the output of the feed-forward layer of the first coding block is used as the input of the first normalization layer of the second coding block, the output of the feed-forward layer of the second coding block is used as the input of the first normalization layer of the third coding block, and so on through all L coding blocks, until the output of the feed-forward layer of the (L-1)-th coding block is used as the input of the first normalization layer of the L-th Transformer coding block; the coding vector sequence corresponding to the image to be classified can then be obtained from the output of the feed-forward layer of the L-th Transformer coding block.
In the above embodiment, the embedded input vector sequence is input into the coding blocks, which are iterated in a loop; this effectively enhances information exchange between the coding blocks in the encoder and alleviates the network degradation problem of the visual Transformer model, so that inputting the embedded input vector sequence into the visual Transformer model yields a more comprehensive coding vector sequence corresponding to the image to be classified.
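The structure of one coding block can be sketched as follows. This is a hedged PyTorch sketch rather than the patent's implementation; it assumes the `ResidualMSA` module sketched later (after the residual multi-head self-attention discussion), and the dimensions and `mlp_ratio` expansion factor are illustrative.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer coding block: first normalization layer feeding the
    residual multi-head self-attention layer, then second normalization layer
    feeding the feed-forward layer, each wrapped in a skip connection as in
    FIG. 2 (a)."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)             # first normalization layer
        self.attn = ResidualMSA(dim, num_heads)    # sketched in a later block
        self.norm2 = nn.LayerNorm(dim)             # second normalization layer
        self.ffn = nn.Sequential(                  # two FCs with GELU in between
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z, att_prev):
        # the attention layer returns both the first vector sequence and the
        # self-attention map Att^(M) handed to the next coding block
        a, att = self.attn(self.norm1(z), att_prev)
        z = z + a                                  # residual connection
        z = z + self.ffn(self.norm2(z))            # second vector sequence
        return z, att
```

Chaining L such blocks in a loop, with `att_prev = None` for the first block, reproduces the L loop iterations described above.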
Optionally, in a possible implementation manner of the embodiment of the present invention, the residual multi-head self-attention layer includes a first linear layer, a weighted residual scaling dot product attention layer, and a second linear layer, where the weighted residual scaling dot product attention layer includes H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected to the self-attention map of the (M-1)-th weighted residual scaling dot product attention layer in a residual manner, and H is a positive integer;
the inputting of the processed embedded input vector sequence into the residual multi-headed self-attention layer to generate a first vector sequence may specifically be implemented in the following manner:
inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified;
inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate H self-attention results corresponding to the self-attention heads;
and splicing self-attention results corresponding to the H attention heads, and inputting the splicing results into the second linear layer to generate the first vector sequence.
Specifically, in this embodiment, the process of inputting the processed embedded input vector sequence into the residual multi-headed self-attention layer to generate the first vector sequence may specifically refer to (b) in fig. 2:
when the processed embedded input vector sequence is input into a residual multi-header self-attention layer, the processed embedded input vector sequence is input into a first linear layer for linear transformation, and an index vector sequence (Query, Q), a Key vector sequence (Key, K) and a Value vector sequence (Value, V) corresponding to an image to be classified are generated, wherein weight coefficients corresponding to Q, K, V are respectively W i Q ,W i K ,W i V
Q, K, V are then input into the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads; the weighted residual scaling dot product attention layer comprises H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected with that of the (M-1)-th in a residual manner, and H is a positive integer.
specifically, the output of the ith self-attention head can be expressed by the following formula (2):
head i =WRSA(QWi i Q ,KW i K ,VW i V ) (2)
wherein, i is 1,2,. and H; WRSA characterizes weighted residual self-attention outcomes (also known as weighted residual self-attention diagrams); the parameter W of each Q, K, V linear transformation is different.
After Q, K, V is input to the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads, the self-attention results corresponding to the H self-attention heads need to be spliced, and the spliced result is input to the second linear layer to generate the first vector sequence.
Specifically, the splicing of the self-attention results corresponding to the H attention heads can be expressed by the following formula (3):

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O$    (3)

where $W^O$ is the weight parameter of the second linear layer.
In the above embodiment, based on the self-attention map and the self-attention layer output map connected in a residual manner within the weighted residual scaling dot product attention layer, information exchange between coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; meanwhile, since the weighted residual scaling dot product attention layer comprises H self-attention heads, the visual Transformer model can use the H self-attention heads to learn information from different subspaces of the image to be classified, thereby improving the accuracy of image classification.
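Putting formulas (2) and (3) together with the residual attention connection, a hedged PyTorch sketch of the residual multi-head self-attention layer might look as follows; the fused QKV projection, the scalar `alpha` parameter, and all default sizes are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

class ResidualMSA(nn.Module):
    """Residual multi-head self-attention: a first linear layer producing
    Q, K, V, a weighted residual scaling dot product attention with H heads
    (formulas (2) and (4)), and a second linear layer W^O (formula (3))."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.h = num_heads
        self.d_k = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)            # first linear layer: W^Q, W^K, W^V
        self.out = nn.Linear(dim, dim)                # second linear layer: W^O
        self.alpha = nn.Parameter(torch.tensor(1.0))  # alpha_M, initialized to 1

    def forward(self, z, att_prev=None):              # z: (B, N+1, D)
        B, N, D = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        # split into H self-attention heads: (B, H, N, d_k)
        q, k, v = (t.view(B, N, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        # self-attention layer output map Att'^(M): scaled, softmax-normalized
        att_out = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        # weighted residual connection with the previous layer's attention map
        att = att_out if att_prev is None else att_out + self.alpha * att_prev
        heads = att @ v                               # per-head WRSA result, formula (4)
        out = heads.transpose(1, 2).reshape(B, N, D)  # splice (concat) the H heads
        return self.out(out), att                     # first vector sequence, Att^(M)
```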
Optionally, in a possible implementation manner of the embodiment of the present invention, the index vector sequence, the key vector sequence, and the value vector sequence after linear transformation are input into the weighted residual scaling dot product attention layer, so as to generate H self-attention results corresponding to the self-attention heads, which may specifically be implemented in the following manner:
for each self-attention head, generating an Mth-layer self-attention layer output map based on the index vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map;
and generating a self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
Specifically, in this embodiment, the process of inputting Q, K, V into the weighted residual scaling dot product attention layer to generate the self-attention results corresponding to the H self-attention heads may specifically refer to (c) in fig. 2:
for each self-attention head, the Mth-layer self-attention layer output map is first generated based on Q and K.
Specifically, tensor multiplication is first carried out on Q and K, and the product is scaled; the scaled result is then normalized with the softmax (normalized exponential) function to generate the Mth-layer self-attention layer output map $\mathrm{Att}'^{(M)}$.
After the Mth-layer self-attention layer output map is generated, a target self-attention map needs to be generated based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map.
Specifically, after the Mth-layer self-attention layer output map $\mathrm{Att}'^{(M)}$ is generated, the (M-1)-th layer self-attention map $\mathrm{Att}^{(M-1)}$ is connected with $\mathrm{Att}'^{(M)}$ in a residual manner and the two are summed with a weight, obtaining the final target self-attention map of the Mth layer, $\mathrm{Att}^{(M)}$.
And finally, generating a self-attention result corresponding to each self-attention head based on the target self-attention diagram and the value vector sequence.
Specifically, the target self-attention map $\mathrm{Att}^{(M)}$ is multiplied (as a tensor product) with the value vector sequence V, generating the self-attention result corresponding to each self-attention head, which can be expressed by the following formula (4):

$\mathrm{WRSA}(Q, K, V) = \left(\mathrm{Att}'^{(M)} + \alpha_M\, \mathrm{Att}^{(M-1)}\right) V$    (4)

where $\mathrm{Att}'^{(M)} = \mathrm{softmax}\!\left(Q K^{\top} / \sqrt{d_k}\right)$; $d_k$ is the dimension of K and plays a scaling role so that the inner product of Q and K does not become too large, giving more stable gradients during training; $\alpha_M$ is a learnable constant, initialized as $\alpha_M = 1$.
In the above embodiment, based on the self-attention map $\mathrm{Att}^{(M-1)}$ and the self-attention layer output map $\mathrm{Att}'^{(M)}$ connected in a residual manner within the weighted residual scaling dot product attention layer, information exchange between the coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated, so that inputting the embedded input vector sequence into the visual Transformer model yields a more comprehensive coding vector sequence corresponding to the image to be classified.
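As a toy check of formula (4) on random tensors (the shapes and seed are arbitrary, not from the patent):

```python
import math
import torch

torch.manual_seed(0)
N, d_k = 4, 8                                         # toy sequence length and key dim
q, k, v = (torch.randn(N, d_k) for _ in range(3))
att_prev = torch.softmax(torch.randn(N, N), dim=-1)   # Att^(M-1) from the previous layer
alpha = torch.tensor(1.0)                             # alpha_M, initialized to 1

att_out = torch.softmax(q @ k.T / math.sqrt(d_k), dim=-1)  # Att'^(M)
att_target = att_out + alpha * att_prev                    # target self-attention map
result = att_target @ v                                    # WRSA(Q, K, V), formula (4)
print(result.shape)                                        # torch.Size([4, 8])
```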
Step 103, determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Specifically, in this embodiment, after the embedded input vector sequence is input to the encoder of the visual Transformer model to obtain the coded vector sequence corresponding to the image to be classified, the feature vector corresponding to the image to be classified needs to be determined based on the coded vector sequence, where the feature vector is a vector used for classifying the image to be classified.
Optionally, in a possible implementation manner of the embodiment of the present invention, the determining, based on the coded vector sequence, the feature vector corresponding to the image to be classified may specifically be implemented in the following manner:
and determining the category embedding vector as a feature vector corresponding to the image to be classified.
Specifically, in the present embodiment, the class embedding vector is determined as the feature vector corresponding to the image to be classified.
It can be understood that the class embedding vector is determined as the feature vector corresponding to the image to be classified because the class embedding vector in the coding vector sequence is learnable and, after encoding, contains the feature information of the whole image to be classified.
Therefore, in practical application, the class embedding vector can be determined as the feature vector corresponding to the image to be classified.
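In code, this amounts to taking the class-token position of the encoder output (variable names hypothetical):

```python
feature = encoded_sequence[:, 0]   # z_L^0: the class embedding as the image feature, shape (B, D)
```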
After determining the feature vector corresponding to the image to be classified, inputting the feature vector into a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Specifically, the classifier of the visual Transformer model consists of a normalization layer followed by two fully-connected layers. After the feature vector corresponding to the image to be classified is determined, the feature vector is input into the trained classifier, which then outputs the classification result of the image to be classified.
Optionally, in a possible implementation manner of the embodiment of the present invention, the classifier is obtained by training using a cross entropy loss function.
Specifically, in this embodiment, the feature vectors corresponding to the images to be classified are input to the classifier, and the classifier is trained with the cross entropy loss function until the loss value reaches a preset threshold, which indicates that the training of the classifier is complete.
In practical applications, the cross entropy loss function can be expressed by the following equation (5):

$L = -\sum_i \left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right]$    (5)

where L denotes the cross entropy loss value; $y_i$ is the label of sample i, with 1 for the positive class and 0 for the negative class; and $p_i$ is the probability that sample i is predicted as the positive class.
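A minimal training sketch for the classification head under these definitions follows; the hidden width, the GELU between the two fully-connected layers, the optimizer, and the binary (two-class) setting of equation (5) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Normalization layer followed by two fully-connected layers, as described
    above; the activation between the FCs is an assumption."""
    def __init__(self, dim=768, hidden=256, num_classes=1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, feat):                 # feat: class-token feature z_L^0, (B, D)
        return self.fc2(F.gelu(self.fc1(self.norm(feat))))

head = ClassificationHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

def train_step(features, labels):
    """features: (B, D) class-token vectors from the encoder; labels: (B,) in {0, 1}."""
    p = torch.sigmoid(head(features).squeeze(-1))       # p_i: probability of positive class
    loss = F.binary_cross_entropy(p, labels.float())    # equation (5), averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```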
In the above embodiment, the feature vector corresponding to the image to be classified is determined based on the coding vector sequence, and the feature vector is input to the classifier of the visual Transformer model, so that the accuracy of image classification can be effectively improved.
The image classification method provided by the invention inputs the embedded input vector sequence into an encoder of a visual Transformer model comprising residual multi-head self-attention layers, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner. Based on the self-attention map and self-attention layer output map connected in this residual manner, information exchange among coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; inputting the embedded input vector sequence into the visual Transformer model thus yields a more comprehensive coding vector sequence corresponding to the image to be classified, and classifying the image based on this coding vector sequence can effectively improve the accuracy of image classification.
Referring to fig. 3, fig. 3 is a second schematic flow chart of the image classification method provided by the present invention, which specifically includes steps 301 to 310:
step 301, obtaining an image to be classified.
Step 302, splitting an image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks; and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate an embedding input vector sequence.
And step 303, inputting the embedded input vector sequence into a first normalization layer in a visual Transformer model coding block for normalization processing, and generating a processed embedded input vector sequence.
In the residual multi-head self-attention layer, steps 304 to 308 are performed:
and step 304, inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified.
And 305, generating an M-th layer self-attention layer output graph for each self-attention head in the weighted residual scaling dot product attention layer based on the index vector sequence and the key vector sequence, wherein M is greater than or equal to 2 and less than or equal to L.
And step 306, generating a target self-attention diagram based on the M-1 layer self-attention diagram and the M layer self-attention layer output diagram.
And 307, generating a self-attention result corresponding to each self-attention head based on the target self-attention diagram and the value vector sequence.
And 308, splicing self-attention results corresponding to the H attention heads, inputting the spliced results into a second linear layer, and generating a first vector sequence.
Step 309, inputting the first vector sequence into a second normalization layer and a feedforward layer to generate a second vector sequence; and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Step 310, determining the category embedded vector as a feature vector corresponding to the image to be classified; and inputting the characteristic vector to a classifier of a visual transform model to obtain a classification result of the image to be classified.
The following describes the image classification apparatus provided by the present invention, and the image classification apparatus described below and the image classification method described above may be referred to in correspondence with each other.
Fig. 4 is a schematic structural diagram of an image classification apparatus 400 provided in the present invention; the apparatus includes:
The pre-processing module 401 is configured to obtain an image to be classified, and pre-process the image to be classified to obtain an embedded input vector sequence;
an encoding module 402, configured to input the embedded input vector sequence to an encoder of a visual Transformer model and output a coding vector sequence corresponding to the image to be classified, where the encoder includes L Transformer coding blocks, each coding block includes a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected to the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L;
a classifying module 403, configured to determine a feature vector corresponding to the image to be classified based on the coding vector sequence, and input the feature vector to a classifier of a visual Transformer model to obtain a classification result of the image to be classified.
The image classification apparatus provided by the invention inputs the embedded input vector sequence into an encoder of a visual Transformer model comprising residual multi-head self-attention layers, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, and the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner. Based on the self-attention map and self-attention layer output map connected in this residual manner, information exchange among coding blocks can be effectively enhanced and the network degradation problem of the visual Transformer model alleviated; inputting the embedded input vector sequence into the visual Transformer model thus yields a more comprehensive coding vector sequence corresponding to the image to be classified, and classifying the image based on this coding vector sequence can effectively improve the accuracy of image classification.
Optionally, the coding block further includes a first normalization layer, a second normalization layer, and a feed-forward layer;
an encoding module 402, further configured to:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
Optionally, the residual multi-head self-attention layer includes a first linear layer, a weighted residual scaling dot product attention layer, and a second linear layer, where the weighted residual scaling dot product attention layer includes H self-attention heads, the self-attention layer output map of the Mth weighted residual scaling dot product attention layer is connected with the self-attention map of the (M-1)-th weighted residual scaling dot product attention layer in a residual manner, and H is a positive integer;
an encoding module 402, further configured to:
inputting the processed embedded input vector sequence into the first linear layer to generate an index vector sequence, a key vector sequence and a value vector sequence corresponding to the image to be classified;
inputting the index vector sequence, the key vector sequence and the value vector sequence into the weighted residual scaling dot product attention layer to generate H self-attention results corresponding to the self-attention heads;
and splicing self-attention results corresponding to the H attention heads, and inputting the splicing results into the second linear layer to generate the first vector sequence.
Optionally, the encoding module 402 is further configured to:
for each self-attention head, generating an Mth-layer self-attention layer output map based on the index vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th layer self-attention map and the Mth-layer self-attention layer output map;
and generating a self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
Optionally, the preprocessing module 401 is further configured to:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedded vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector for the image block embedding vector sequence respectively to generate the embedding input vector sequence.
Optionally, the classification module 403 is further configured to:
and determining the category embedding vector as a feature vector corresponding to the image to be classified.
Optionally, the classifier is obtained by training with a cross entropy loss function.
Fig. 5 is a schematic physical structure diagram of an electronic device 500 provided in the present invention. As shown in fig. 5, the electronic device may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform an image classification method comprising: acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the image classification method provided by the above methods, the method comprising: acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image classification method provided by the above methods, the method comprising: acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence; inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the Mth coding block is connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block in a residual manner, L is a positive integer, and 2 ≤ M ≤ L; determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector to a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. An image classification method, comprising:
acquiring an image to be classified, and preprocessing the image to be classified to obtain an embedded input vector sequence;
inputting the embedded input vector sequence into an encoder of a visual Transformer model, and outputting a coding vector sequence corresponding to the image to be classified, wherein the encoder comprises L Transformer coding blocks, each coding block comprises a residual multi-head self-attention layer, the self-attention layer output map corresponding to the residual multi-head self-attention layer of the M-th coding block is residually connected with the self-attention map corresponding to the residual multi-head self-attention layer of the (M-1)-th coding block, L is a positive integer, and 2 ≤ M ≤ L;
determining a feature vector corresponding to the image to be classified based on the coding vector sequence, and inputting the feature vector into a classifier of the visual Transformer model to obtain a classification result of the image to be classified.
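For orientation, the claimed pipeline can be read as a three-step forward pass. The Python sketch below is an illustrative reading of claim 1, not the patent's implementation; `patch_embed`, `encoder`, and `classifier` are hypothetical components, sketched under claims 5, 2 to 4, and 7 respectively.

```python
def classify(image, patch_embed, encoder, classifier):
    """Illustrative forward pass for claim 1 (all component names assumed)."""
    x = patch_embed(image)      # embedded input vector sequence (claim 5)
    z = encoder(x)              # coding vector sequence from L coding blocks (claims 2-4)
    feature = z[:, 0]           # class-embedding position as the feature vector (claim 6)
    return classifier(feature)  # classification result of the image
```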
2. The image classification method according to claim 1, wherein the coding block further includes a first normalization layer, a second normalization layer, and a feed-forward layer;
the inputting the embedded input vector sequence into the encoder and outputting the coding vector sequence corresponding to the image to be classified includes:
inputting the embedded input vector sequence into the first normalization layer for normalization processing to generate a processed embedded input vector sequence; inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence;
inputting the first vector sequence into the second normalization layer and the feedforward layer to generate a second vector sequence;
and inputting the second vector sequence into the first normalization layer to perform loop iteration for L times to obtain a coding vector sequence corresponding to the image to be classified output by the feedforward layer.
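Read concretely, claim 2 describes a pre-normalization Transformer block iterated L times. The PyTorch sketch below is one plausible wiring under that reading; it reuses the `ResidualMultiHeadSelfAttention` module sketched after claim 3 below (the two snippets are meant to be run together), and the skip connections and hidden sizes are assumptions borrowed from common ViT practice rather than fixed by the claim.

```python
import torch.nn as nn

class CodingBlock(nn.Module):
    """One Transformer coding block per claim 2: first normalization layer,
    residual multi-head self-attention layer, second normalization layer,
    feed-forward layer (pre-norm wiring assumed)."""

    def __init__(self, dim, num_heads, mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # first normalization layer
        self.attn = ResidualMultiHeadSelfAttention(dim, num_heads)  # claim 3 sketch
        self.norm2 = nn.LayerNorm(dim)   # second normalization layer
        self.ffn = nn.Sequential(        # feed-forward layer
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x, prev_scores=None):
        # First vector sequence: residual MHSA over the normalized input.
        attn_out, scores = self.attn(self.norm1(x), prev_scores)
        x = x + attn_out
        # Second vector sequence: feed-forward over the normalized result.
        x = x + self.ffn(self.norm2(x))
        return x, scores

class Encoder(nn.Module):
    """L coding blocks looped in sequence; block M receives the
    self-attention map of block M-1 (the claimed residual connection)."""

    def __init__(self, dim, num_heads, mlp_dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            [CodingBlock(dim, num_heads, mlp_dim) for _ in range(depth)])

    def forward(self, x):
        scores = None                # block 1 has no predecessor map
        for block in self.blocks:    # iterate L times
            x, scores = block(x, scores)
        return x                     # coding vector sequence
```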
3. The image classification method according to claim 2, wherein the residual multi-head self-attention layer comprises a first linear layer, a weighted residual scaled dot-product attention layer, and a second linear layer, the weighted residual scaled dot-product attention layer comprises H self-attention heads, the self-attention layer output map of the M-th weighted residual scaled dot-product attention layer is residually connected with the self-attention map of the (M-1)-th weighted residual scaled dot-product attention layer, and H is a positive integer;
wherein the inputting the processed embedded input vector sequence into the residual multi-head self-attention layer to generate a first vector sequence comprises:
inputting the processed embedded input vector sequence into the first linear layer to generate a query vector sequence, a key vector sequence, and a value vector sequence corresponding to the image to be classified;
inputting the query vector sequence, the key vector sequence, and the value vector sequence into the weighted residual scaled dot-product attention layer to generate H self-attention results corresponding to the self-attention heads;
and concatenating the self-attention results corresponding to the H self-attention heads, and inputting the concatenated result into the second linear layer to generate the first vector sequence.
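A minimal PyTorch sketch of the residual multi-head self-attention layer of claim 3. Fusing the query/key/value projections into a single first linear layer and making the per-head residual weight `alpha` a learnable parameter are implementation assumptions; the claim only requires the two linear layers, H heads, and the residually connected attention maps.

```python
import torch
import torch.nn as nn

class ResidualMultiHeadSelfAttention(nn.Module):
    """Residual multi-head self-attention layer per claim 3 (sketch)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # first linear layer: Q, K, V
        self.proj = nn.Linear(dim, dim)      # second linear layer
        self.alpha = nn.Parameter(torch.ones(num_heads))  # residual weights (assumed)

    def forward(self, x, prev_scores=None):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, H, n, head_dim)
        # M-th-layer self-attention output map for every head.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if prev_scores is not None:
            # Weighted residual connection to the (M-1)-th-layer map.
            scores = scores + self.alpha.view(1, -1, 1, 1) * prev_scores
        attn = scores.softmax(dim=-1)
        out = attn @ v                              # H self-attention results
        out = out.transpose(1, 2).reshape(b, n, d)  # concatenate the heads
        return self.proj(out), scores               # scores feed block M+1
```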
4. The image classification method according to claim 3, wherein the inputting the query vector sequence, the key vector sequence, and the value vector sequence into the weighted residual scaled dot-product attention layer to generate H self-attention results corresponding to the self-attention heads comprises:
for each self-attention head, generating an M-th-layer self-attention layer output map based on the query vector sequence and the key vector sequence;
generating a target self-attention map based on the (M-1)-th-layer self-attention map and the M-th-layer self-attention layer output map;
and generating the self-attention result corresponding to each self-attention head based on the target self-attention map and the value vector sequence.
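The three generating steps of claim 4 map directly onto a few tensor operations for a single head. In the sketch below, the scalar residual weight `alpha` is an assumption (the claim says the combination is weighted but does not fix its form), and the pre-softmax combination of score maps follows the same convention as the module above.

```python
import math
import torch
import torch.nn.functional as F

def weighted_residual_attention(q, k, v, prev_map=None, alpha=1.0):
    """Claim 4 for one self-attention head. q, k, v: (n, d_head) tensors;
    prev_map: the (M-1)-th-layer self-attention map, None in block 1."""
    # Step 1: M-th-layer self-attention layer output map from Q and K.
    out_map = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Step 2: target self-attention map as a weighted residual combination.
    target_map = out_map if prev_map is None else out_map + alpha * prev_map
    # Step 3: self-attention result from the target map and the values.
    result = F.softmax(target_map, dim=-1) @ v
    return result, target_map  # the target map is passed on to layer M+1
```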
5. The image classification method according to claim 1, wherein the preprocessing the image to be classified to obtain an embedded input vector sequence comprises:
splitting the image to be classified into a plurality of image blocks, and generating an image block embedding vector sequence based on the image blocks;
and adding a category embedding vector and a position embedding vector, respectively, to the image block embedding vector sequence to generate the embedded input vector sequence.
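One common way to realize claim 5 is a strided convolution that simultaneously splits the image into blocks and embeds them; that choice, and the concrete image, patch, and embedding sizes below, are assumptions, not requirements of the claim.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Preprocessing per claim 5 (sketch with assumed default sizes)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # category embedding vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # position embeddings

    def forward(self, img):                            # img: (b, c, h, w)
        x = self.proj(img).flatten(2).transpose(1, 2)  # image block embedding vector sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend the category embedding
        return x + self.pos_embed                      # embedded input vector sequence
```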
6. The image classification method according to claim 5, wherein the determining the feature vector corresponding to the image to be classified based on the coding vector sequence comprises:
and determining the category embedding vector as the feature vector corresponding to the image to be classified.
7. The image classification method according to claim 1, wherein the classifier is trained using a cross-entropy loss function.
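Claim 7 fixes only the loss function. Below is a minimal training step under the assumptions that `model` maps an image batch to class logits (as in the sketch after claim 1) and that `loader` yields `(images, labels)` batches.

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of training with cross-entropy loss (claim 7, sketch)."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(model(images), labels)  # cross-entropy on logits
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```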
8. An image classification apparatus, comprising:
the image classification device comprises a preprocessing module, a classification module and a classification module, wherein the preprocessing module is used for acquiring an image to be classified and preprocessing the image to be classified to obtain an embedded input vector sequence;
the encoding module is used for inputting the embedded input vector sequence into an encoder of a visual transform model and outputting an encoded vector sequence corresponding to the image to be classified, wherein the encoder comprises L transform encoding blocks, each encoding block comprises a residual multi-head self-attention layer, a self-attention layer output graph corresponding to the residual multi-head self-attention layer of the Mth encoding block is connected with a self-attention graph corresponding to the residual multi-head self-attention layer of the M-1 th encoding block in a residual mode, L is a positive integer, M is more than or equal to 2 and is less than or equal to L;
and the classification module is used for determining a feature vector corresponding to the image to be classified based on the coding vector sequence, inputting the feature vector to a classifier of a visual Transformer model, and obtaining a classification result of the image to be classified.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image classification method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the image classification method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the image classification method according to any one of claims 1 to 7.
CN202210681224.8A 2022-06-15 2022-06-15 Image classification method and device Pending CN115131607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210681224.8A CN115131607A (en) 2022-06-15 2022-06-15 Image classification method and device

Publications (1)

Publication Number Publication Date
CN115131607A true CN115131607A (en) 2022-09-30

Family

ID=83377601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210681224.8A Pending CN115131607A (en) 2022-06-15 2022-06-15 Image classification method and device

Country Status (1)

Country Link
CN (1) CN115131607A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN113469283A (en) * 2021-07-23 2021-10-01 山东力聚机器人科技股份有限公司 Image classification method, and training method and device of image classification model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645566A (en) * 2023-07-21 2023-08-25 中国科学院自动化研究所 Classification method based on full-addition pulse type transducer
CN116645566B (en) * 2023-07-21 2023-10-31 中国科学院自动化研究所 Classification method based on full-addition pulse type transducer
CN117036832A (en) * 2023-10-09 2023-11-10 之江实验室 Image classification method, device and medium based on random multi-scale blocking
CN117036832B (en) * 2023-10-09 2024-01-05 之江实验室 Image classification method, device and medium based on random multi-scale blocking

Similar Documents

Publication Publication Date Title
Parmar et al. Image transformer
US11488009B2 (en) Deep learning-based splice site classification
Dinh et al. Density estimation using real nvp
CN111079532B (en) Video content description method based on text self-encoder
WO2022017245A1 (en) Text recognition network, neural network training method, and related device
US10579923B2 (en) Learning of classification model
CN115131607A (en) Image classification method and device
CN112529150A (en) Model structure, model training method, image enhancement method and device
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN110728297B (en) Low-cost antagonistic network attack sample generation method based on GAN
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
Flenner et al. A deep non-negative matrix factorization neural network
Gallant et al. Positional binding with distributed representations
CN114418030A (en) Image classification method, and training method and device of image classification model
CN114782291B (en) Training method and device of image generator, electronic equipment and readable storage medium
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
CN114565789B (en) Text detection method, system, device and medium based on set prediction
Li et al. Image operation chain detection with machine translation framework
CN114581918A (en) Text recognition model training method and device
WO2023068953A1 (en) Attention-based method for deep point cloud compression
CN113869005A (en) Pre-training model method and system based on sentence similarity
CN116168394A (en) Image text recognition method and device
CN115862015A (en) Training method and device of character recognition system, and character recognition method and device
CN115471576A (en) Point cloud lossless compression method and device based on deep learning
CN115273110A (en) Text recognition model deployment method, device, equipment and storage medium based on TensorRT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination