CN117994607A - Training method of vision self-attention model - Google Patents

Training method of vision self-attention model

Info

Publication number
CN117994607A
Authority
CN
China
Prior art keywords
training
input
attention model
sample
visual
Prior art date
Legal status
Pending
Application number
CN202410130601.8A
Other languages
Chinese (zh)
Inventor
刘欣刚
宫昊宇
吴少智
苏涵
冯承霖
张立澄
彭伟航
Current Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202410130601.8A
Publication of CN117994607A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a training method of a visual self-attention model, in which a channel attention mechanism is added to the backbone network of the visual self-attention model. Training the visual self-attention model comprises: acquiring a plurality of training samples; inputting the input sample of the current training sample into the visual self-attention model and determining a plurality of feature vectors; concatenating a classification head vector and learnable position vectors with the plurality of feature vectors to obtain a feature vector to be input; inputting the feature vector to be input into a visual encoder to obtain feature information to be fused; processing the feature information to be fused with a multi-layer perceptron to obtain a prediction result; and repeating the determination of the prediction result on the current training sample until a preset number-of-times threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold. The method improves the learning ability of the visual self-attention model on the training samples and suppresses overfitting of the model.

Description

Training method of vision self-attention model
Technical Field
The invention relates to the technical field of computer vision, in particular to a training method of a vision self-attention model.
Background
With the application of self-attention models in the field of computer vision, visual self-attention models have demonstrated that a purely attention-based network can achieve better results on large data sets than a convolutional neural network.
However, the existing visual self-attention model uses only spatial attention, and no data interaction is performed between different channels of the same feature, which causes the visual self-attention model to focus more on "where" an object in an input image is rather than "what" it is. In addition, because the visual self-attention model Vision Transformer has a strong ability to model context, it can achieve good results on the training set of a small data set, but its performance on the validation and test sets often falls far short of that on the training set, i.e., serious overfitting occurs. In view of this, the present invention provides a training method for a visual self-attention model with an added channel attention mechanism.
Disclosure of Invention
The invention provides a training method of a visual self-attention model, which improves the learning ability of the visual self-attention model on training samples and can suppress overfitting of the visual self-attention model.
According to an aspect of the present invention, there is provided a training method of a visual self-attention model, the method comprising:
adding a channel attention mechanism in a backbone network of the visual self-attention model;
training a visual self-attention model, comprising:
obtaining a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples;
for each training sample, inputting an input sample in the current training sample into a visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample;
concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the feature vectors to obtain the feature vector to be input;
inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused;
processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result;
and repeatedly determining the prediction result based on the current training sample until a preset number-of-times threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold.
According to another aspect of the present invention, there is provided a training apparatus for a visual self-attention model, the apparatus comprising:
The training sample acquisition module is used for acquiring a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples;
The feature vector determining module is used for inputting an input sample in the current training sample into the visual self-attention model for each training sample, and determining a plurality of feature vectors corresponding to the input sample; wherein a channel attention mechanism is added in the backbone network of the visual self-attention model.
The feature vector splicing module is used for concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the feature vectors to obtain the feature vector to be input;
the information interaction processing module is used for inputting the feature vector to be input into the visual encoder for information interaction processing to obtain feature information to be fused;
The prediction result acquisition module is used for processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result;
And the model iterative training module is used for repeatedly determining the prediction result based on the current training sample until a preset number-of-times threshold is reached, and taking the next training sample as the current training sample until the accuracy of the visual self-attention model reaches a preset accuracy threshold.
According to another aspect of the present invention, there is provided an electronic device including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of training the visual self-attention model of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a training method for implementing a visual self-attention model of any of the embodiments of the present invention.
According to the technical scheme, a channel attention mechanism is added in the backbone network of a visual self-attention model, and the visual self-attention model is trained, comprising: obtaining a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples; for each training sample, inputting the input sample in the current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample; concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the feature vectors to obtain the feature vector to be input; inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused; processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result; and repeatedly determining the prediction result based on the current training sample until a preset number-of-times threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold. This increases the information interaction between channels and improves the learning ability of the visual self-attention model on the training samples, thereby suppressing overfitting of the visual self-attention model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a training method for a visual self-attention model provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a method for training a visual self-attention model provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a training device for a visual self-attention model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing a training method of a visual self-attention model according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a training method of a visual self-attention model according to an embodiment of the present invention. The embodiment is applicable to adding a channel attention mechanism to the backbone network of a visual self-attention model, so as to enhance information interaction between channels and suppress model overfitting. The method may be performed by a training device of the visual self-attention model, where the training device may be implemented in hardware and/or software and may be configured in an electronic device such as a mobile phone, a computer, or a server. As shown in fig. 1, the method includes:
s110, acquiring a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples.
In the embodiment of the invention, the training samples comprise input samples and actual classification results corresponding to the input samples. The input samples in the training samples are sample images to be classified. The actual classification result is a theoretical classification result corresponding to the sample image to be classified. The input sample may be an image captured by an imaging device, or an image stored in advance in a storage space, or an image obtained from a public data set, which is not limited in this embodiment. Wherein the input sample may be a sample image determined according to a specific task. For example, if the task is to determine whether the current scene has a risk, the input samples may include a series of sample images taken when there is a risk and a series of sample images taken when there is no risk, respectively; if the task is to determine the pet category, the input sample may be a plurality of sample images including different pets, and the actual classification result may correspond to the classification category of the sample image including the pets.
Specifically, before training the visual self-attention model, a plurality of training samples need to be acquired so that the visual self-attention model can be trained based on the training samples. In order to improve the accuracy of the visual self-attention model, as many and as diverse training samples as possible can be acquired, i.e., a large number of sample images to be classified can be collected.
S120, for each training sample, inputting an input sample in the current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample.
In embodiments of the present invention, the visual self-attention model may be used to process a variety of visual tasks; for example, it may handle image classification, object detection, semantic segmentation, point cloud processing, action recognition, and self-supervised learning. Alternatively, the visual self-attention model may be a self-attention model such as Vision Transformer, Swin Transformer, or DETR. A feature vector is a representation of a feature of an input sample in a particular feature space that contains various attributes of the input sample. For example, if the feature vector is high-dimensional, each dimension may correspond to a feature such as the color, texture, or shape of the sample image.
Specifically, after the training sample is obtained, the input sample in the training sample may be input into the visual self-attention model; when doing so, the input sample may first be converted into tensor format for subsequent training of the visual self-attention model. After the converted input sample is fed to the visual self-attention model, it can be processed by the non-overlapping convolution kernels of the visual self-attention model to obtain a plurality of feature vectors corresponding to the input sample.
Wherein a channel attention mechanism is added in the backbone network of the visual self-attention model.
In embodiments of the present invention, the backbone network generally refers to the main part of the visual self-attention model, which is responsible for extracting the features of the input data. Channel attention mechanisms are attention mechanisms that focus on the dependencies between different channels.
Specifically, a channel attention mechanism is added in the backbone network of the visual self-attention model to strengthen the information interaction between different channels. In the channel attention mechanism, each channel generally corresponds to a particular feature of the input image, such as color, texture, etc. By introducing a channel attention mechanism, the visual self-attention model can better understand the interdependencies between these features, thereby extracting the features more efficiently.
Optionally, inputting an input sample in the current training sample into the visual self-attention model and determining a plurality of feature vectors corresponding to the input sample includes: performing a convolution operation on the input sample based on at least one convolution kernel in the visual self-attention model and flattening the result to obtain a plurality of feature vectors corresponding to the input sample, wherein the plurality of feature vectors correspond to different regions of the input sample.
In embodiments of the present invention, one or more convolution kernels are typically used when convolving the input samples. A convolution kernel can be regarded as a small matrix that performs an element-wise product over a local region of the input sample and sums the results, which effectively extracts the local features of the input sample. Flattening may be understood as the conversion of multidimensional data into one-dimensional data. After the convolution operation, a multi-dimensional output is typically obtained, which contains a feature map for each channel. The flattening operation converts these feature maps into one-dimensional vectors for subsequent processing or concatenation. Correspondingly, each converted one-dimensional vector is a feature vector.
Specifically, after the input sample is input to the visual self-attention model, it can be converted into tensor format, then convolved by at least one non-overlapping convolution kernel to obtain the convolved feature maps; the feature map of each channel is then flattened into a one-dimensional vector, namely a feature vector. Since the at least one convolution kernel is non-overlapping, each convolution operation is performed over a fixed-size window of the input sample, the window being the same size as the convolution kernel, so that the resulting feature vectors correspond to different regions of the input sample.
For example, suppose the current input sample is a 240×240 pixel sample image with a channel length of 3, i.e., each pixel in the sample image is composed of three channel values. The sample image is input into a Vision Transformer model with the added channel attention mechanism, and is convolved and flattened by non-overlapping convolution kernels to obtain a series of image blocks (tokens). If each image block is set to 16×16 pixels, there are 240/16 = 15 blocks per side, so 15 × 15 = 225 token blocks, i.e., 225 feature vectors, are obtained.
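To make this patchify step concrete, the following is a minimal PyTorch sketch; the class name, the embedding dimension of 768, and other parameter values are illustrative assumptions, with the spatial dimensions taken from the 240×240 example above.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Non-overlapping convolutional patch embedding (illustrative sketch)."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # kernel_size == stride makes the convolution non-overlapping: each
        # output position sees exactly one fixed-size window of the input.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 240, 240)
        x = self.proj(x)                     # (B, embed_dim, 15, 15)
        x = x.flatten(2).transpose(1, 2)     # (B, 225, embed_dim): 225 tokens
        return x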
S130, concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the plurality of feature vectors to obtain the feature vector to be input.
In the embodiment of the invention, the classification head vector is the part used for classification, typically mapped by a fully connected layer into a predefined class; its weights and bias are learnable parameters. The learnable position vectors are used to characterize the position of each feature vector in the concatenated sequence. Correspondingly, after the classification head vector and the learnable position vectors corresponding to the plurality of feature vectors are combined with the plurality of feature vectors, the resulting vector is the feature vector to be input.
Specifically, a randomly initialized classification head vector is concatenated with the plurality of feature vectors, and at the same time a corresponding learnable position vector is added to each feature vector, so that the position of each feature vector is represented by its learnable position vector, yielding the feature vector to be input.
Optionally, concatenating the classification head vector and the learnable position vectors corresponding to the plurality of feature vectors to obtain the feature vector to be input includes: concatenating a randomly initialized classification head vector at the first position of the plurality of feature vectors, and determining the learnable position vectors according to the display regions of the plurality of feature vectors in the input sample, to obtain the feature vector to be input.
In the embodiment of the invention, since the input sample corresponds to a plurality of feature vectors, each feature vector corresponds to a different region of the input sample, and each region is a display region of the input sample.
Specifically, before the classification head vector is concatenated, it may be randomized, for example by random initialization or by randomly shuffling its elements. The randomized classification head vector is then concatenated at the first position of the plurality of feature vectors; it is thus the first part of the feature vector to be input and the part used for classification. In addition, since the feature vectors are determined from the image information contained in different regions of the input sample, the display region corresponding to each feature vector can be determined, and the corresponding learnable position vector can be determined to identify the different display regions of the input sample. On this basis, the plurality of feature vectors combined with the classification head vector and the learnable position vectors are recorded as the feature vector to be input.
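A minimal sketch of this step follows, under the same assumptions as the previous sketch; the 768-dimensional embedding and the random initialization scheme are illustrative, not specified by the patent.

import torch
import torch.nn as nn

class TokenAssembly(nn.Module):
    """Prepend a classification head vector and add learnable position vectors."""
    def __init__(self, num_patches=225, embed_dim=768):
        super().__init__()
        # Randomly initialized classification head vector (a learnable parameter).
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        # One learnable position vector per token, including the head vector.
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, tokens):                       # tokens: (B, 225, D)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1)          # head vector at first position
        return x + self.pos_embed                    # feature vector to be input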
And S140, inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused.
In the embodiment of the invention, a visual encoder, namely a Transformer encoder, can be used to carry out information interaction processing on the feature vector to be input. The visual encoder can comprise a spatial self-attention layer and a channel self-attention layer, which respectively realize spatial information interaction and information interaction between different channels. Correspondingly, the feature information to be fused is the feature information obtained after the feature vector to be input has undergone this information interaction processing.
Specifically, after the feature vector to be input is obtained, the feature vector to be input may be input to the visual encoder to perform the attention operation. The feature vector to be input can be input into a spatial self-attention layer of the visual encoder to perform self-attention operation, so that the feature vector to be input establishes spatial information interaction, and then the vector output by the spatial self-attention layer can be converted from a spatial dimension into a channel dimension by utilizing the characteristic of matrix transposition so as to meet the input requirement of the channel self-attention layer. And inputting the transposed vector to a channel self-attention layer of the visual encoder, and performing information interaction between different channels on the transposed vector to obtain feature information corresponding to the vector output by the channel self-attention layer, namely the feature information to be fused.
Optionally, inputting the feature vector to be input to the visual encoder for information interaction processing to obtain feature information to be fused, including: performing attention processing on the feature vector to be input based on a spatial self-attention layer in the visual encoder to obtain a spatially interacted feature vector to be processed; and processing the feature vector to be processed on the channel based on the channel self-attention layer in the visual encoder to obtain the feature information to be fused.
In embodiments of the present invention, a spatial self-attention layer is typically used to focus on the relationship of feature vectors to be input in the spatial dimension. The channel self-attention layer focuses on the relation of the feature vector to be input in the channel dimension.
Specifically, the feature vectors to be input are input into the spatial self-attention layer of the visual encoder to perform self-attention operation, namely the similarity between each feature vector in the feature vectors to be input can be calculated, and the information interaction of the spatial attention layer is realized between different feature vectors by weighting each feature vector. Correspondingly, the obtained vector is the feature vector to be processed. And then, the feature vector to be processed is transposed by utilizing the transposition characteristic of the matrix, and the transposed feature vector to be processed is input into a channel self-attention layer in the visual encoder so as to realize information interaction of the feature vector to be processed among different channels, and after the visual encoder processes, the feature information to be fused corresponding to the output feature vector can be obtained.
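The following sketch illustrates one plausible form of such an encoder block, using PyTorch multi-head attention; the head counts, pre-normalization, and dimensions are assumptions, since the patent does not publish these details.

import torch.nn as nn

class SpatialChannelEncoderBlock(nn.Module):
    """Spatial self-attention over tokens, then channel self-attention over
    the transposed sequence (illustrative sketch)."""
    def __init__(self, embed_dim=768, num_tokens=226, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.spatial_attn = nn.MultiheadAttention(embed_dim, heads,
                                                  batch_first=True)
        # After transposing, channels form the sequence axis, so each
        # "token" of the channel attention has num_tokens features.
        self.norm2 = nn.LayerNorm(num_tokens)
        self.channel_attn = nn.MultiheadAttention(num_tokens, 2,
                                                  batch_first=True)

    def forward(self, x):                        # x: (B, T, D)
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h)[0]    # spatial information interaction
        xt = x.transpose(1, 2)                   # (B, D, T): channels as sequence
        h = self.norm2(xt)
        xt = xt + self.channel_attn(h, h, h)[0]  # channel information interaction
        return xt.transpose(1, 2)                # back to (B, T, D)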
S150, processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result.
In an embodiment of the present invention, a multi-layer perceptron (MLP) is a multi-layer neural network that is divided into at least three layers: an input layer, a hidden layer, and an output layer. The input layer is responsible for receiving externally input data, namely receiving feature information to be fused, the hidden layer converts the feature information to be fused into meaningful feature representation through a series of complex calculations, and the output layer generates a prediction result according to the features. The prediction result may be understood as a prediction classification result corresponding to the input sample.
Specifically, the obtained feature information to be fused is input into a multi-layer perceptron, and the feature information to be fused is integrated through the multi-layer perceptron to obtain an integrated feature vector. And extracting a classification head vector from the integrated feature vectors to process the classification head vector based on the corresponding perceptron, thereby obtaining an image classification result corresponding to the input sample, namely a prediction result.
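A sketch of this prediction head is shown below; the hidden width and class count are placeholders, not values from the patent.

import torch.nn as nn

class PredictionHead(nn.Module):
    """Extract the classification head vector and map it to class logits."""
    def __init__(self, embed_dim=768, hidden_dim=1024, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),    # hidden layer
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),  # output layer
        )

    def forward(self, fused):        # fused: (B, T, D) feature information
        cls = fused[:, 0]            # classification head vector (first position)
        return self.mlp(cls)         # prediction result (class logits)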
S160, repeatedly determining the prediction result based on the current training sample until a preset number-of-times threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold.
In the embodiment of the present invention, the preset number-of-times threshold may be a standard value, preset according to the size of the visual self-attention model, for how many times the prediction result is repeatedly determined on the current training sample. For example, the preset number-of-times threshold may be 12. The accuracy may be the ratio of correct predictions to all predictions made by the visual self-attention model. The preset accuracy threshold may be an ideal accuracy value for the visual self-attention model, defined according to actual requirements.
In general, when the visual self-attention model is trained, the model parameters can be corrected based on how well the prediction results match the actual results, that is, based on the loss value of the visual self-attention model, so that a trained visual self-attention model can be obtained. The loss value is the difference between the actual classification result and the prediction result.
When the model parameters of the visual self-attention model are corrected using the loss value, convergence of the loss function can serve as the training target: for example, whether the training error is smaller than a preset error, whether the error change has stabilized, whether the current number of iterations equals a preset number, or whether the current accuracy reaches the preset accuracy threshold. If a convergence condition is detected, for example the training error of the loss function is smaller than the preset error or the error change has stabilized, the training of the visual self-attention model is completed and iterative training can be stopped. If no convergence condition is currently detected, further training samples can be obtained to train the visual self-attention model until the training error of the loss function falls within a preset range. When the training error of the loss function has converged, the visual self-attention model has been trained; at this point, when training samples are fed to the trained visual self-attention model, the corresponding accuracy reaches the preset accuracy threshold.
Specifically, the current training sample is repeatedly input into the visual self-attention model until the number of determined prediction results reaches the preset number-of-times threshold; at this point the current training sample has been fully trained, and the next training sample can be taken as the current training sample and input into the visual self-attention model for training, until the accuracy of the visual self-attention model reaches the preset accuracy threshold. Accordingly, the model parameters of the visual self-attention model can be corrected based on the accuracy to obtain a trained visual self-attention model.
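The iteration scheme of S160 could be sketched as follows; this is a sketch only, in which the optimizer, the learning rate, and the thresholds of 12 repetitions and 0.9 accuracy are illustrative assumptions rather than values from the patent.

import torch
import torch.nn as nn

def train_model(model, training_samples, times_threshold=12, acc_threshold=0.9):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()  # loss: gap between prediction and label
    accuracy = 0.0
    while accuracy < acc_threshold:                # preset accuracy threshold
        correct, total = 0, 0
        for image, label in training_samples:      # (input sample, actual result)
            for _ in range(times_threshold):       # preset number-of-times threshold
                logits = model(image.unsqueeze(0))
                loss = loss_fn(logits, label.unsqueeze(0))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                correct += int(logits.argmax(1).item() == label.item())
                total += 1
        accuracy = correct / total                 # accuracy over this pass
    return model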
According to the technical scheme, a channel attention mechanism is added in the backbone network of a visual self-attention model, and the visual self-attention model is trained, comprising: obtaining a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples; for each training sample, inputting the input sample in the current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample; concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the feature vectors to obtain the feature vector to be input; inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused; processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result; and repeatedly determining the prediction result based on the current training sample until a preset number-of-times threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold. This increases the information interaction between channels and improves the learning ability of the visual self-attention model on the training samples, thereby suppressing overfitting of the visual self-attention model.
Example two
Fig. 2 is a flowchart of a training method of a visual self-attention model according to an embodiment of the present invention, which is a preferred embodiment of the above-mentioned embodiments. The specific implementation manner can be seen in the technical scheme of the embodiment. Wherein, the technical terms identical to or corresponding to the above embodiments are not repeated herein. As shown in fig. 2, the method includes:
S210, acquiring a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples.
S220, for each training sample, inputting an input sample in the current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample.
Wherein a channel attention mechanism is added in the backbone network of the visual self-attention model.
S230, concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the plurality of feature vectors to obtain the feature vector to be input.
S240, inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused.
S250, processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result.
S260, according to a preset execution times threshold, the current training sample is input into the visual self-attention model again as the training sample, and a prediction result corresponding to each execution is obtained.
In the embodiment of the present invention, the preset execution-times threshold may be understood as a standard value for the number of times the current training sample is repeatedly executed. Each execution of the training sample yields one corresponding prediction result. The execution-times threshold is consistent with the preset number-of-times threshold described above.
Specifically, according to a preset execution times threshold, judging whether the execution times of the current training sample reach the execution times threshold, if not, inputting the current training sample again as the training sample into the visual self-attention model for repeated execution until the execution times of the current training sample reach the execution times threshold. At this time, the completion of the execution of the current training sample is described. Accordingly, each execution of the current training sample correspondingly obtains a corresponding prediction result. I.e. the number of predictors obtained corresponds to the number of executions.
S270, after the training samples of the current batch participate in the training of the visual self-attention model, determining the accuracy of the visual self-attention model after the training samples of the current batch are trained based on the prediction result of each training and the actual classification result in the training samples.
Specifically, after training of each training sample is completed, comparing an actual classification result corresponding to the input sample during each training with a prediction result obtained by the training, determining whether the prediction result of the training is correct, and determining the ratio of the correct number of the prediction results to the number of all the prediction results in the training samples of the current batch, namely the accuracy rate, after all the training samples of the current batch participate in the visual self-attention model training. Therefore, whether the visual self-attention model is trained is judged according to the accuracy rate.
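In code, the accuracy determination of S270 amounts to the following trivial sketch; the function and argument names are illustrative.

def batch_accuracy(predictions, actual_results):
    """Ratio of correct prediction results to all prediction results
    produced for the current batch of training samples."""
    correct = sum(int(p == y) for p, y in zip(predictions, actual_results))
    return correct / len(predictions)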
Optionally, the number of training samples of the current batch is less than a preset number threshold.
In the embodiment of the invention, as the visual self-attention model is easy to be fitted on the small data set, the standard value of the training sample number can be preset as the preset number threshold value so as to ensure that the data set corresponding to the training sample of the current batch is the small data set. That is, the number of training samples for the current batch is less than the preset number threshold. For example, the preset number threshold may be 1000.
And S280, when the accuracy is smaller than a preset accuracy threshold, acquiring a next batch of training samples, repeatedly executing the step of training the vision self-attention model, and determining the accuracy corresponding to the next batch of training samples until the accuracy reaches the preset accuracy threshold.
Specifically, when the accuracy corresponding to the training samples of the current batch is smaller than the preset accuracy threshold, it is indicated that the visual self-attention model is not trained, at this time, the training samples of the next batch may be obtained, and the steps of training the visual self-attention model are repeatedly performed, that is, S220 to S280. And comparing the predicted result of each training sample of the next batch with the actual classification result, and determining the accuracy corresponding to the training samples of the next batch. And if the accuracy of the training samples of the next batch reaches a preset accuracy threshold, finishing the training of the vision self-attention model. Correspondingly, if the accuracy of the training samples of the next batch does not reach the preset accuracy threshold, the training samples can be continuously obtained to train the vision self-attention model until the accuracy reaches the preset accuracy threshold. At this point, the visual self-attention model training is completed.
After the visual self-attention model is trained, the trained visual self-attention model can be used for processing the data to be classified.
Optionally, the method further comprises: acquiring data to be classified; and inputting the data to be classified into the trained visual self-attention model to obtain a target classification result of the data to be classified.
In an embodiment of the present invention, the data to be classified may be a sample image to be classified. The data to be classified may be an image captured by an image capturing device, an image stored in advance in a storage space, or an image obtained from a public data set, which is not limited in this embodiment. Accordingly, the target classification result may be a model output result corresponding to the input data to be classified.
Specifically, after obtaining the trained visual self-attention model, the data to be classified may be obtained to verify the trained visual self-attention model based on the data to be classified. And inputting the acquired data to be classified into a trained visual self-attention model to obtain a target classification result corresponding to the data to be classified.
For example, if the input samples in the training samples are a plurality of sample images including different pets, the actual classification result may correspond to the pet category of the sample image including the pet. After the step of training the visual self-attention model, a trained visual self-attention model is obtained. Then, if the obtained data to be classified is a sample image containing pets, inputting the sample image into a trained visual self-attention model, and obtaining the pet category corresponding to the sample image.
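Inference with the trained model could then look like the following sketch, assuming the image has already been preprocessed into a CHW tensor; the function name is illustrative.

import torch

@torch.no_grad()
def classify(trained_model, image):
    """Return the target classification result for one item of data to be
    classified (illustrative sketch)."""
    trained_model.eval()
    logits = trained_model(image.unsqueeze(0))   # add batch dimension
    return logits.argmax(dim=1).item()           # predicted class index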
Optionally, the input sample in the training sample is a sample image to be classified, and the data to be classified is a sample image to be classified.
In the embodiment of the present invention, the sample image may be an image taken from an image pickup device, or an image stored in advance from a storage space, or an image obtained from a public data set, which is not limited in this embodiment.
Specifically, an input sample in the training sample is a sample image to be classified, so that the sample image to be classified is subjected to image processing based on the visual self-attention model, and a corresponding classification result, namely a prediction result, is obtained. And determining the corresponding accuracy according to the prediction result and the actual classification result, so as to correct the visual self-attention model and obtain the trained visual self-attention model. Correspondingly, the data to be classified is a sample image to be classified, so that the trained visual self-attention model can perform image processing on the sample image to be classified to obtain a classification result corresponding to the classification data, namely, a target classification result.
According to the technical scheme, a channel attention mechanism is added in the backbone network of a visual self-attention model, and a plurality of training samples are obtained, wherein the training samples comprise input samples and actual classification results corresponding to the input samples; for each training sample, the input sample in the current training sample is input into the visual self-attention model, and a plurality of feature vectors corresponding to the input sample are determined; a classification head vector is concatenated with the plurality of feature vectors and the learnable position vectors corresponding to the feature vectors are added to obtain the feature vector to be input; the feature vector to be input is input into a visual encoder for information interaction processing to obtain feature information to be fused; the feature information to be fused is processed based on the multi-layer perceptron to obtain a prediction result; according to a preset execution-times threshold, the current training sample is input into the visual self-attention model again as the training sample, and a prediction result corresponding to each execution is obtained; after the training samples of the current batch have participated in training of the visual self-attention model, the accuracy of the visual self-attention model after training on the current batch is determined based on the prediction result of each training and the actual classification results in the training samples; when the accuracy is smaller than the preset accuracy threshold, the next batch of training samples is acquired, the step of training the visual self-attention model is repeated, and the accuracy corresponding to the next batch is determined, until the accuracy reaches the preset accuracy threshold. Because the channel attention mechanism is added, the visual self-attention model can learn during training not only general texture features and position-related information but also other information specific to the small data set; this increases the information interaction between channels and improves the learning ability of the visual self-attention model on the training samples, thereby suppressing overfitting of the visual self-attention model.
Example III
Fig. 3 is a schematic structural diagram of a training device for a visual self-attention model according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes: a training sample acquisition module 310, a feature vector determination module 320, a feature vector splicing module 330, an information interaction processing module 340, a prediction result acquisition module 350, and a model iterative training module 360.
A training sample acquisition module 310, configured to acquire a plurality of training samples, where the training samples include an input sample and an actual classification result corresponding to the input sample; a feature vector determination module 320, configured to, for each training sample, input the input sample in the current training sample into the visual self-attention model and determine a plurality of feature vectors corresponding to the input sample, wherein a channel attention mechanism is added in the backbone network of the visual self-attention model; a feature vector splicing module 330, configured to concatenate a classification head vector with the plurality of feature vectors and add the learnable position vectors corresponding to the feature vectors to obtain the feature vector to be input; an information interaction processing module 340, configured to input the feature vector to be input into the visual encoder for information interaction processing to obtain feature information to be fused; a prediction result acquisition module 350, configured to process the feature information to be fused based on the multi-layer perceptron to obtain a prediction result; and a model iterative training module 360, configured to repeatedly determine the prediction result based on the current training sample until a preset number-of-times threshold is reached, and take the next training sample as the current training sample until the accuracy of the visual self-attention model reaches a preset accuracy threshold.
According to the technical scheme, a channel attention mechanism is added in the backbone network of a visual self-attention model, and the visual self-attention model is trained, comprising: obtaining a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples; for each training sample, inputting the input sample in the current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample; concatenating a classification head vector with the plurality of feature vectors and adding the learnable position vectors corresponding to the feature vectors to obtain the feature vector to be input; inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused; processing the feature information to be fused based on the multi-layer perceptron to obtain a prediction result; and repeatedly determining the prediction result based on the current training sample until a preset number-of-times threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold. This increases the information interaction between channels and improves the learning ability of the visual self-attention model on the training samples, thereby suppressing overfitting of the visual self-attention model.
On the basis of the above embodiment, the optional feature vector determination module is configured to perform a convolution operation on the input sample based on at least one convolution kernel in the visual self-attention model and flatten the result to obtain a plurality of feature vectors corresponding to the input sample, wherein the plurality of feature vectors correspond to different regions of the input sample.
Optionally, the feature vector splicing module is configured to concatenate a randomly initialized classification head vector at the first position of the plurality of feature vectors and determine the learnable position vectors according to the display regions of the feature vectors in the input sample, to obtain the feature vector to be input.
Optionally, the information interaction processing module is configured to perform attention processing on the feature vector to be input based on a spatial self-attention layer in the visual encoder to obtain a spatially interacted feature vector to be processed; and processing the feature vector to be processed on the channel based on the channel self-attention layer in the visual encoder to obtain the feature information to be fused.
Optionally, after the prediction result obtaining module, the apparatus further includes: and the training sample iteration training module is used for inputting the current training sample into the visual self-attention model again as a training sample according to a preset execution frequency threshold value to obtain a prediction result corresponding to each execution.
Optionally, the apparatus further comprises: and the accuracy rate determining module is used for determining the accuracy rate of the visual self-attention model after the training of the training samples of the current batch is completed based on the prediction result of each training and the actual classification result in the training samples after the training of the training samples of the current batch is completed.
Optionally, the apparatus further comprises: and the accuracy judging module is used for acquiring a next batch of training samples to repeatedly execute the step of training the vision self-attention model when the accuracy is smaller than a preset accuracy threshold, and determining the accuracy corresponding to the next batch of training samples until the accuracy reaches the preset accuracy threshold.
Optionally, the device further comprises a model verification module for obtaining data to be classified; and inputting the data to be classified into the trained visual self-attention model to obtain a target classification result of the data to be classified.
Optionally, in the device, the input sample in the training sample is a sample image to be classified, and the data to be classified is a sample image to be classified.
Optionally, in the accuracy determining module, the number of training samples of the current batch is less than a preset number threshold.
The training device for the visual self-attention model provided by the embodiment of the invention can execute the training method for the visual self-attention model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the training method of the visual self-attention model.
In some embodiments, the method of training the visual self-attention model may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the above-described training method of the visual self-attention model may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the training method of the visual self-attention model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
The computer program for implementing the training method of the visual self-attention model of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Example five
The embodiment of the invention also provides a computer readable storage medium, the computer readable storage medium stores computer instructions for causing a processor to execute a training method of a visual self-attention model, the method comprising:
Adding a channel attention mechanism in a backbone network of the visual self-attention model; and training the visual self-attention model, comprising: obtaining a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples; for each training sample, inputting the input sample in the current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample; concatenating a classification head vector and the learnable position vectors corresponding to the feature vectors with the feature vectors to obtain feature vectors to be input; inputting the feature vectors to be input into a visual encoder for information interaction processing to obtain feature information to be fused; processing the feature information to be fused based on a multi-layer perceptron to obtain a prediction result; and repeatedly determining the prediction result based on the current training sample until a preset repetition threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold.
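By way of illustration only, the following minimal PyTorch-style sketch shows one way the recited training procedure could be realized; the names train_model, repeat_threshold and accuracy_threshold, and the use of a cross-entropy loss with a standard optimizer step, are assumptions made for the sketch rather than claimed features:

import torch.nn.functional as F

def train_model(model, batches, optimizer, repeat_threshold, accuracy_threshold):
    for samples in batches:                        # one batch of training samples
        correct, total = 0, 0
        for x, y in samples:                       # input sample, actual classification result
            for _ in range(repeat_threshold):      # re-present the current sample
                logits = model(x.unsqueeze(0))     # prediction result
                loss = F.cross_entropy(logits, y.unsqueeze(0))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            correct += int(logits.argmax(dim=-1).item() == y.item())
            total += 1
        if correct / total >= accuracy_threshold:  # batch accuracy reaches the threshold
            return model                           # training stops
    return model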
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS services.
It should be appreciated that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of training a visual self-attention model, comprising:
adding a channel attention mechanism in a backbone network of the visual self-attention model;
Training the visual self-attention model, comprising:
Obtaining a plurality of training samples, wherein the training samples comprise input samples and actual classification results corresponding to the input samples;
For each training sample, inputting an input sample in a current training sample into the visual self-attention model, and determining a plurality of feature vectors corresponding to the input sample;
concatenating a classification head vector and the learnable position vectors corresponding to the feature vectors with the feature vectors to obtain feature vectors to be input;
inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused;
processing the feature information to be fused based on a multi-layer perceptron to obtain a prediction result;
and repeatedly determining the prediction result based on the current training sample until a preset repetition threshold is reached, then taking the next training sample as the current training sample, until the accuracy of the visual self-attention model reaches a preset accuracy threshold.
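By way of illustration only (not forming part of the claim), the multi-layer perceptron step of claim 1 could be sketched as follows in PyTorch; embed_dim and num_classes are assumed hyperparameters:

import torch.nn as nn

class MLPHead(nn.Module):
    # Maps the feature information to be fused to a prediction result by
    # reading the classification head token and applying a small MLP.
    def __init__(self, embed_dim=768, num_classes=1000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, fused):            # fused: (B, N + 1, embed_dim)
        return self.mlp(fused[:, 0])     # prediction from the first (head) token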
2. The method of claim 1, wherein inputting the input sample in the current training sample into the visual self-attention model and determining a plurality of feature vectors corresponding to the input sample comprises:
performing a convolution operation and flattening on the input sample based on at least one convolution kernel in the visual self-attention model to obtain a plurality of feature vectors corresponding to the input sample;
wherein the plurality of feature vectors correspond to different regions of the input sample.
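By way of illustration only, such a convolution-and-flatten step could be sketched as follows; in_channels, embed_dim and patch_size are assumed hyperparameters:

import torch.nn as nn

class PatchEmbedding(nn.Module):
    # A convolution kernel with stride equal to its size maps each
    # non-overlapping region of the input sample to one feature vector.
    def __init__(self, in_channels=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)     # (B, num_regions, embed_dim)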
3. The method of claim 1, wherein the concatenating of the classification head vector and the learnable position vectors corresponding to the plurality of feature vectors with the plurality of feature vectors to obtain the feature vectors to be input comprises:
concatenating a randomly initialized classification head vector at the first position of the plurality of feature vectors, and determining the learnable position vectors according to the display areas of the plurality of feature vectors in the input sample, to obtain the feature vectors to be input.
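By way of illustration only, this splicing step could be sketched as follows; the name TokenAssembler and the random initialization scale are assumptions:

import torch
import torch.nn as nn

class TokenAssembler(nn.Module):
    # Prepends a randomly initialized classification head vector and adds one
    # learnable position vector per display area of the input sample.
    def __init__(self, num_regions, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_regions + 1, embed_dim) * 0.02)

    def forward(self, feats):                      # feats: (B, num_regions, embed_dim)
        cls = self.cls_token.expand(feats.size(0), -1, -1)
        tokens = torch.cat([cls, feats], dim=1)    # head vector at the first position
        return tokens + self.pos_embed             # feature vectors to be input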
4. The method of claim 1, wherein inputting the feature vector to be input into a visual encoder for information interaction processing to obtain feature information to be fused comprises:
performing attention processing on the feature vector to be input based on a spatial self-attention layer in the visual encoder to obtain a spatially interacted feature vector to be processed;
and processing the feature vector to be processed along the channel dimension based on a channel self-attention layer in the visual encoder to obtain the feature information to be fused.
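By way of illustration only, one plausible encoder block combining a spatial self-attention layer with a channel self-attention layer is sketched below; the single-head dot-product form of the channel attention and the residual/normalization layout are assumptions:

import torch
import torch.nn as nn

class DualAttentionEncoderBlock(nn.Module):
    # Spatial self-attention mixes information across token positions; channel
    # self-attention then mixes information across feature channels.
    def __init__(self, embed_dim=768, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                  batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)

    def channel_attention(self, x):
        # Transpose so channels play the role of tokens, then apply
        # scaled dot-product attention over the (B, C, N) matrix.
        xt = x.transpose(1, 2)
        scores = xt @ xt.transpose(1, 2) / xt.size(-1) ** 0.5
        return (torch.softmax(scores, dim=-1) @ xt).transpose(1, 2)

    def forward(self, x):                                       # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h, need_weights=False)[0]
        x = x + self.channel_attention(self.norm2(x))
        return x                                                # features to be fused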
5. The method of claim 1, wherein after obtaining the prediction result, the method further comprises:
inputting the current training sample into the visual self-attention model again, according to a preset execution count threshold, to obtain a prediction result corresponding to each execution.
6. The method according to claim 1, wherein the method further comprises:
after the training samples of the current batch have participated in training the visual self-attention model, determining the accuracy of the visual self-attention model trained on the current batch, based on the prediction result of each training and the actual classification results in the training samples.
7. The method of claim 6, wherein the method further comprises:
when the accuracy is smaller than the preset accuracy threshold, acquiring a next batch of training samples, repeating the step of training the visual self-attention model, and determining the accuracy corresponding to the next batch of training samples, until the accuracy reaches the preset accuracy threshold.
8. The method as recited in claim 1, further comprising:
Acquiring data to be classified;
And inputting the data to be classified into a trained visual self-attention model to obtain a target classification result of the data to be classified.
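By way of illustration only, this inference step could be sketched as follows; classify is a hypothetical helper, not claimed terminology:

import torch

def classify(model, image):
    # Runs the trained visual self-attention model on data to be classified
    # and returns the index of the target classification result.
    model.eval()
    with torch.no_grad():
        logits = model(image.unsqueeze(0))       # image: (C, H, W)
    return int(logits.argmax(dim=-1))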
9. The method according to claim 1 or 8, wherein the input samples in the training samples are sample images, and the data to be classified is an image to be classified.
10. The method of claim 6, wherein the number of training samples of the current batch is less than a preset number threshold.
CN202410130601.8A 2024-01-30 2024-01-30 Training method of vision self-attention model Pending CN117994607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410130601.8A CN117994607A (en) 2024-01-30 2024-01-30 Training method of vision self-attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410130601.8A CN117994607A (en) 2024-01-30 2024-01-30 Training method of vision self-attention model

Publications (1)

Publication Number Publication Date
CN117994607A true CN117994607A (en) 2024-05-07

Family

ID=90896777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410130601.8A Pending CN117994607A (en) 2024-01-30 2024-01-30 Training method of vision self-attention model

Country Status (1)

Country Link
CN (1) CN117994607A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination