WO2023044612A1 - Image classification method and apparatus - Google Patents

Image classification method and apparatus

Info

Publication number
WO2023044612A1
WO2023044612A1 PCT/CN2021/119682 CN2021119682W
Authority
WO
WIPO (PCT)
Prior art keywords
result
image
layer
feature
classification
Prior art date
Application number
PCT/CN2021/119682
Other languages
French (fr)
Chinese (zh)
Inventor
刘宝玉
王磊
马晓亮
程俊
Original Assignee
深圳先进技术研究院
中国科学院深圳理工大学(筹)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院, 中国科学院深圳理工大学(筹) filed Critical 深圳先进技术研究院
Priority to PCT/CN2021/119682 priority Critical patent/WO2023044612A1/en
Publication of WO2023044612A1 publication Critical patent/WO2023044612A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space

Definitions

  • the present application relates to the field of image processing, and in particular to an image classification method and device.
  • Transformer is a deep neural network based on a self-attention mechanism, which is not only used in the field of natural language processing but also in the field of image processing, such as converting two-dimensional image data into one-dimensional sequences and extracting multi-scale features from two-dimensional images.
  • the Transformer neural network model is extremely complex, resulting in a large memory footprint and a long training time for the neural network model.
  • One of the purposes of the embodiments of the present application is to provide a method and device for image classification, aiming at solving the problems of long calculation time and large memory usage of the existing Transformer neural network model.
  • an image classification method, including: acquiring a target image; preprocessing the target image to generate a feature map in a preset format; performing an inverse Fourier transform on the feature map in the preset format to generate a first inverse transform result; splicing the feature map in the preset format and the first inverse transform result to generate a first splicing result; performing feature extraction on the first splicing result to generate a first feature; and determining an image classification result according to the first feature.
  • the above method can be executed by a chip on an electronic device.
  • compared with the existing Transformer neural network model, in which a complex self-attention layer performs multiple convolution passes on the feature map in the preset format, this application replaces that complex self-attention layer with an inverse Fourier transform, forming a new Transformer neural network model; the new Transformer neural network model only needs to perform one inverse Fourier transform on the feature map in the preset format, without multiple convolution passes, to extract the features of the target image and determine its image classification result; consequently, the new model reduces both calculation time and memory usage during feature extraction and image classification.
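To make the replacement concrete, here is a minimal sketch (our own illustration, not code published with the patent) of a Fourier-based token-mixing step standing in for self-attention; the tensor layout and the choice of transform axes are assumptions:

```python
import torch

def fourier_token_mixing(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, tokens, channels). One 2-D inverse FFT over the token and
    # channel axes mixes information globally in O(n log n), with no
    # attention maps to compute or store.
    return torch.fft.ifft2(x, dim=(-2, -1)).real  # keep the real part for downstream layers
```

A layer like this has no learned weights, which is where the parameter and memory savings over self-attention come from.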
  • determining the image classification result according to the first feature includes: splicing the first feature and the first inverse transform result to generate a second splicing result; and classifying the second splicing result through at least one classification network, where any one of the at least one classification network includes a block merging module, a first normalization layer, a Fourier layer, a second normalization layer and a multi-layer perceptron; the block merging module merges the data input to the classification network, the first normalization layer normalizes the output of the block merging module, the Fourier layer performs inverse Fourier transform processing on the output of the first normalization layer, the second normalization layer normalizes the splicing result of the outputs of the block merging module and the Fourier layer, and the multi-layer perceptron performs feature extraction on the output of the second normalization layer.
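The following PyTorch sketch shows how such a classification network could be assembled from the named pieces; the layer widths, the 2×2 merge, LayerNorm, GELU, and the transform axes are our assumptions rather than details fixed by the text:

```python
import torch
from torch import nn

class ClassificationNetwork(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.merge = nn.Linear(4 * in_dim, out_dim)  # block merging: 2x2 neighbouring blocks -> wider channels
        self.norm1 = nn.LayerNorm(out_dim)           # first normalization layer
        self.norm2 = nn.LayerNorm(out_dim)           # second normalization layer
        self.mlp = nn.Sequential(                    # multi-layer perceptron
            nn.Linear(out_dim, 4 * out_dim), nn.GELU(), nn.Linear(4 * out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, h, w, c = x.shape                          # x: (batch, H, W, C)
        x = (x.reshape(b, h // 2, 2, w // 2, 2, c)    # group each 2x2 neighbourhood of blocks
              .permute(0, 1, 3, 2, 4, 5)
              .reshape(b, h // 2, w // 2, 4 * c))
        merged = self.merge(x)                                    # block merging output
        y = torch.fft.ifft2(self.norm1(merged), dim=(1, 2)).real  # Fourier layer: inverse FFT over space
        z = self.norm2(merged + y)                                # splice (element-wise sum), then normalize
        return self.mlp(z)                                        # feature extraction
```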
  • the complex self-attention layer in the above classification network is replaced by the Fourier layer to form a new classification network, which performs image classification on the target image; compared with the existing classification network, the new classification network reduces both calculation time and memory usage when classifying the target image.
  • the output result of the at least one classification network is processed through a softmax function to generate at least one probability value, and the at least one probability value is used to indicate the probability that the target image belongs to at least one image category.
  • the inverse Fourier transform is an inverse fast Fourier transform. Replacing the complex self-attention layer with inverse fast Fourier transform can speed up the operation speed of the new Transformer neural network model and classification network.
  • the formula of the inverse Fourier transform is:

    x_n = \frac{1}{N} \sum_{k=0}^{N-1} x_k e^{i 2\pi kn/N}

    where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
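As a sanity check, a direct evaluation of this formula agrees with NumPy's FFT-based inverse transform (a quick illustration, not part of the patent):

```python
import numpy as np

N = 8                                                # number of time-domain sampling points
x_k = np.random.randn(N) + 1j * np.random.randn(N)   # frequency-domain discrete signal
n = np.arange(N)[:, None]
k = np.arange(N)[None, :]
# x_n = (1/N) * sum_k x_k * exp(i * 2*pi * k * n / N), evaluated directly
x_n = (x_k * np.exp(2j * np.pi * k * n / N)).sum(axis=1) / N
assert np.allclose(x_n, np.fft.ifft(x_k))            # matches the fast O(N log N) inverse FFT
```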
  • the preprocessing of the target image includes: performing block segmentation processing on the target image to generate a block segmentation result; and processing the block segmentation result through a linear embedding layer to generate the feature map in the preset format.
  • an image classification device including a module for performing any one of the methods in the first aspect.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the processor performs any one of the methods of the first aspect.
  • FIG. 1 is a schematic structural diagram of an image classification system in an embodiment of the present application
  • FIG. 2 is a schematic flow diagram of a method for classifying images in an embodiment of the present application
  • Fig. 3 is the structural representation of new Transformer neural network model in the embodiment of the present application.
  • FIG. 4 is a schematic diagram of an image classification device in an embodiment of the present application.
  • references to "one embodiment" or "some embodiments" and the like in the specification of the present application mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc., appearing in various places in this specification do not necessarily all refer to the same embodiment; they mean "one or more but not all embodiments" unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • the Transformer neural network model is used in the field of image processing.
  • the complex attention layer in the existing Transformer neural network model leads to long calculation time and large memory usage of the model.
  • to solve these problems, this application replaces the complex self-attention layer in the existing Transformer neural network model with an inverse Fourier transform layer, forming a new Transformer neural network model.
  • the new Transformer neural network model solves the problems of long calculation time and large memory occupation of the existing Transformer neural network model.
  • Fig. 1 shows an image classification system provided by the present application. The system comprises an input unit 101, a preprocessing unit 102, a neural network model 103 and an output unit 104. The input unit 101 receives the target image input by the user and sends it to the preprocessing unit 102; the preprocessing unit 102 performs image segmentation and matrix transformation preprocessing on the target image and sends a feature map in a preset format to the neural network model 103; the neural network model 103 performs feature extraction and image classification on the feature map and sends the image classification result to the output unit 104; the output unit 104 receives the image classification result from the neural network model 103 and outputs it to the client so that the user can view it.
  • the present application proposes a method for image classification, as shown in Figure 2, the method can be executed by a chip in an electronic device, and the method includes:
  • the electronic device may acquire the target image to be classified input by the user through the input unit 101 shown in FIG. 1 .
  • the target image may be an image of any size, for example, a target image with a size of 224 ⁇ 224, and this application does not impose any limitation on the size of the target image.
  • the electronic device preprocesses the target image through the preprocessing unit 102 to generate a feature map in a preset format.
  • the feature map in the preset format is the output result of the preprocessing unit 102, and is used for inverse Fourier transform.
  • the above-mentioned preprocessing unit 102 includes a block segmentation module and a linear embedding layer, wherein the block segmentation module is used to perform block segmentation processing on the target image; the linear embedding layer is used to perform matrix transformation on the block segmentation result.
  • the user inputs a 3-channel (i.e., RGB) target image of size 224×224 through the input unit 101. The block segmentation module performs block segmentation on the 224×224 target image, dividing it into 56×56 blocks, each a 3-channel image of size 4×4; that is, the output of the block segmentation module (the block segmentation result) is a 56×56×3 image. The 56×56×3 image is input to the linear embedding layer for channel expansion and matrix transformation, yielding a 64×49×128 image (the feature map in the preset format), where 128 is the number of channels. That is, the linear embedding layer either expands the 56×56×3 image to 128 channels to obtain a 56×56×128 image and then applies a matrix transformation to obtain a 64×49×128 image, or first applies a matrix transformation to the 56×56×3 image to obtain a 64×49×3 image and then expands it to 128 channels to obtain a 64×49×128 image.
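The shape bookkeeping in this example can be reproduced with a few lines of PyTorch (a sketch under our assumptions; in particular, the final regrouping of the 3136 tokens into 64 windows of 49 is shown as a plain reshape, and the real model may order the tokens differently):

```python
import torch
from torch import nn

image = torch.randn(1, 3, 224, 224)                             # RGB target image
patches = nn.functional.unfold(image, kernel_size=4, stride=4)  # 4x4 blocks -> (1, 48, 3136), 3136 = 56*56
patches = patches.transpose(1, 2)                               # (1, 3136, 48): one 4x4x3 block per row
embed = nn.Linear(48, 128)                                      # linear embedding: expand to 128 channels
tokens = embed(patches)                                         # (1, 3136, 128)
feature_map = tokens.reshape(1, 64, 49, 128)                    # preset-format feature map, 64x49x128
```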
  • the electronic device sends the feature map in the preset format output by the preprocessing unit 102 to the neural network model 103. The neural network model 103 is the new Transformer neural network model, including a sub-network 201 and at least one classification network; for example, the new Transformer neural network model shown in Figure 3 includes one sub-network 201 and three classification networks: classification network 202, classification network 203 and classification network 204.
  • the above-mentioned sub-network 201 includes a normalization layer 2011, a Fourier layer 2012, a normalization layer 2013 and a multi-layer perceptron 2014. The normalization layer 2011 normalizes the feature map in the preset format sent by the preprocessing unit 102; the Fourier layer 2012 performs an inverse Fourier transform on the output of the normalization layer 2011; the normalization layer 2013 normalizes the output of the Fourier layer 2012; and the multi-layer perceptron 2014 performs feature extraction on the output of the normalization layer 2013.
  • the normalization layer 2011 in the subnetwork 201 first performs a normalization operation on the feature map in the preset format, and then sends the first normalization result output by the normalization layer 2011 to the Fourier layer 2012;
  • the Fourier layer 2012 receives the first normalization result from the normalization layer 2011 and performs an inverse Fourier transform on it to obtain the first inverse transform result. The formula of the above inverse Fourier transform is:

    x_n = \frac{1}{N} \sum_{k=0}^{N-1} x_k e^{i 2\pi kn/N}

    where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
  • the Fourier layer 2012 in the new Transformer neural network model replaces the complex self-attention layer in the existing Transformer neural network model.
  • compared with the existing Transformer neural network model, the new model can extract image features with a single inverse Fourier transform instead of multiple complex convolution operations; therefore, both its calculation time and its memory usage during feature extraction are reduced.
  • optionally, the Fourier layer 2012 may instead perform an inverse fast Fourier transform on the first normalization result to obtain the first inverse transform result, which accelerates the operation of the new Transformer neural network model.
  • the sub-network 201 splices the feature map in the preset format with the first inverse transform result output by the Fourier layer 2012 to obtain the first splicing result (i.e., the spliced image). The above splicing refers to summing the corresponding pixels of the feature map in the preset format and the first inverse transform result.
  • for example, suppose the pixel at position A in the 64×49×128 image output by the preprocessing unit 102 has value X1, and the pixel at position A in the 64×49×128 first inverse transform result has value X2. Splicing the feature map in the preset format with the first inverse transform result means adding these two values, so the first splicing result at position A is X1+X2. Pixel values at all other positions are spliced in the same way as at position A, which is not repeated here.
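In code, this splicing is just an element-wise sum of two equally shaped tensors, i.e. a residual connection (the variable names are ours):

```python
# feature_map and first_inverse_transform_result are both 64x49x128 tensors;
# at every position A the result holds X1 + X2.
first_splicing_result = feature_map + first_inverse_transform_result
```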
  • the multi-layer perceptron 2014 of the sub-network 201 is used to perform feature extraction on the first splicing result to obtain the first feature.
  • the above-mentioned multi-layer perceptron 2014 is a neural network composed of fully connected layers containing at least one hidden layer, and the output of each hidden layer is transformed by an activation function.
  • the number of layers of the multilayer perceptron 2014 and the number of hidden units in each hidden layer are hyperparameters.
  • for example, the first splicing result is a spliced image of 64×49×128, and the multi-layer perceptron 2014 performs feature extraction on this spliced image to obtain a 64×49×128 feature map (the first feature).
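A minimal version of such a multi-layer perceptron might look as follows; the hidden width of 512 and the GELU activation are our assumptions, since the text only fixes the 128-channel input and output:

```python
import torch
from torch import nn

mlp = nn.Sequential(              # one hidden layer; depth and width are hyperparameters
    nn.Linear(128, 512),
    nn.GELU(),                    # activation applied to the hidden layer's output
    nn.Linear(512, 128),
)
first_feature = mlp(torch.randn(1, 64, 49, 128))  # (1, 64, 49, 128) feature map
```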
  • the neural network model 103 (i.e., the new Transformer neural network model shown in Figure 3) performs image classification processing on the first feature. The neural network model 103 includes a sub-network 201 and three classification networks: classification network 202, classification network 203 and classification network 204.
  • the above-mentioned classification network 202 includes a block merging module 2021, a first normalization layer 2022, a Fourier layer 2023, a second normalization layer 2024 and a multi-layer perceptron 2025. The block merging module 2021 merges the data input to the classification network 202; the first normalization layer 2022 normalizes the output of the block merging module 2021; the Fourier layer 2023 performs inverse Fourier transform processing on the output of the first normalization layer 2022; the second normalization layer 2024 normalizes the splicing result of the outputs of the block merging module 2021 and the Fourier layer 2023; and the multi-layer perceptron 2025 performs feature extraction on the output of the second normalization layer 2024.
  • the block merging module 2021 in the above classification network 202 performs merging, channel expansion and matrix transformation on the first feature output by the sub-network 201. For example, the first feature output by the sub-network 201 is a 64×49×128 feature map; the block merging module 2021 merges the image blocks in the first feature in pairs to obtain a 28×28×128 image, expands it to 256 channels to obtain a 28×28×256 image, and then applies a matrix transformation to obtain a 16×49×256 image. Alternatively, the block merging module 2021 first expands the image blocks in the first feature to 256 channels to obtain a 64×49×256 image, merges them in pairs to obtain a 28×28×256 image, and then applies a matrix transformation to obtain a 16×49×256 image.
  • the above-mentioned first normalization layer 2022 normalizes the 16×49×256 image output by the block merging module 2021 to obtain a normalized 16×49×256 image; the Fourier layer 2023 performs inverse Fourier transform processing on the normalized 16×49×256 image to obtain a 16×49×256 image after the inverse Fourier transform; the second normalization layer 2024 normalizes the splicing result of the 16×49×256 image output by the block merging module 2021 and the 16×49×256 image after the inverse Fourier transform to obtain a second normalized 16×49×256 image; and the multi-layer perceptron 2025 performs feature extraction on the second normalized 16×49×256 image to obtain a 16×49×256 feature map (the second feature).
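One way to realize the block merging module so that it reproduces exactly these shapes (56×56×128 → 28×28×128 → 28×28×256 → 16×49×256) is sketched below; merging 2×2 neighbours by summation and the window regrouping are our reading of "pairwise merging" and "matrix transformation", not details fixed by the text:

```python
import torch
from torch import nn

def block_merging(x: torch.Tensor, expand: nn.Linear) -> torch.Tensor:
    b, h, w, c = x.shape                                        # x: (B, 56, 56, 128), the first feature
    x = x.reshape(b, h // 2, 2, w // 2, 2, c).sum(dim=(2, 4))   # pairwise 2x2 merge -> (B, 28, 28, 128)
    x = expand(x)                                               # 256-channel expansion -> (B, 28, 28, 256)
    g = x.shape[1] // 7                                         # matrix transformation: 4x4 grid of 7x7 windows
    x = x.reshape(b, g, 7, g, 7, x.shape[-1]).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, g * g, 49, x.shape[-1])                 # (B, 16, 49, 256)

out = block_merging(torch.randn(2, 56, 56, 128), nn.Linear(128, 256))  # -> (2, 16, 49, 256)
```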
  • the above-mentioned classification network 203 includes a block merging module 2031, a first normalization layer 2032, a Fourier layer 2033, a second normalization layer 2034 and a multi-layer perceptron 2035. The block merging module 2031 merges the data input to the classification network 203; the first normalization layer 2032 normalizes the output of the block merging module 2031; the Fourier layer 2033 performs inverse Fourier transform processing on the output of the first normalization layer 2032; the second normalization layer 2034 normalizes the splicing result of the outputs of the block merging module 2031 and the Fourier layer 2033; and the multi-layer perceptron 2035 performs feature extraction on the output of the second normalization layer 2034.
  • the block merging module 2031 in the above-mentioned classification network 203 performs merging, channel expansion and matrix transformation on the second feature output by the classification network 202. For example, the second feature output by the classification network 202 is a 16×49×256 feature map; the block merging module 2031 merges the image blocks in the second feature in pairs to obtain a 14×14×256 image, expands it to 512 channels to obtain a 14×14×512 image, and then applies a matrix transformation to obtain a 4×49×512 image (14×14 = 4×49). Alternatively, the block merging module 2031 first expands the image blocks in the second feature to 512 channels to obtain a 16×49×512 image, merges them in pairs to obtain a 14×14×512 image, and then applies a matrix transformation to obtain a 4×49×512 image.
  • the above-mentioned first normalization layer 2032 normalizes the 4×49×512 image output by the block merging module 2031 to obtain a normalized 4×49×512 image; the Fourier layer 2033 performs inverse Fourier transform processing on the normalized 4×49×512 image to obtain a 4×49×512 image after the inverse Fourier transform; the second normalization layer 2034 normalizes the splicing result of the 4×49×512 image output by the block merging module 2031 and the 4×49×512 image after the inverse Fourier transform to obtain a second normalized 4×49×512 image; and the multi-layer perceptron 2035 performs feature extraction on the second normalized 4×49×512 image to obtain a 4×49×512 feature map (the third feature).
  • the above-mentioned classification network 204 includes a block merging module 2041, a first normalization layer 2042, a Fourier layer 2043, a second normalization layer 2044 and a multi-layer perceptron 2045. The block merging module 2041 merges the data input to the classification network 204; the first normalization layer 2042 normalizes the output of the block merging module 2041; the Fourier layer 2043 performs inverse Fourier transform processing on the output of the first normalization layer 2042; the second normalization layer 2044 normalizes the splicing result of the outputs of the block merging module 2041 and the Fourier layer 2043; and the multi-layer perceptron 2045 performs feature extraction on the output of the second normalization layer 2044.
  • the block merging module 2041 in the above-mentioned classification network 204 performs merging, channel expansion and matrix transformation on the third feature output by the classification network 203. For example, the third feature output by the classification network 203 is a 4×49×512 feature map; the block merging module 2041 merges the image blocks in the third feature in pairs to obtain a 7×7×512 image, expands it to 1024 channels to obtain a 7×7×1024 image, and then applies a matrix transformation to obtain a 1×49×1024 image. Alternatively, the block merging module 2041 first expands the image blocks in the third feature to 1024 channels to obtain a 4×49×1024 image, merges them in pairs to obtain a 7×7×1024 image, and then applies a matrix transformation to obtain a 1×49×1024 image.
  • the above-mentioned first normalization layer 2042 normalizes the 1×49×1024 image output by the block merging module 2041 to obtain a normalized 1×49×1024 image; the Fourier layer 2043 performs inverse Fourier transform processing on the normalized 1×49×1024 image to obtain a 1×49×1024 image after the inverse Fourier transform; the second normalization layer 2044 normalizes the splicing result of the 1×49×1024 image output by the block merging module 2041 and the 1×49×1024 image after the inverse Fourier transform to obtain a second normalized 1×49×1024 image; and the multi-layer perceptron 2045 performs feature extraction on the second normalized 1×49×1024 image to obtain a 1×49×1024 feature map (the fourth feature).
  • this application replaces the complex self-attention layer in the classification network 202, the classification network 203 and the classification network 204 with a Fourier layer, which not only reduces the complexity of these classification networks but also saves computing time and memory.
  • the output result of the at least one classification network is processed through a Softmax function to generate at least one probability value, and the at least one probability value is used to indicate the probability that the target image belongs to at least one image category.
  • the Softmax function processes the result of at least one classification network output in the neural network model 103 to obtain at least one probability value.
  • for example, the Softmax function processes the fourth feature map (i.e., the 1×49×1024 feature map) output by the neural network model 103 and outputs the probability that it belongs to each image category; the category with the highest probability is determined as the category of the fourth feature map, i.e., the image classification result. Suppose there are 3 different picture categories: animals, people and landscapes. If the target picture is natural scenery, the target picture passes through the neural network model 103 to produce the fourth feature map, and the Softmax function then outputs the probabilities of the fourth feature map for the above three categories, e.g., a probability of 20% for animal, 40% for person and 90% for landscape. Since the probabilities satisfy landscape > person > animal, the fourth feature map is determined to be of the landscape class (i.e., the image classification result of the target image in Figure 3).
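In code, this final step is a plain softmax followed by an argmax; note that a true softmax output sums to 1, so the percentages above are best read as illustrative relative scores (the snippet uses made-up logits):

```python
import torch

logits = torch.tensor([[0.2, 1.1, 2.6]])          # hypothetical scores for animal / person / landscape
probs = torch.softmax(logits, dim=-1)             # probabilities over the three categories, summing to 1
category = ["animal", "person", "landscape"][probs.argmax(dim=-1).item()]
print(category, probs)                            # -> "landscape", since it has the highest probability
```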
  • the neural network model 103 is trained on the first 50 categories of the ImageNet-1K data set, wherein the size of the training set is 64817 pictures, the size of the verification set is 2500 pictures, and there are 50 categories in total.
  • the network prediction result (i.e., the output result of the neural network model 103) is input to the Softmax function and converted into probability values for the different categories; the formula is the cross-entropy between the two distributions,

    H(p, q) = -\sum_{i} p(x_i) \log q(x_i)

    where p(x_i) is the real probability distribution and q(x_i) is the predicted probability distribution.
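In PyTorch, this combination of Softmax and cross-entropy against the real distribution is available as a single call (a generic training snippet with made-up tensors, not the patent's code):

```python
import torch

logits = torch.randn(4, 50)            # network predictions for a batch, 50 categories
labels = torch.randint(0, 50, (4,))    # ground-truth category indices (the real distribution p, one-hot)
# cross_entropy applies Softmax to the logits internally and evaluates
# H(p, q) = -sum_i p(x_i) * log(q(x_i)) with q the predicted distribution.
loss = torch.nn.functional.cross_entropy(logits, labels)
```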
  • Table 1 shows the image classification results when the neural network model 103 and the Swin-B method classify the top 50 categories of the ImageNet-1K data set, where: Methods is the name of the method; ImageSize is the size of the input image during training; Param is the number of training parameters; Throughput (image/s) is the number of pictures processed per second; ImageNet is the accuracy of image classification on the ImageNet dataset, i.e., the rate at which the picture category with the highest probability matches the actual picture category; FLOPs (floating-point operations) measures the computing power required by the network model and hence its complexity; Step is the time required for each training step with a batch size of 64; and Epoch is the time required for each training epoch with a batch size of 64.
  • the times were obtained by testing on 4 GPUs (GeForce GTX 1080). Compared with the Swin-B method, the neural network model 103 of the present application reduces the number of parameters by 32%, the required floating-point operations by 33.7% and the time by 32.6%, while the accuracy reaches 90%.
  • in summary, this application replaces the complex self-attention layer in the existing Transformer neural network model with an inverse Fourier transform, forming a new Transformer neural network model; the new model only needs to perform one inverse Fourier transform on the feature map in the preset format, without multiple convolution passes, to extract the features of the target image and determine its image classification result; compared with the existing Transformer neural network model, it reduces both calculation time and memory usage during feature extraction and image classification.
  • this application also replaces the complex self-attention layer in the above-mentioned classification networks with a Fourier layer, forming new classification networks that perform image classification on the target image; compared with existing classification networks, the new classification networks reduce both calculation time and memory usage when classifying the target image.
  • FIG. 4 shows a schematic structural diagram of an image classification device provided by the present application.
  • the dotted line in Figure 4 indicates that the unit or the module is optional.
  • the device 400 may be used to implement the methods described in the foregoing method embodiments.
  • the apparatus 400 may be a terminal device or a server or a chip.
  • the device 400 includes one or more processors 401, which enable the device 400 to implement the method in the method embodiment corresponding to FIG. 2 .
  • the processor 401 may be a general purpose processor or a special purpose processor.
  • the processor 401 may be a central processing unit (central processing unit, CPU).
  • the CPU can be used to control the device 400, execute software programs, and process data of the software programs.
  • the device 400 may further include a communication unit 405, configured to implement signal input (reception) and output (transmission).
  • the apparatus 400 may be a chip, and the communication unit 405 may be an input and/or output circuit of the chip, or the communication unit 405 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.
  • the apparatus 400 may be a terminal device, and the communication unit 405 may be a transceiver of the terminal device, or the communication unit 405 may be a transceiver circuit of the terminal device.
  • the device 400 may include one or more memories 402, on which there is a program 404, which can be run by the processor 401 to generate instructions 403, so that the processor 401 executes the methods described in the above method embodiments according to the instructions 403.
  • data (such as the ID of the chip to be tested) may also be stored in the memory 402 .
  • the processor 401 may also read data stored in the memory 402 , the data may be stored in the same storage address as the program 404 , and the data may also be stored in a different storage address from the program 404 .
  • the processor 401 and the memory 402 may be set separately, or may be integrated together, for example, integrated on a system-on-chip (system on chip, SOC) of a terminal device.
  • the steps in the foregoing method embodiments may be implemented by logic circuits in the form of hardware or instructions in the form of software in the processor 401 .
  • the processor 401 may be a CPU, a digital signal processor (digital signal processor, DSP), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, such as discrete gates, transistor logic devices or discrete hardware components.
  • the present application also provides a computer program product, which implements the method described in any method embodiment in the present application when the computer program product is executed by the processor 401 .
  • the computer program product can be stored in the memory 402 , such as a program 404 , and the program 404 is finally converted into an executable target file that can be executed by the processor 401 through processes such as preprocessing, compiling, assembling and linking.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the method described in any method embodiment in the present application is implemented.
  • the computer program may be a high-level language program or an executable object program.
  • the computer readable storage medium is, for example, the memory 402 .
  • the memory 402 may be a volatile memory or a nonvolatile memory, or, the memory 402 may include both a volatile memory and a nonvolatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • by way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DRRAM).
  • the disclosed systems, devices and methods may be implemented in other ways. For example, some features of the method embodiments described above may be omitted, or not implemented.
  • the device embodiments described above are only illustrative, and the division of units is only a logical function division. In actual implementation, there may be other division methods, and multiple units or components may be combined or integrated into another system.
  • the coupling between the various units or the coupling between the various components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical or other forms of connection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An image classification method, comprising: acquiring a target image (S201); pre-processing the target image to generate a feature map of a preset format (S202); performing an inverse Fourier transform on the feature map of the preset format to generate a first inverse transform result (S203); splicing the feature map of the preset format and the first inverse transform result to generate a first splicing result (S204); performing feature extraction on the first splicing result to generate a first feature (S205); and determining an image classification result according to the first feature (S206). The problems of long calculation time and large memory occupation of an existing Transformer neural network model can be solved.

Description

A Method and Device for Image Classification

Technical Field

The present application relates to the field of image processing, and in particular to an image classification method and device.

Background Art

Transformer is a deep neural network based on a self-attention mechanism. It is used not only in the field of natural language processing but also in the field of image processing, for example to convert two-dimensional image data into one-dimensional sequences and to extract multi-scale features from two-dimensional images. However, the Transformer neural network model is extremely complex, resulting in a large memory footprint and a long training time.

Therefore, how to reduce the calculation time and memory usage of the existing Transformer neural network model is an urgent problem to be solved.
Technical Problem

One of the purposes of the embodiments of the present application is to provide an image classification method and device, aiming to solve the problems of long calculation time and large memory usage of the existing Transformer neural network model.

Technical Solution

The technical scheme adopted by the embodiments of the present application is as follows:
In a first aspect, an image classification method is provided, including: acquiring a target image; preprocessing the target image to generate a feature map in a preset format; performing an inverse Fourier transform on the feature map in the preset format to generate a first inverse transform result; splicing the feature map in the preset format and the first inverse transform result to generate a first splicing result; performing feature extraction on the first splicing result to generate a first feature; and determining an image classification result according to the first feature.

The above method can be executed by a chip on an electronic device. Compared with the existing Transformer neural network model, in which a complex self-attention layer performs multiple convolution passes on the feature map in the preset format, this application replaces that complex self-attention layer with an inverse Fourier transform, forming a new Transformer neural network model; the new model only needs to perform one inverse Fourier transform on the feature map in the preset format, without multiple convolution passes, to extract the features of the target image and determine its image classification result; consequently, it reduces both calculation time and memory usage during feature extraction and image classification.
Optionally, determining the image classification result according to the first feature includes: splicing the first feature and the first inverse transform result to generate a second splicing result; and classifying the second splicing result through at least one classification network, where any one of the at least one classification network includes a block merging module, a first normalization layer, a Fourier layer, a second normalization layer and a multi-layer perceptron; the block merging module merges the data input to the classification network, the first normalization layer normalizes the output of the block merging module, the Fourier layer performs inverse Fourier transform processing on the output of the first normalization layer, the second normalization layer normalizes the splicing result of the outputs of the block merging module and the Fourier layer, and the multi-layer perceptron performs feature extraction on the output of the second normalization layer.

Replacing the complex self-attention layer in the above classification network with the Fourier layer forms a new classification network, which performs image classification on the target image; compared with the existing classification network, the new classification network reduces both calculation time and memory usage when classifying the target image.
Optionally, the output result of the at least one classification network is processed through a softmax function to generate at least one probability value, and the at least one probability value is used to indicate the probability that the target image belongs to at least one image category.

Optionally, the inverse Fourier transform is an inverse fast Fourier transform. Replacing the complex self-attention layer with the inverse fast Fourier transform can speed up the operation of the new Transformer neural network model and the classification network.
Optionally, the formula of the inverse Fourier transform is:

x_n = \frac{1}{N} \sum_{k=0}^{N-1} x_k e^{i 2\pi kn/N}

where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
Optionally, preprocessing the target image includes: performing block segmentation processing on the target image to generate a block segmentation result; and processing the block segmentation result through a linear embedding layer to generate the feature map in the preset format.

In a second aspect, an image classification device is provided, including modules for performing any one of the methods of the first aspect.

In a third aspect, a computer-readable storage medium is provided, which stores a computer program; when the computer program is executed by a processor, the processor performs any one of the methods of the first aspect.
Description of Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings used in the embodiments or the exemplary descriptions are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of the image classification system in an embodiment of the present application;

Fig. 2 is a schematic flow diagram of the image classification method in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of the new Transformer neural network model in an embodiment of the present application;

Fig. 4 is a schematic diagram of the image classification device in an embodiment of the present application.
Embodiments of This Application

In the following description, specific details such as particular system structures and technologies are presented for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices and methods are omitted so that unnecessary detail does not obscure the description of the present application.

It should be understood that when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

It should also be understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

In addition, in the description of this specification and the appended claims, the terms "first", "second", "third" and so on are only used to distinguish the descriptions and should not be understood as indicating or implying relative importance.

References to "one embodiment" or "some embodiments" and the like in this specification mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, the phrases "in one embodiment", "in some embodiments", "in other embodiments", etc., appearing in various places in this specification do not necessarily all refer to the same embodiment; they mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "including", "comprising", "having" and their variations mean "including but not limited to", unless specifically stated otherwise.
With the rapid development of image processing technology, the Transformer neural network model has been applied in the field of image processing. However, the complex attention layer in the existing Transformer neural network model results in long calculation time and large memory usage. To solve these problems, this application replaces the complex self-attention layer in the existing Transformer neural network model with an inverse Fourier transform layer, forming a new Transformer neural network model that avoids the long calculation time and large memory occupation of the existing model.

The present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows an image classification system provided by the present application. The system comprises an input unit 101, a preprocessing unit 102, a neural network model 103 and an output unit 104. The input unit 101 receives the target image input by the user and sends it to the preprocessing unit 102; the preprocessing unit 102 performs image segmentation and matrix transformation preprocessing on the target image and sends a feature map in a preset format to the neural network model 103; the neural network model 103 performs feature extraction and image classification on the feature map and sends the image classification result to the output unit 104; the output unit 104 receives the image classification result from the neural network model 103 and outputs it to the client so that the user can view it.
To reduce the calculation time and memory usage of the existing Transformer neural network model, the present application proposes an image classification method, as shown in Fig. 2. The method can be executed by a chip in an electronic device and includes:

S201. Acquire a target image.

Exemplarily, the electronic device may acquire the target image to be classified, input by the user, through the input unit 101 shown in Fig. 1. The target image may be an image of any size, for example 224×224; this application does not limit the size of the target image.
S202. Preprocess the target image to generate a feature map in a preset format.

Exemplarily, the electronic device preprocesses the target image through the preprocessing unit 102 to generate the feature map in the preset format, which is the output of the preprocessing unit 102 and is used for the inverse Fourier transform. The preprocessing unit 102 includes a block segmentation module and a linear embedding layer: the block segmentation module performs block segmentation on the target image, and the linear embedding layer performs matrix transformation on the block segmentation result. For example, the user inputs a 3-channel (RGB) target image of size 224×224 through the input unit 101; the block segmentation module divides it into 56×56 blocks, each a 3-channel image of size 4×4, so the block segmentation result is a 56×56×3 image. The 56×56×3 image is input to the linear embedding layer for channel expansion and matrix transformation to obtain a 64×49×128 image (the feature map in the preset format), where 128 is the number of channels: the linear embedding layer either expands the 56×56×3 image to 128 channels to obtain a 56×56×128 image and then applies a matrix transformation to obtain a 64×49×128 image, or first applies a matrix transformation to the 56×56×3 image to obtain a 64×49×3 image and then expands it to 128 channels to obtain a 64×49×128 image.

Exemplarily, the electronic device sends the feature map in the preset format output by the preprocessing unit 102 to the neural network model 103, which is the new Transformer neural network model and includes a sub-network 201 and at least one classification network. For example, the new Transformer neural network model shown in Fig. 3 includes one sub-network 201 and three classification networks: classification network 202, classification network 203 and classification network 204. The sub-network 201 includes a normalization layer 2011, a Fourier layer 2012, a normalization layer 2013 and a multi-layer perceptron 2014. The normalization layer 2011 normalizes the feature map in the preset format sent by the preprocessing unit 102; the Fourier layer 2012 performs an inverse Fourier transform on the output of the normalization layer 2011; the normalization layer 2013 normalizes the output of the Fourier layer 2012; and the multi-layer perceptron 2014 performs feature extraction on the output of the normalization layer 2013. For example, the electronic device sends the feature map in the preset format output by the preprocessing unit 102 to the normalization layer 2011 of the sub-network 201 for normalization, generating a first normalization result.
S203,对预设格式的特征图进行傅里叶逆变换,生成第一逆变换结果。S203. Perform an inverse Fourier transform on the feature map in a preset format to generate a first inverse transform result.
Exemplarily, the normalization layer 2011 in the sub-network 201 first normalizes the feature map in the preset format and then sends the first normalization result to the Fourier layer 2012. The Fourier layer 2012 receives the first normalization result from the normalization layer 2011 and performs an inverse Fourier transform on it to obtain the first inverse transform result. The inverse Fourier transform is given by:
$$x_n = \frac{1}{N}\sum_{k=0}^{N-1} x_k \, e^{i 2\pi k n / N}$$
where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point. The Fourier layer 2012 in the new Transformer neural network model replaces the complex self-attention layer of the existing Transformer neural network model. Compared with the existing model, the new Transformer neural network model can extract image features by performing a single inverse Fourier transform on the target image, without performing multiple complex convolution operations to extract features and classify the target image; as a result, both the computation time and the memory footprint of feature extraction are reduced.
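As a hedged illustration of how such a Fourier layer might be realized (a sketch only: the transform axis, the discarding of the imaginary part, and the reliance on torch.fft are assumptions, since the publication does not specify these details):

import torch
import torch.nn as nn

class FourierLayer(nn.Module):
    # Replaces self-attention with a parameter-free inverse discrete Fourier
    # transform that mixes information across tokens.
    def forward(self, x):                  # x: (B, windows, tokens, channels)
        # torch.fft.ifft applies the 1/N-scaled inverse DFT of the formula
        # above, computed via the fast Fourier transform (see the optional
        # acceleration mentioned next). The real part is kept so that
        # subsequent layers operate on real-valued features.
        return torch.fft.ifft(x, dim=-2).real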
Optionally, the Fourier layer 2012 may instead perform an inverse fast Fourier transform on the first normalization result to obtain the first inverse transform result. Performing an inverse fast Fourier transform on the first normalization result accelerates the new Transformer neural network model.
S204. Concatenate the feature map in the preset format with the first inverse transform result to generate a first concatenation result.
Exemplarily, the sub-network 201 concatenates the feature map in the preset format with the first inverse transform result output by the Fourier layer 2012 to obtain the first concatenation result (the concatenated image). Here, the concatenation operation means summing the feature map in the preset format and the first inverse transform result pixel by pixel. For example, if the pixel at position A of the 64×49×128 image output by the preprocessing unit 102 has value X1, and the pixel at position A of the 64×49×128 first inverse transform result has value X2, concatenating the two means adding these values (X1+X2), and the sum is the value at position A of the first concatenation result. Pixels at all other positions of the feature map in the preset format and the first inverse transform result are concatenated in the same way as the pixel at position A, and the details are not repeated here.
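In code, this concatenation is simply an element-wise residual addition. A one-line sketch, reusing the FourierLayer sketch above (norm1 stands for a hypothetical normalization module corresponding to layer 2011):

# x: feature map in the preset format, shape (B, 64, 49, 128)
first_concat = x + FourierLayer()(norm1(x))   # pixel-wise sum X1 + X2 at every position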
S205. Perform feature extraction on the first concatenation result to generate a first feature.
Exemplarily, the multilayer perceptron 2014 of the sub-network 201 performs feature extraction on the first concatenation result to obtain the first feature. The multilayer perceptron 2014 is a neural network composed of fully connected layers containing at least one hidden layer, where the output of each hidden layer is transformed by an activation function. The number of layers of the multilayer perceptron 2014 and the number of hidden units in each hidden layer are hyperparameters. For example, if the first concatenation result is a 64×49×128 concatenated image, the multilayer perceptron 2014 performs feature extraction on it to obtain a 64×49×128 feature map (the first feature).
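A minimal sketch of such a multilayer perceptron (the single hidden layer, its width, and the GELU activation are assumptions; the publication only states that the layer count and hidden-unit counts are hyperparameters):

import torch.nn as nn

class MLP(nn.Module):
    # Fully connected layers applied per token; one hidden layer whose
    # output is transformed by an activation function.
    def __init__(self, dim=128, hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):                  # x: (B, 64, 49, 128) -> same shape
        return self.fc2(self.act(self.fc1(x)))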
S206. Determine an image classification result according to the first feature.
Exemplarily, at least one classification network in the neural network model 103 performs image classification on the first feature. Taking the neural network model 103 shown in Figure 3 (the new Transformer neural network model) as an example, the model includes one sub-network 201 and three classification networks: classification network 202, classification network 203, and classification network 204. The classification network 202 includes a block merging module 2021, a first normalization layer 2022, a Fourier layer 2023, a second normalization layer 2024, and a multilayer perceptron 2025. The block merging module 2021 performs merging, channel expansion, and matrix transformation on the data input to the classification network 202; the first normalization layer 2022 normalizes the output of the block merging module 2021; the Fourier layer 2023 performs an inverse Fourier transform on the output of the first normalization layer 2022; the second normalization layer 2024 normalizes the concatenation of the outputs of the block merging module 2021 and the Fourier layer 2023; and the multilayer perceptron 2025 performs feature extraction on the output of the second normalization layer 2024.
For example, the block merging module 2021 in the classification network 202 performs merging, channel expansion, and matrix transformation on the first feature output by the sub-network 201. If the first feature is a 64×49×128 feature map, the block merging module 2021 either merges the image blocks of the first feature pairwise to obtain a 28×28×128 image, expands it to 256 channels to obtain a 28×28×256 image, and applies a matrix transformation to obtain a 16×49×256 image; or expands the first feature to 256 channels to obtain a 64×49×256 image, merges the blocks pairwise to obtain a 28×28×256 image, and applies a matrix transformation to obtain a 16×49×256 image. The first normalization layer 2022 normalizes the 16×49×256 image output by the block merging module 2021; the Fourier layer 2023 performs an inverse Fourier transform on the normalized 16×49×256 image; the second normalization layer 2024 normalizes the concatenation of the 16×49×256 image output by the block merging module 2021 and the inverse-transformed 16×49×256 image to obtain a second normalized 16×49×256 image; and the multilayer perceptron 2025 performs feature extraction on the second normalized image to obtain a 16×49×256 feature map (the second feature).
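Putting the modules of one classification network together, a hedged sketch might look as follows. The linear projection used for block merging and the merge_blocks helper (which would regroup 2×2 neighboring blocks and apply the projection) are hypothetical; the FourierLayer and MLP sketches above are reused.

import torch.nn as nn

class ClassificationStage(nn.Module):
    # Block merging -> first normalization -> Fourier layer -> pixel-wise
    # sum -> second normalization -> MLP, mirroring modules 2021-2025.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(4 * in_dim, out_dim)   # channel expansion after 2x2 merging
        self.norm1 = nn.LayerNorm(out_dim)
        self.fourier = FourierLayer()
        self.norm2 = nn.LayerNorm(out_dim)
        self.mlp = MLP(out_dim, 4 * out_dim)

    def forward(self, x):                  # x: (B, windows, tokens, in_dim)
        x = merge_blocks(x, self.proj)     # hypothetical helper: pairwise block merging
        y = self.norm2(x + self.fourier(self.norm1(x)))
        return self.mlp(y)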
The classification network 203 includes a block merging module 2031, a first normalization layer 2032, a Fourier layer 2033, a second normalization layer 2034, and a multilayer perceptron 2035. The block merging module 2031 performs merging, channel expansion, and matrix transformation on the data input to the classification network 203; the first normalization layer 2032 normalizes the output of the block merging module 2031; the Fourier layer 2033 performs an inverse Fourier transform on the output of the first normalization layer 2032; the second normalization layer 2034 normalizes the concatenation of the outputs of the block merging module 2031 and the Fourier layer 2033; and the multilayer perceptron 2035 performs feature extraction on the output of the second normalization layer 2034.
For example, the block merging module 2031 in the classification network 203 performs merging, channel expansion, and matrix transformation on the second feature output by the classification network 202. If the second feature is a 16×49×256 feature map, the block merging module 2031 either merges the image blocks of the second feature pairwise to obtain a 14×14×256 image, expands it to 512 channels to obtain a 14×14×512 image, and applies a matrix transformation to obtain a 2×49×512 image; or expands the second feature to 512 channels to obtain a 16×49×512 image, merges the blocks pairwise to obtain a 14×14×512 image, and applies a matrix transformation to obtain a 2×49×512 image. The first normalization layer 2032 normalizes the 2×49×512 image output by the block merging module 2031; the Fourier layer 2033 performs an inverse Fourier transform on the normalized 2×49×512 image; the second normalization layer 2034 normalizes the concatenation of the 2×49×512 image output by the block merging module 2031 and the inverse-transformed 2×49×512 image to obtain a second normalized 2×49×512 image; and the multilayer perceptron 2035 performs feature extraction on the second normalized image to obtain a 2×49×512 feature map (the third feature).
The classification network 204 includes a block merging module 2041, a first normalization layer 2042, a Fourier layer 2043, a second normalization layer 2044, and a multilayer perceptron 2045. The block merging module 2041 performs merging, channel expansion, and matrix transformation on the data input to the classification network 204; the first normalization layer 2042 normalizes the output of the block merging module 2041; the Fourier layer 2043 performs an inverse Fourier transform on the output of the first normalization layer 2042; the second normalization layer 2044 normalizes the concatenation of the outputs of the block merging module 2041 and the Fourier layer 2043; and the multilayer perceptron 2045 performs feature extraction on the output of the second normalization layer 2044.
For example, the block merging module 2041 in the classification network 204 performs merging and channel expansion on the third feature output by the classification network 203. If the third feature is a 2×49×512 feature map, the block merging module 2041 either merges the image blocks of the third feature pairwise to obtain a 7×7×512 image, expands it to 1024 channels to obtain a 7×7×1024 image, and applies a matrix transformation to obtain a 1×49×1024 image; or expands the third feature to 1024 channels to obtain a 2×49×1024 image, merges the blocks pairwise to obtain a 7×7×1024 image, and applies a matrix transformation to obtain a 1×49×1024 image. The first normalization layer 2042 normalizes the 1×49×1024 image output by the block merging module 2041; the Fourier layer 2043 performs an inverse Fourier transform on the normalized 1×49×1024 image; the second normalization layer 2044 normalizes the concatenation of the 1×49×1024 image output by the block merging module 2041 and the inverse-transformed 1×49×1024 image to obtain a second normalized 1×49×1024 image; and the multilayer perceptron 2045 performs feature extraction on the second normalized image to obtain a 1×49×1024 feature map (the fourth feature). It can thus be seen that, by replacing the complex self-attention layers in the classification networks 202, 203, and 204 with Fourier layers, the present application not only reduces the complexity of these classification networks but also saves computation time and memory.
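Under the dimensions used in this example, the three stages could be chained as follows (an illustration only; ClassificationStage is the sketch given above):

import torch.nn as nn

stages = nn.Sequential(
    ClassificationStage(128, 256),   # 64x49x128 -> 16x49x256 (classification network 202)
    ClassificationStage(256, 512),   # 16x49x256 -> 2x49x512  (classification network 203)
    ClassificationStage(512, 1024),  # 2x49x512  -> 1x49x1024 (classification network 204)
)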
Exemplarily, the output of the at least one classification network is processed by a Softmax function to generate at least one probability value, the at least one probability value indicating the probability that the target image belongs to at least one image category. The Softmax function processes the output of at least one classification network in the neural network model 103 to obtain at least one probability value. For example, the Softmax function processes the fourth feature map (the 1×49×1024 feature map) output by the neural network model 103 shown in Figure 3 to obtain the probabilities that the fourth feature map corresponds to different picture categories (at least one probability value); these probabilities are sorted from high to low, and the category with the highest probability is determined as the picture category of the fourth feature map. For example, suppose there are three picture categories: animals, people, and landscapes. If the target picture is a natural landscape, the target picture passes through the neural network model 103 to produce the fourth feature map, and the Softmax function then outputs the probabilities of the fourth feature map for the three categories. If the probability of the animal category is 20%, the probability of the people category is 40%, and the probability of the landscape category is 90%, the probabilities are ranked from high to low (landscape > people > animal), and the fourth feature map is finally determined to belong to the landscape category (the image classification result of the target image in Figure 3).
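As a sketch of this step (logits stands for the flattened output of the last classification network, a name assumed here for illustration):

import torch

probs = torch.softmax(logits, dim=-1)   # one probability per image category
pred = probs.argmax(dim=-1)             # category with the highest probability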
Exemplarily, the neural network model 103 is trained on the first 50 classes of the ImageNet-1K dataset, with a training set of 64,817 pictures and a validation set of 2,500 pictures across the 50 classes. During training, the network prediction (the output of the neural network model 103) is fed to the Softmax function and converted into probability values for the different categories, using the formula:
$$\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
The output of the Softmax function is then used to compute the loss:
$$L = -\sum_{i} p(x_i)\,\log q(x_i)$$
where p(x_i) is the true probability distribution and q(x_i) is the predicted probability distribution. Training uses the Adam optimizer and a cosine-decay learning-rate scheduler for 300 epochs (the number of passes over the training dataset), with the pictures fed into the neural network model 103 in groups of 64 (batch size set to 64); one pass is complete once all 64,817 pictures have been trained on.
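A hedged sketch of this training configuration (NewTransformer and train_set are hypothetical placeholders for the model 103 and the 64,817-image training set, and the learning rate is an assumption; the optimizer, scheduler, epoch count, and batch size follow the text):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = NewTransformer(num_classes=50)           # hypothetical constructor for model 103
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
criterion = nn.CrossEntropyLoss()                # combines the Softmax and loss formulas above
loader = DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(300):                         # 300 passes over the training set
    for images, labels in loader:                # groups of 64 pictures per step
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                             # cosine learning-rate decay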
After the neural network model 103 is trained, its performance is tested on the validation set. Table 1 shows the image classification results of the neural network model 103 and the Swin-B method on the first 50 classes of the ImageNet-1K dataset, where Methods is the method name; ImageSize is the input image size during training; Param is the number of training parameters; Throughput (image/s) is the throughput (pictures processed per second); ImageNet is the classification accuracy on the ImageNet dataset, i.e., the rate at which the top-ranked predicted category matches the actual category; FLOPs (floating-point operations) measures the computing power required by the network model, i.e., its complexity; Step is the time per training step at a batch size of 64; and Epoch is the time per training epoch at a batch size of 64, measured on 4 GPUs (GeForce GTX 1080). Compared with the Swin-B method, the neural network model 103 of the present application has 32% fewer parameters, requires 33.7% fewer floating-point operations, takes 32.6% less time, and reaches an accuracy of 90%.
Table 1. Comparison with other methods on the first 50 classes of the ImageNet-1K dataset
[Table 1 appears as an image (PCTCN2021119682-appb-000005) in the original publication and is not reproduced here; it lists Methods, ImageSize, Param, Throughput (image/s), ImageNet accuracy, FLOPs, Step, and Epoch for the neural network model 103 and Swin-B.]
Compared with the existing Transformer neural network model, which uses a complex self-attention layer to perform multiple convolution operations on the feature map in the preset format, the present application replaces that complex self-attention layer with an inverse Fourier transform, forming a new Transformer neural network model. The new Transformer neural network model only needs to perform a single inverse Fourier transform on the feature map in the preset format, rather than multiple convolution operations, to extract the features of the target image and determine its image classification result. Compared with the existing Transformer neural network model, the new model therefore reduces both computation time and memory usage during feature extraction and image classification of the target image.
In addition, the present application also replaces the complex self-attention layer in the above classification networks with a Fourier layer, forming new classification networks that classify the target image. Compared with the existing classification networks, the new classification networks reduce both computation time and memory usage when classifying the target image.
Figure 4 shows a schematic structural diagram of an image classification apparatus provided by the present application. The dotted lines in Figure 4 indicate that the corresponding unit or module is optional. The apparatus 400 may be used to implement the methods described in the above method embodiments. The apparatus 400 may be a terminal device, a server, or a chip.
The apparatus 400 includes one or more processors 401, which can support the apparatus 400 in implementing the method of the method embodiment corresponding to Figure 2. The processor 401 may be a general-purpose processor or a special-purpose processor. For example, the processor 401 may be a central processing unit (CPU). The CPU may be used to control the apparatus 400, execute software programs, and process data of the software programs. The apparatus 400 may further include a communication unit 405 for implementing signal input (reception) and output (transmission).
For example, the apparatus 400 may be a chip, and the communication unit 405 may be an input and/or output circuit of the chip, or the communication unit 405 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.
For another example, the apparatus 400 may be a terminal device, and the communication unit 405 may be a transceiver of the terminal device, or the communication unit 405 may be a transceiver circuit of the terminal device.
The apparatus 400 may include one or more memories 402 storing a program 404, which can be run by the processor 401 to generate instructions 403 so that the processor 401 performs the methods described in the above method embodiments according to the instructions 403. Optionally, data (such as the ID of a chip to be tested) may also be stored in the memory 402. Optionally, the processor 401 may also read the data stored in the memory 402; the data may be stored at the same storage address as the program 404 or at a different storage address.
The processor 401 and the memory 402 may be provided separately or integrated together, for example, integrated on a system on chip (SOC) of a terminal device.
For the specific manner in which the processor 401 executes the method of starting the burn-in test, reference may be made to the relevant descriptions in the method embodiments.
It should be understood that the steps of the above method embodiments may be completed by logic circuits in the form of hardware or instructions in the form of software in the processor 401. The processor 401 may be a CPU, a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device, for example, discrete gates, transistor logic devices, or discrete hardware components.
The present application also provides a computer program product that, when executed by the processor 401, implements the method of any method embodiment of the present application.
The computer program product may be stored in the memory 402, for example, as the program 404, which is finally converted, through processing steps such as preprocessing, compiling, assembling, and linking, into an executable object file that can be executed by the processor 401.
The present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a computer, the method of any method embodiment of the present application is implemented. The computer program may be a high-level language program or an executable object program.
The computer-readable storage medium is, for example, the memory 402. The memory 402 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DRRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes and technical effects of the apparatus and device described above may refer to the corresponding processes and technical effects in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, some features of the method embodiments described above may be ignored or not performed. The apparatus embodiments described above are merely illustrative; the division of units is only a logical functional division, and there may be other division manners in actual implementation, and multiple units or components may be combined or integrated into another system. In addition, the coupling between units or between components may be direct or indirect coupling, and such coupling includes electrical, mechanical, or other forms of connection.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (8)

  1. A method for image classification, wherein the method comprises:
    obtaining a target image;
    preprocessing the target image to generate a feature map in a preset format;
    performing an inverse Fourier transform on the feature map in the preset format to generate a first inverse transform result;
    concatenating the feature map in the preset format with the first inverse transform result to generate a first concatenation result;
    performing feature extraction on the first concatenation result to generate a first feature;
    determining an image classification result according to the first feature.
  2. The method according to claim 1, wherein determining the image classification result according to the first feature comprises:
    concatenating the first feature with the first inverse transform result to generate a second concatenation result;
    classifying the second concatenation result through at least one classification network, wherein any one of the at least one classification network comprises a block merging module, a first normalization layer, a Fourier layer, a second normalization layer, and a multilayer perceptron; the block merging module is configured to merge the data input to the classification network; the first normalization layer is configured to normalize the output of the block merging module; the Fourier layer is configured to perform an inverse Fourier transform on the output of the first normalization layer; the second normalization layer is configured to normalize the concatenation of the outputs of the block merging module and the Fourier layer; and the multilayer perceptron is configured to perform feature extraction on the output of the second normalization layer.
  3. The method according to claim 2, further comprising:
    processing the output of the at least one classification network through a softmax function to generate at least one probability value, the at least one probability value indicating the probability that the target image belongs to at least one image category.
  4. The method according to any one of claims 1 to 3, wherein the inverse Fourier transform is an inverse fast Fourier transform.
  5. The method according to any one of claims 1 to 3, wherein the formula of the inverse Fourier transform is:
    $$x_n = \frac{1}{N}\sum_{k=0}^{N-1} x_k \, e^{i 2\pi k n / N}$$
    wherein x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
  6. The method according to any one of claims 1 to 3, wherein preprocessing the target image comprises:
    performing block segmentation on the target image to generate a block segmentation result;
    processing the block segmentation result through a linear embedding layer to generate the feature map in the preset format.
  7. An apparatus for image classification, wherein the apparatus comprises a processor and a memory, the memory being configured to store a computer program, and the processor being configured to call and run the computer program from the memory so that the apparatus performs the method of any one of claims 1 to 6.
  8. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the processor is caused to perform the method of any one of claims 1 to 6.
PCT/CN2021/119682 2021-09-22 2021-09-22 Image classification method and apparatus WO2023044612A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/119682 WO2023044612A1 (en) 2021-09-22 2021-09-22 Image classification method and apparatus


Publications (1)

Publication Number Publication Date
WO2023044612A1 true WO2023044612A1 (en) 2023-03-30

Family ID: 85719789

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119682 WO2023044612A1 (en) 2021-09-22 2021-09-22 Image classification method and apparatus

Country Status (1)

Country Link
WO (1) WO2023044612A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091340A (en) * 2014-07-18 2014-10-08 厦门美图之家科技有限公司 Blurred image rapid detection method
CN109712119A (en) * 2018-12-13 2019-05-03 深圳先进技术研究院 A kind of magnetic resonance imaging and patch recognition methods and device
CN109964250A (en) * 2016-12-12 2019-07-02 德州仪器公司 For analyzing the method and system of the image in convolutional neural networks
US20190230380A1 (en) * 2018-01-25 2019-07-25 Fujitsu Limited Data compression apparatus and data compression method
CN111012336A (en) * 2019-12-06 2020-04-17 重庆邮电大学 Parallel convolutional network motor imagery electroencephalogram classification method based on spatio-temporal feature fusion
CN112232448A (en) * 2020-12-14 2021-01-15 北京大恒普信医疗技术有限公司 Image classification method and device, electronic equipment and storage medium
CN113361636A (en) * 2021-06-30 2021-09-07 山东建筑大学 Image classification method, system, medium and electronic device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21957756

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE