WO2023044612A1 - Image classification method and apparatus - Google Patents

Image classification method and apparatus

Info

Publication number
WO2023044612A1
WO2023044612A1 PCT/CN2021/119682 CN2021119682W
Authority
WO
WIPO (PCT)
Prior art keywords
result
image
layer
feature
classification
Prior art date
Application number
PCT/CN2021/119682
Other languages
French (fr)
Chinese (zh)
Inventor
刘宝玉
王磊
马晓亮
程俊
Original Assignee
深圳先进技术研究院
中国科学院深圳理工大学(筹)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院, 中国科学院深圳理工大学(筹) filed Critical 深圳先进技术研究院
Priority to PCT/CN2021/119682 priority Critical patent/WO2023044612A1/en
Publication of WO2023044612A1 publication Critical patent/WO2023044612A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space

Definitions

  • the present application relates to the field of image processing, and in particular to an image classification method and device.
  • Transformer is a deep neural network based on a self-attention mechanism, which is not only used in the field of natural language processing but also in the field of image processing, such as converting two-dimensional image data into one-dimensional sequences and extracting multi-scale features from two-dimensional images.
  • the Transformer neural network model is extremely complex, resulting in a large memory footprint and a long training time for the neural network model.
  • One of the purposes of the embodiments of the present application is to provide a method and device for image classification, aiming at solving the problems of long calculation time and large memory usage of the existing Transformer neural network model.
  • an image classification method, including: acquiring a target image; preprocessing the target image to generate a feature map in a preset format; performing an inverse Fourier transform on the feature map in the preset format to generate a first inverse transform result; splicing the feature map in the preset format and the first inverse transform result to generate a first splicing result; performing feature extraction on the first splicing result to generate a first feature; and determining an image classification result according to the first feature.
  • the above method can be executed by a chip on an electronic device.
  • compared with the existing Transformer neural network model, in which a complex self-attention layer performs multiple convolution passes on the feature map in the preset format, this application replaces that complex self-attention layer with an inverse Fourier transform, forming a new Transformer neural network model; the new Transformer neural network model only needs to perform one inverse Fourier transform on the feature map in the preset format, without multiple convolution passes, to extract the features of the target image and determine its image classification result; consequently, the new model reduces both calculation time and memory usage during feature extraction and image classification.
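To make the replacement concrete, here is a minimal sketch (our own illustration, not code published with the patent) of a Fourier-based token-mixing step standing in for self-attention; the tensor layout and the choice of transform axes are assumptions:

```python
import torch

def fourier_token_mixing(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, tokens, channels). One 2-D inverse FFT over the token and
    # channel axes mixes information globally in O(n log n), with no
    # attention maps to compute or store.
    return torch.fft.ifft2(x, dim=(-2, -1)).real  # keep the real part for downstream layers
```

A layer like this has no learned weights, which is where the parameter and memory savings over self-attention come from.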
  • determining the image classification result according to the first feature includes: splicing the first feature and the first inverse transform result to generate a second splicing result; and classifying the second splicing result through at least one classification network, where any one of the at least one classification network includes a block merging module, a first normalization layer, a Fourier layer, a second normalization layer and a multi-layer perceptron; the block merging module merges the data input to the classification network, the first normalization layer normalizes the output of the block merging module, the Fourier layer performs inverse Fourier transform processing on the output of the first normalization layer, the second normalization layer normalizes the splicing result of the outputs of the block merging module and the Fourier layer, and the multi-layer perceptron performs feature extraction on the output of the second normalization layer.
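The following PyTorch sketch shows how such a classification network could be assembled from the named pieces; the layer widths, the 2×2 merge, LayerNorm, GELU, and the transform axes are our assumptions rather than details fixed by the text:

```python
import torch
from torch import nn

class ClassificationNetwork(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.merge = nn.Linear(4 * in_dim, out_dim)  # block merging: 2x2 neighbouring blocks -> wider channels
        self.norm1 = nn.LayerNorm(out_dim)           # first normalization layer
        self.norm2 = nn.LayerNorm(out_dim)           # second normalization layer
        self.mlp = nn.Sequential(                    # multi-layer perceptron
            nn.Linear(out_dim, 4 * out_dim), nn.GELU(), nn.Linear(4 * out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, h, w, c = x.shape                          # x: (batch, H, W, C)
        x = (x.reshape(b, h // 2, 2, w // 2, 2, c)    # group each 2x2 neighbourhood of blocks
              .permute(0, 1, 3, 2, 4, 5)
              .reshape(b, h // 2, w // 2, 4 * c))
        merged = self.merge(x)                                    # block merging output
        y = torch.fft.ifft2(self.norm1(merged), dim=(1, 2)).real  # Fourier layer: inverse FFT over space
        z = self.norm2(merged + y)                                # splice (element-wise sum), then normalize
        return self.mlp(z)                                        # feature extraction
```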
  • the complex self-attention layer in the above classification network is replaced by the Fourier layer to form a new classification network, which performs image classification on the target image; compared with the existing classification network, the new classification network reduces both calculation time and memory usage when classifying the target image.
  • the output result of the at least one classification network is processed through a softmax function to generate at least one probability value, and the at least one probability value is used to indicate the probability that the target image belongs to at least one image category.
  • the inverse Fourier transform is an inverse fast Fourier transform. Replacing the complex self-attention layer with inverse fast Fourier transform can speed up the operation speed of the new Transformer neural network model and classification network.
  • the formula of the inverse Fourier transform is:

    x_n = \frac{1}{N} \sum_{k=0}^{N-1} x_k e^{i 2\pi kn/N}

    where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
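As a sanity check, a direct evaluation of this formula agrees with NumPy's FFT-based inverse transform (a quick illustration, not part of the patent):

```python
import numpy as np

N = 8                                                # number of time-domain sampling points
x_k = np.random.randn(N) + 1j * np.random.randn(N)   # frequency-domain discrete signal
n = np.arange(N)[:, None]
k = np.arange(N)[None, :]
# x_n = (1/N) * sum_k x_k * exp(i * 2*pi * k * n / N), evaluated directly
x_n = (x_k * np.exp(2j * np.pi * k * n / N)).sum(axis=1) / N
assert np.allclose(x_n, np.fft.ifft(x_k))            # matches the fast O(N log N) inverse FFT
```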
  • the preprocessing of the target image includes: performing block segmentation processing on the target image to generate a block segmentation result; and processing the block segmentation result through a linear embedding layer to generate the feature map in the preset format.
  • an image classification device including a module for performing any one of the methods in the first aspect.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the processor performs any one of the methods of the first aspect.
  • FIG. 1 is a schematic structural diagram of an image classification system in an embodiment of the present application
  • FIG. 2 is a schematic flow diagram of a method for classifying images in an embodiment of the present application
  • Fig. 3 is the structural representation of new Transformer neural network model in the embodiment of the present application.
  • FIG. 4 is a schematic diagram of an image classification device in an embodiment of the present application.
  • references to "one embodiment" or "some embodiments" and the like in the specification of the present application mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc., appearing in various places in this specification do not necessarily all refer to the same embodiment; they mean "one or more but not all embodiments" unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • the Transformer neural network model is used in the field of image processing.
  • the complex attention layer in the existing Transformer neural network model leads to long calculation time and large memory usage of the model.
  • to solve these problems, this application replaces the complex self-attention layer in the existing Transformer neural network model with an inverse Fourier transform layer, forming a new Transformer neural network model.
  • the new Transformer neural network model solves the problems of long calculation time and large memory occupation of the existing Transformer neural network model.
  • Fig. 1 shows an image classification system provided by the present application. The system comprises an input unit 101, a preprocessing unit 102, a neural network model 103 and an output unit 104. The input unit 101 receives the target image input by the user and sends it to the preprocessing unit 102; the preprocessing unit 102 performs image segmentation and matrix transformation preprocessing on the target image and sends a feature map in a preset format to the neural network model 103; the neural network model 103 performs feature extraction and image classification on the feature map and sends the image classification result to the output unit 104; the output unit 104 receives the image classification result from the neural network model 103 and outputs it to the client so that the user can view it.
  • the present application proposes a method for image classification, as shown in Figure 2, the method can be executed by a chip in an electronic device, and the method includes:
  • the electronic device may acquire the target image to be classified input by the user through the input unit 101 shown in FIG. 1 .
  • the target image may be an image of any size, for example, a target image with a size of 224 ⁇ 224, and this application does not impose any limitation on the size of the target image.
  • the electronic device preprocesses the target image through the preprocessing unit 102 to generate a feature map in a preset format.
  • the feature map in the preset format is the output result of the preprocessing unit 102, and is used for inverse Fourier transform.
  • the above-mentioned preprocessing unit 102 includes a block segmentation module and a linear embedding layer, wherein the block segmentation module is used to perform block segmentation processing on the target image; the linear embedding layer is used to perform matrix transformation on the block segmentation result.
  • the user inputs a 3-channel (i.e., RGB) target image of size 224×224 through the input unit 101. The block segmentation module performs block segmentation on the 224×224 target image, dividing it into 56×56 blocks, each a 3-channel image of size 4×4; that is, the output of the block segmentation module (the block segmentation result) is a 56×56×3 image. The 56×56×3 image is input to the linear embedding layer for channel expansion and matrix transformation, yielding a 64×49×128 image (the feature map in the preset format), where 128 is the number of channels. That is, the linear embedding layer either expands the 56×56×3 image to 128 channels to obtain a 56×56×128 image and then applies a matrix transformation to obtain a 64×49×128 image, or first applies a matrix transformation to the 56×56×3 image to obtain a 64×49×3 image and then expands it to 128 channels to obtain a 64×49×128 image.
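The shape bookkeeping in this example can be reproduced with a few lines of PyTorch (a sketch under our assumptions; in particular, the final regrouping of the 3136 tokens into 64 windows of 49 is shown as a plain reshape, and the real model may order the tokens differently):

```python
import torch
from torch import nn

image = torch.randn(1, 3, 224, 224)                             # RGB target image
patches = nn.functional.unfold(image, kernel_size=4, stride=4)  # 4x4 blocks -> (1, 48, 3136), 3136 = 56*56
patches = patches.transpose(1, 2)                               # (1, 3136, 48): one 4x4x3 block per row
embed = nn.Linear(48, 128)                                      # linear embedding: expand to 128 channels
tokens = embed(patches)                                         # (1, 3136, 128)
feature_map = tokens.reshape(1, 64, 49, 128)                    # preset-format feature map, 64x49x128
```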
  • the electronic device sends the feature map in the preset format output by the preprocessing unit 102 to the neural network model 103. The neural network model 103 is the new Transformer neural network model, including a sub-network 201 and at least one classification network; for example, the new Transformer neural network model shown in Figure 3 includes one sub-network 201 and three classification networks: classification network 202, classification network 203 and classification network 204.
  • the above-mentioned sub-network 201 includes a normalization layer 2011, a Fourier layer 2012, a normalization layer 2013 and a multi-layer perceptron 2014. The normalization layer 2011 normalizes the feature map in the preset format sent by the preprocessing unit 102; the Fourier layer 2012 performs an inverse Fourier transform on the output of the normalization layer 2011; the normalization layer 2013 normalizes the output of the Fourier layer 2012; and the multi-layer perceptron 2014 performs feature extraction on the output of the normalization layer 2013.
  • the normalization layer 2011 in the subnetwork 201 first performs a normalization operation on the feature map in the preset format, and then sends the first normalization result output by the normalization layer 2011 to the Fourier layer 2012;
  • the Fourier layer 2012 receives the first normalization result from the normalization layer 2011 and performs an inverse Fourier transform on it to obtain the first inverse transform result. The formula of the above inverse Fourier transform is:

    x_n = \frac{1}{N} \sum_{k=0}^{N-1} x_k e^{i 2\pi kn/N}

    where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
  • the Fourier layer 2012 in the new Transformer neural network model replaces the complex self-attention layer in the existing Transformer neural network model.
  • compared with the existing Transformer neural network model, the new model can extract image features with a single inverse Fourier transform instead of multiple complex convolution operations; therefore, both its calculation time and its memory usage during feature extraction are reduced.
  • optionally, the Fourier layer 2012 may instead perform an inverse fast Fourier transform on the first normalization result to obtain the first inverse transform result, which accelerates the operation of the new Transformer neural network model.
  • the sub-network 201 splices the feature map in the preset format with the first inverse transform result output by the Fourier layer 2012 to obtain the first splicing result (i.e., the spliced image). The above splicing refers to summing the corresponding pixels of the feature map in the preset format and the first inverse transform result.
  • for example, suppose the pixel at position A in the 64×49×128 image output by the preprocessing unit 102 has value X1, and the pixel at position A in the 64×49×128 first inverse transform result has value X2. Splicing the feature map in the preset format with the first inverse transform result means adding these two values, so the first splicing result at position A is X1+X2. Pixel values at all other positions are spliced in the same way as at position A, which is not repeated here.
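In code, this splicing is just an element-wise sum of two equally shaped tensors, i.e. a residual connection (the variable names are ours):

```python
# feature_map and first_inverse_transform_result are both 64x49x128 tensors;
# at every position A the result holds X1 + X2.
first_splicing_result = feature_map + first_inverse_transform_result
```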
  • the multi-layer perceptron 2014 of the sub-network 201 is used to perform feature extraction on the first splicing result to obtain the first feature.
  • the above-mentioned multi-layer perceptron 2014 is a neural network composed of fully connected layers containing at least one hidden layer, and the output of each hidden layer is transformed by an activation function.
  • the number of layers of the multilayer perceptron 2014 and the number of hidden units in each hidden layer are hyperparameters.
  • for example, the first splicing result is a spliced image of 64×49×128, and the multi-layer perceptron 2014 performs feature extraction on this spliced image to obtain a 64×49×128 feature map (the first feature).
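A minimal version of such a multi-layer perceptron might look as follows; the hidden width of 512 and the GELU activation are our assumptions, since the text only fixes the 128-channel input and output:

```python
import torch
from torch import nn

mlp = nn.Sequential(              # one hidden layer; depth and width are hyperparameters
    nn.Linear(128, 512),
    nn.GELU(),                    # activation applied to the hidden layer's output
    nn.Linear(512, 128),
)
first_feature = mlp(torch.randn(1, 64, 49, 128))  # (1, 64, 49, 128) feature map
```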
  • the neural network model 103 (i.e., the new Transformer neural network model shown in Figure 3) performs image classification processing on the first feature. The neural network model 103 includes a sub-network 201 and three classification networks: classification network 202, classification network 203 and classification network 204.
  • the above-mentioned classification network 202 includes a block merging module 2021, a first normalization layer 2022, a Fourier layer 2023, a second normalization layer 2024 and a multi-layer perceptron 2025. The block merging module 2021 merges the data input to the classification network 202; the first normalization layer 2022 normalizes the output of the block merging module 2021; the Fourier layer 2023 performs inverse Fourier transform processing on the output of the first normalization layer 2022; the second normalization layer 2024 normalizes the splicing result of the outputs of the block merging module 2021 and the Fourier layer 2023; and the multi-layer perceptron 2025 performs feature extraction on the output of the second normalization layer 2024.
  • the block merging module 2021 in the above classification network 202 performs merging, channel expansion and matrix transformation on the first feature output by the sub-network 201. For example, the first feature output by the sub-network 201 is a 64×49×128 feature map; the block merging module 2021 merges the image blocks in the first feature in pairs to obtain a 28×28×128 image, expands it to 256 channels to obtain a 28×28×256 image, and then applies a matrix transformation to obtain a 16×49×256 image. Alternatively, the block merging module 2021 first expands the image blocks in the first feature to 256 channels to obtain a 64×49×256 image, merges them in pairs to obtain a 28×28×256 image, and then applies a matrix transformation to obtain a 16×49×256 image.
  • the above-mentioned first normalization layer 2022 normalizes the 16×49×256 image output by the block merging module 2021 to obtain a normalized 16×49×256 image; the Fourier layer 2023 performs inverse Fourier transform processing on the normalized 16×49×256 image to obtain a 16×49×256 image after the inverse Fourier transform; the second normalization layer 2024 normalizes the splicing result of the 16×49×256 image output by the block merging module 2021 and the 16×49×256 image after the inverse Fourier transform to obtain a second normalized 16×49×256 image; and the multi-layer perceptron 2025 performs feature extraction on the second normalized 16×49×256 image to obtain a 16×49×256 feature map (the second feature).
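One way to realize the block merging module so that it reproduces exactly these shapes (56×56×128 → 28×28×128 → 28×28×256 → 16×49×256) is sketched below; merging 2×2 neighbours by summation and the window regrouping are our reading of "pairwise merging" and "matrix transformation", not details fixed by the text:

```python
import torch
from torch import nn

def block_merging(x: torch.Tensor, expand: nn.Linear) -> torch.Tensor:
    b, h, w, c = x.shape                                        # x: (B, 56, 56, 128), the first feature
    x = x.reshape(b, h // 2, 2, w // 2, 2, c).sum(dim=(2, 4))   # pairwise 2x2 merge -> (B, 28, 28, 128)
    x = expand(x)                                               # 256-channel expansion -> (B, 28, 28, 256)
    g = x.shape[1] // 7                                         # matrix transformation: 4x4 grid of 7x7 windows
    x = x.reshape(b, g, 7, g, 7, x.shape[-1]).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, g * g, 49, x.shape[-1])                 # (B, 16, 49, 256)

out = block_merging(torch.randn(2, 56, 56, 128), nn.Linear(128, 256))  # -> (2, 16, 49, 256)
```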
  • the above-mentioned classification network 203 includes a block merging module 2031, a first normalization layer 2032, a Fourier layer 2033, a second normalization layer 2034 and a multi-layer perceptron 2035. The block merging module 2031 merges the data input to the classification network 203; the first normalization layer 2032 normalizes the output of the block merging module 2031; the Fourier layer 2033 performs inverse Fourier transform processing on the output of the first normalization layer 2032; the second normalization layer 2034 normalizes the splicing result of the outputs of the block merging module 2031 and the Fourier layer 2033; and the multi-layer perceptron 2035 performs feature extraction on the output of the second normalization layer 2034.
  • the block merging module 2031 in the above-mentioned classification network 203 performs merging, channel expansion and matrix transformation on the second feature output by the classification network 202. For example, the second feature output by the classification network 202 is a 16×49×256 feature map; the block merging module 2031 merges the image blocks in the second feature in pairs to obtain a 14×14×256 image, expands it to 512 channels to obtain a 14×14×512 image, and then applies a matrix transformation to obtain a 4×49×512 image (14×14 = 4×49). Alternatively, the block merging module 2031 first expands the image blocks in the second feature to 512 channels to obtain a 16×49×512 image, merges them in pairs to obtain a 14×14×512 image, and then applies a matrix transformation to obtain a 4×49×512 image.
  • the above-mentioned first normalization layer 2032 normalizes the 4×49×512 image output by the block merging module 2031 to obtain a normalized 4×49×512 image; the Fourier layer 2033 performs inverse Fourier transform processing on the normalized 4×49×512 image to obtain a 4×49×512 image after the inverse Fourier transform; the second normalization layer 2034 normalizes the splicing result of the 4×49×512 image output by the block merging module 2031 and the 4×49×512 image after the inverse Fourier transform to obtain a second normalized 4×49×512 image; and the multi-layer perceptron 2035 performs feature extraction on the second normalized 4×49×512 image to obtain a 4×49×512 feature map (the third feature).
  • the above-mentioned classification network 204 includes a block merging module 2041, a first normalization layer 2042, a Fourier layer 2043, a second normalization layer 2044 and a multi-layer perceptron 2045. The block merging module 2041 merges the data input to the classification network 204; the first normalization layer 2042 normalizes the output of the block merging module 2041; the Fourier layer 2043 performs inverse Fourier transform processing on the output of the first normalization layer 2042; the second normalization layer 2044 normalizes the splicing result of the outputs of the block merging module 2041 and the Fourier layer 2043; and the multi-layer perceptron 2045 performs feature extraction on the output of the second normalization layer 2044.
  • the block merging module 2041 in the above-mentioned classification network 204 performs merging, channel expansion and matrix transformation on the third feature output by the classification network 203. For example, the third feature output by the classification network 203 is a 4×49×512 feature map; the block merging module 2041 merges the image blocks in the third feature in pairs to obtain a 7×7×512 image, expands it to 1024 channels to obtain a 7×7×1024 image, and then applies a matrix transformation to obtain a 1×49×1024 image. Alternatively, the block merging module 2041 first expands the image blocks in the third feature to 1024 channels to obtain a 4×49×1024 image, merges them in pairs to obtain a 7×7×1024 image, and then applies a matrix transformation to obtain a 1×49×1024 image.
  • the above-mentioned first normalization layer 2042 normalizes the 1×49×1024 image output by the block merging module 2041 to obtain a normalized 1×49×1024 image; the Fourier layer 2043 performs inverse Fourier transform processing on the normalized 1×49×1024 image to obtain a 1×49×1024 image after the inverse Fourier transform; the second normalization layer 2044 normalizes the splicing result of the 1×49×1024 image output by the block merging module 2041 and the 1×49×1024 image after the inverse Fourier transform to obtain a second normalized 1×49×1024 image; and the multi-layer perceptron 2045 performs feature extraction on the second normalized 1×49×1024 image to obtain a 1×49×1024 feature map (the fourth feature).
  • this application replaces the complex self-attention layer in the classification network 202, the classification network 203 and the classification network 204 with a Fourier layer, which not only reduces the complexity of these classification networks but also saves computing time and memory.
  • the output result of the at least one classification network is processed through a Softmax function to generate at least one probability value, and the at least one probability value is used to indicate the probability that the target image belongs to at least one image category.
  • the Softmax function processes the result of at least one classification network output in the neural network model 103 to obtain at least one probability value.
  • for example, the Softmax function processes the fourth feature map (i.e., the 1×49×1024 feature map) output by the neural network model 103 and outputs the probability that it belongs to each image category; the category with the highest probability is determined as the category of the fourth feature map, i.e., the image classification result. Suppose there are 3 different picture categories: animals, people and landscapes. If the target picture is natural scenery, the target picture passes through the neural network model 103 to produce the fourth feature map, and the Softmax function then outputs the probabilities of the fourth feature map for the above three categories, e.g., a probability of 20% for animal, 40% for person and 90% for landscape. Since the probabilities satisfy landscape > person > animal, the fourth feature map is determined to be of the landscape class (i.e., the image classification result of the target image in Figure 3).
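In code, this final step is a plain softmax followed by an argmax; note that a true softmax output sums to 1, so the percentages above are best read as illustrative relative scores (the snippet uses made-up logits):

```python
import torch

logits = torch.tensor([[0.2, 1.1, 2.6]])          # hypothetical scores for animal / person / landscape
probs = torch.softmax(logits, dim=-1)             # probabilities over the three categories, summing to 1
category = ["animal", "person", "landscape"][probs.argmax(dim=-1).item()]
print(category, probs)                            # -> "landscape", since it has the highest probability
```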
  • the neural network model 103 is trained on the first 50 categories of the ImageNet-1K data set, wherein the size of the training set is 64817 pictures, the size of the verification set is 2500 pictures, and there are 50 categories in total.
  • the network prediction result (i.e., the output result of the neural network model 103) is input to the Softmax function and converted into probability values for the different categories; the formula is the cross-entropy between the two distributions,

    H(p, q) = -\sum_{i} p(x_i) \log q(x_i)

    where p(x_i) is the real probability distribution and q(x_i) is the predicted probability distribution.
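In PyTorch, this combination of Softmax and cross-entropy against the real distribution is available as a single call (a generic training snippet with made-up tensors, not the patent's code):

```python
import torch

logits = torch.randn(4, 50)            # network predictions for a batch, 50 categories
labels = torch.randint(0, 50, (4,))    # ground-truth category indices (the real distribution p, one-hot)
# cross_entropy applies Softmax to the logits internally and evaluates
# H(p, q) = -sum_i p(x_i) * log(q(x_i)) with q the predicted distribution.
loss = torch.nn.functional.cross_entropy(logits, labels)
```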
  • Table 1 shows the image classification results when the neural network model 103 and the Swin-B method classify the top 50 categories of the ImageNet-1K data set, where: Methods is the name of the method; ImageSize is the size of the input image during training; Param is the number of training parameters; Throughput (image/s) is the number of pictures processed per second; ImageNet is the accuracy of image classification on the ImageNet dataset, i.e., the rate at which the picture category with the highest probability matches the actual picture category; FLOPs (floating-point operations) measures the computing power required by the network model and hence its complexity; Step is the time required for each training step with a batch size of 64; and Epoch is the time required for each training epoch with a batch size of 64.
  • the times were obtained by testing on 4 GPUs (GeForce GTX 1080). Compared with the Swin-B method, the neural network model 103 of the present application reduces the number of parameters by 32%, the required floating-point operations by 33.7% and the time by 32.6%, while the accuracy reaches 90%.
  • in summary, this application replaces the complex self-attention layer in the existing Transformer neural network model with an inverse Fourier transform, forming a new Transformer neural network model; the new model only needs to perform one inverse Fourier transform on the feature map in the preset format, without multiple convolution passes, to extract the features of the target image and determine its image classification result; compared with the existing Transformer neural network model, it reduces both calculation time and memory usage during feature extraction and image classification.
  • this application also replaces the complex self-attention layer in the above-mentioned classification networks with a Fourier layer, forming new classification networks that perform image classification on the target image; compared with existing classification networks, the new classification networks reduce both calculation time and memory usage when classifying the target image.
  • FIG. 4 shows a schematic structural diagram of an image classification device provided by the present application.
  • the dotted line in Figure 4 indicates that the unit or the module is optional.
  • the device 400 may be used to implement the methods described in the foregoing method embodiments.
  • the apparatus 400 may be a terminal device or a server or a chip.
  • the device 400 includes one or more processors 401, which enable the device 400 to implement the method in the method embodiment corresponding to FIG. 2 .
  • the processor 401 may be a general purpose processor or a special purpose processor.
  • the processor 401 may be a central processing unit (central processing unit, CPU).
  • the CPU can be used to control the device 400, execute software programs, and process data of the software programs.
  • the device 400 may further include a communication unit 405, configured to implement signal input (reception) and output (transmission).
  • the apparatus 400 may be a chip, and the communication unit 405 may be an input and/or output circuit of the chip, or the communication unit 405 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.
  • the apparatus 400 may be a terminal device, and the communication unit 405 may be a transceiver of the terminal device, or the communication unit 405 may be a transceiver circuit of the terminal device.
  • the device 400 may include one or more memories 402, on which there is a program 404, which can be run by the processor 401 to generate instructions 403, so that the processor 401 executes the methods described in the above method embodiments according to the instructions 403.
  • data (such as the ID of the chip to be tested) may also be stored in the memory 402 .
  • the processor 401 may also read data stored in the memory 402 , the data may be stored in the same storage address as the program 404 , and the data may also be stored in a different storage address from the program 404 .
  • the processor 401 and the memory 402 may be set separately, or may be integrated together, for example, integrated on a system-on-chip (system on chip, SOC) of a terminal device.
  • the steps in the foregoing method embodiments may be implemented by logic circuits in the form of hardware or instructions in the form of software in the processor 401 .
  • the processor 401 may be a CPU, a digital signal processor (digital signal processor, DSP), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, such as discrete gates, transistor logic devices or discrete hardware components.
  • the present application also provides a computer program product, which implements the method described in any method embodiment in the present application when the computer program product is executed by the processor 401 .
  • the computer program product can be stored in the memory 402 , such as a program 404 , and the program 404 is finally converted into an executable target file that can be executed by the processor 401 through processes such as preprocessing, compiling, assembling and linking.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the method described in any method embodiment in the present application is implemented.
  • the computer program may be a high-level language program or an executable object program.
  • the computer readable storage medium is, for example, the memory 402 .
  • the memory 402 may be a volatile memory or a nonvolatile memory, or, the memory 402 may include both a volatile memory and a nonvolatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • by way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DRRAM).
  • the disclosed systems, devices and methods may be implemented in other ways. For example, some features of the method embodiments described above may be omitted, or not implemented.
  • the device embodiments described above are only illustrative, and the division of units is only a logical function division. In actual implementation, there may be other division methods, and multiple units or components may be combined or integrated into another system.
  • the coupling between the various units or the coupling between the various components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical or other forms of connection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An image classification method, comprising: acquiring a target image (S201); pre-processing the target image to generate a feature map of a preset format (S202); performing an inverse Fourier transform on the feature map of the preset format to generate a first inverse transform result (S203); splicing the feature map of the preset format and the first inverse transform result to generate a first splicing result (S204); performing feature extraction on the first splicing result to generate a first feature (S205); and determining an image classification result according to the first feature (S206). The problems of long calculation time and large memory occupation of an existing Transformer neural network model can be solved.

Description

A Method and Device for Image Classification

Technical Field

The present application relates to the field of image processing, and in particular to an image classification method and device.

Background Art

Transformer is a deep neural network based on a self-attention mechanism. It is used not only in the field of natural language processing but also in the field of image processing, for example to convert two-dimensional image data into one-dimensional sequences and to extract multi-scale features from two-dimensional images. However, the Transformer neural network model is extremely complex, resulting in a large memory footprint and a long training time.

Therefore, how to reduce the calculation time and memory usage of the existing Transformer neural network model is an urgent problem to be solved.
Technical Problem

One of the purposes of the embodiments of the present application is to provide an image classification method and device, aiming to solve the problems of long calculation time and large memory usage of the existing Transformer neural network model.

Technical Solution

The technical scheme adopted by the embodiments of the present application is as follows:
In a first aspect, an image classification method is provided, including: acquiring a target image; preprocessing the target image to generate a feature map in a preset format; performing an inverse Fourier transform on the feature map in the preset format to generate a first inverse transform result; splicing the feature map in the preset format and the first inverse transform result to generate a first splicing result; performing feature extraction on the first splicing result to generate a first feature; and determining an image classification result according to the first feature.

The above method can be executed by a chip on an electronic device. Compared with the existing Transformer neural network model, in which a complex self-attention layer performs multiple convolution passes on the feature map in the preset format, this application replaces that complex self-attention layer with an inverse Fourier transform, forming a new Transformer neural network model; the new model only needs to perform one inverse Fourier transform on the feature map in the preset format, without multiple convolution passes, to extract the features of the target image and determine its image classification result; consequently, it reduces both calculation time and memory usage during feature extraction and image classification.
Optionally, determining the image classification result according to the first feature includes: splicing the first feature and the first inverse transform result to generate a second splicing result; and classifying the second splicing result through at least one classification network, where any one of the at least one classification network includes a block merging module, a first normalization layer, a Fourier layer, a second normalization layer and a multi-layer perceptron; the block merging module merges the data input to the classification network, the first normalization layer normalizes the output of the block merging module, the Fourier layer performs inverse Fourier transform processing on the output of the first normalization layer, the second normalization layer normalizes the splicing result of the outputs of the block merging module and the Fourier layer, and the multi-layer perceptron performs feature extraction on the output of the second normalization layer.

Replacing the complex self-attention layer in the above classification network with the Fourier layer forms a new classification network, which performs image classification on the target image; compared with the existing classification network, the new classification network reduces both calculation time and memory usage when classifying the target image.
Optionally, the output result of the at least one classification network is processed through a softmax function to generate at least one probability value, and the at least one probability value is used to indicate the probability that the target image belongs to at least one image category.

Optionally, the inverse Fourier transform is an inverse fast Fourier transform. Replacing the complex self-attention layer with the inverse fast Fourier transform can speed up the operation of the new Transformer neural network model and the classification network.
Optionally, the formula of the inverse Fourier transform is:

x_n = \frac{1}{N} \sum_{k=0}^{N-1} x_k e^{i 2\pi kn/N}

where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
Optionally, preprocessing the target image includes: performing block segmentation processing on the target image to generate a block segmentation result; and processing the block segmentation result through a linear embedding layer to generate the feature map in the preset format.

In a second aspect, an image classification device is provided, including modules for performing any one of the methods of the first aspect.

In a third aspect, a computer-readable storage medium is provided, which stores a computer program; when the computer program is executed by a processor, the processor performs any one of the methods of the first aspect.
Description of Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings used in the embodiments or the exemplary descriptions are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of the image classification system in an embodiment of the present application;

Fig. 2 is a schematic flow diagram of the image classification method in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of the new Transformer neural network model in an embodiment of the present application;

Fig. 4 is a schematic diagram of the image classification device in an embodiment of the present application.
Embodiments of This Application

In the following description, specific details such as particular system structures and technologies are presented for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices and methods are omitted so that unnecessary detail does not obscure the description of the present application.

It should be understood that when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

It should also be understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

In addition, in the description of this specification and the appended claims, the terms "first", "second", "third" and so on are only used to distinguish the descriptions and should not be understood as indicating or implying relative importance.

References to "one embodiment" or "some embodiments" and the like in this specification mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Accordingly, the phrases "in one embodiment", "in some embodiments", "in other embodiments", etc., appearing in various places in this specification do not necessarily all refer to the same embodiment; they mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "including", "comprising", "having" and their variations mean "including but not limited to", unless specifically stated otherwise.
With the rapid development of image processing technology, the Transformer neural network model has been applied in the field of image processing. However, the complex attention layer in the existing Transformer neural network model results in long calculation time and large memory usage. To solve these problems, this application replaces the complex self-attention layer in the existing Transformer neural network model with an inverse Fourier transform layer, forming a new Transformer neural network model that avoids the long calculation time and large memory occupation of the existing model.

The present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows an image classification system provided by the present application. The system comprises an input unit 101, a preprocessing unit 102, a neural network model 103 and an output unit 104. The input unit 101 receives the target image input by the user and sends it to the preprocessing unit 102; the preprocessing unit 102 performs image segmentation and matrix transformation preprocessing on the target image and sends a feature map in a preset format to the neural network model 103; the neural network model 103 performs feature extraction and image classification on the feature map and sends the image classification result to the output unit 104; the output unit 104 receives the image classification result from the neural network model 103 and outputs it to the client so that the user can view it.
To reduce the calculation time and memory usage of the existing Transformer neural network model, the present application proposes an image classification method, as shown in Fig. 2. The method can be executed by a chip in an electronic device and includes:

S201. Acquire a target image.

Exemplarily, the electronic device may acquire the target image to be classified, input by the user, through the input unit 101 shown in Fig. 1. The target image may be an image of any size, for example 224×224; this application does not limit the size of the target image.
S202. Preprocess the target image to generate a feature map in a preset format.

Exemplarily, the electronic device preprocesses the target image through the preprocessing unit 102 to generate the feature map in the preset format, which is the output of the preprocessing unit 102 and is used for the inverse Fourier transform. The preprocessing unit 102 includes a block segmentation module and a linear embedding layer: the block segmentation module performs block segmentation on the target image, and the linear embedding layer performs matrix transformation on the block segmentation result. For example, the user inputs a 3-channel (RGB) target image of size 224×224 through the input unit 101; the block segmentation module divides it into 56×56 blocks, each a 3-channel image of size 4×4, so the block segmentation result is a 56×56×3 image. The 56×56×3 image is input to the linear embedding layer for channel expansion and matrix transformation to obtain a 64×49×128 image (the feature map in the preset format), where 128 is the number of channels: the linear embedding layer either expands the 56×56×3 image to 128 channels to obtain a 56×56×128 image and then applies a matrix transformation to obtain a 64×49×128 image, or first applies a matrix transformation to the 56×56×3 image to obtain a 64×49×3 image and then expands it to 128 channels to obtain a 64×49×128 image.

Exemplarily, the electronic device sends the feature map in the preset format output by the preprocessing unit 102 to the neural network model 103, which is the new Transformer neural network model and includes a sub-network 201 and at least one classification network. For example, the new Transformer neural network model shown in Fig. 3 includes one sub-network 201 and three classification networks: classification network 202, classification network 203 and classification network 204. The sub-network 201 includes a normalization layer 2011, a Fourier layer 2012, a normalization layer 2013 and a multi-layer perceptron 2014. The normalization layer 2011 normalizes the feature map in the preset format sent by the preprocessing unit 102; the Fourier layer 2012 performs an inverse Fourier transform on the output of the normalization layer 2011; the normalization layer 2013 normalizes the output of the Fourier layer 2012; and the multi-layer perceptron 2014 performs feature extraction on the output of the normalization layer 2013. For example, the electronic device sends the feature map in the preset format output by the preprocessing unit 102 to the normalization layer 2011 of the sub-network 201 for normalization, generating a first normalization result.
S203,对预设格式的特征图进行傅里叶逆变换,生成第一逆变换结果。S203. Perform an inverse Fourier transform on the feature map in a preset format to generate a first inverse transform result.
Exemplarily, the normalization layer 2011 in the sub-network 201 first normalizes the feature map in the preset format and then sends the first normalization result to the Fourier layer 2012. The Fourier layer 2012 receives the first normalization result from the normalization layer 2011 and performs an inverse Fourier transform on it to obtain the first inverse transform result. The inverse Fourier transform is given by:
$$x_n = \frac{1}{N}\sum_{k=0}^{N-1} x_k \, e^{i 2\pi k n / N}$$
where x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point. The Fourier layer 2012 in the new Transformer neural network model replaces the complex self-attention layer of the existing Transformer neural network model. Compared with the existing model, the new Transformer neural network model can extract image features by performing a single inverse Fourier transform on the target image, without performing multiple complex convolution operations to extract features and classify the target image; as a result, both the computation time and the memory footprint of feature extraction are reduced.
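As a hedged illustration of how such a Fourier layer might be realized (a sketch only: the transform axis, the discarding of the imaginary part, and the reliance on torch.fft are assumptions, since the publication does not specify these details):

import torch
import torch.nn as nn

class FourierLayer(nn.Module):
    # Replaces self-attention with a parameter-free inverse discrete Fourier
    # transform that mixes information across tokens.
    def forward(self, x):                  # x: (B, windows, tokens, channels)
        # torch.fft.ifft applies the 1/N-scaled inverse DFT of the formula
        # above, computed via the fast Fourier transform (see the optional
        # acceleration mentioned next). The real part is kept so that
        # subsequent layers operate on real-valued features.
        return torch.fft.ifft(x, dim=-2).real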
Optionally, the Fourier layer 2012 may instead perform an inverse fast Fourier transform on the first normalization result to obtain the first inverse transform result. Performing an inverse fast Fourier transform on the first normalization result accelerates the new Transformer neural network model.
S204. Concatenate the feature map in the preset format with the first inverse transform result to generate a first concatenation result.
Exemplarily, the sub-network 201 concatenates the feature map in the preset format with the first inverse transform result output by the Fourier layer 2012 to obtain the first concatenation result (the concatenated image). Here, the concatenation operation means summing the feature map in the preset format and the first inverse transform result pixel by pixel. For example, if the pixel at position A of the 64×49×128 image output by the preprocessing unit 102 has value X1, and the pixel at position A of the 64×49×128 first inverse transform result has value X2, concatenating the two means adding these values (X1+X2), and the sum is the value at position A of the first concatenation result. Pixels at all other positions of the feature map in the preset format and the first inverse transform result are concatenated in the same way as the pixel at position A, and the details are not repeated here.
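In code, this concatenation is simply an element-wise residual addition. A one-line sketch, reusing the FourierLayer sketch above (norm1 stands for a hypothetical normalization module corresponding to layer 2011):

# x: feature map in the preset format, shape (B, 64, 49, 128)
first_concat = x + FourierLayer()(norm1(x))   # pixel-wise sum X1 + X2 at every position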
S205. Perform feature extraction on the first concatenation result to generate a first feature.
Exemplarily, the multilayer perceptron 2014 of the sub-network 201 performs feature extraction on the first concatenation result to obtain the first feature. The multilayer perceptron 2014 is a neural network composed of fully connected layers containing at least one hidden layer, where the output of each hidden layer is transformed by an activation function. The number of layers of the multilayer perceptron 2014 and the number of hidden units in each hidden layer are hyperparameters. For example, if the first concatenation result is a 64×49×128 concatenated image, the multilayer perceptron 2014 performs feature extraction on it to obtain a 64×49×128 feature map (the first feature).
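A minimal sketch of such a multilayer perceptron (the single hidden layer, its width, and the GELU activation are assumptions; the publication only states that the layer count and hidden-unit counts are hyperparameters):

import torch.nn as nn

class MLP(nn.Module):
    # Fully connected layers applied per token; one hidden layer whose
    # output is transformed by an activation function.
    def __init__(self, dim=128, hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):                  # x: (B, 64, 49, 128) -> same shape
        return self.fc2(self.act(self.fc1(x)))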
S206. Determine an image classification result according to the first feature.
Exemplarily, at least one classification network in the neural network model 103 performs image classification on the first feature. Taking the neural network model 103 shown in Figure 3 (the new Transformer neural network model) as an example, the model includes one sub-network 201 and three classification networks: classification network 202, classification network 203, and classification network 204. The classification network 202 includes a block merging module 2021, a first normalization layer 2022, a Fourier layer 2023, a second normalization layer 2024, and a multilayer perceptron 2025. The block merging module 2021 performs merging, channel expansion, and matrix transformation on the data input to the classification network 202; the first normalization layer 2022 normalizes the output of the block merging module 2021; the Fourier layer 2023 performs an inverse Fourier transform on the output of the first normalization layer 2022; the second normalization layer 2024 normalizes the concatenation of the outputs of the block merging module 2021 and the Fourier layer 2023; and the multilayer perceptron 2025 performs feature extraction on the output of the second normalization layer 2024.
For example, the block merging module 2021 in the classification network 202 performs merging, channel expansion, and matrix transformation on the first feature output by the sub-network 201. If the first feature is a 64×49×128 feature map, the block merging module 2021 either merges the image blocks of the first feature pairwise to obtain a 28×28×128 image, expands it to 256 channels to obtain a 28×28×256 image, and applies a matrix transformation to obtain a 16×49×256 image; or expands the first feature to 256 channels to obtain a 64×49×256 image, merges the blocks pairwise to obtain a 28×28×256 image, and applies a matrix transformation to obtain a 16×49×256 image. The first normalization layer 2022 normalizes the 16×49×256 image output by the block merging module 2021; the Fourier layer 2023 performs an inverse Fourier transform on the normalized 16×49×256 image; the second normalization layer 2024 normalizes the concatenation of the 16×49×256 image output by the block merging module 2021 and the inverse-transformed 16×49×256 image to obtain a second normalized 16×49×256 image; and the multilayer perceptron 2025 performs feature extraction on the second normalized image to obtain a 16×49×256 feature map (the second feature).
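Putting the modules of one classification network together, a hedged sketch might look as follows. The linear projection used for block merging and the merge_blocks helper (which would regroup 2×2 neighboring blocks and apply the projection) are hypothetical; the FourierLayer and MLP sketches above are reused.

import torch.nn as nn

class ClassificationStage(nn.Module):
    # Block merging -> first normalization -> Fourier layer -> pixel-wise
    # sum -> second normalization -> MLP, mirroring modules 2021-2025.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(4 * in_dim, out_dim)   # channel expansion after 2x2 merging
        self.norm1 = nn.LayerNorm(out_dim)
        self.fourier = FourierLayer()
        self.norm2 = nn.LayerNorm(out_dim)
        self.mlp = MLP(out_dim, 4 * out_dim)

    def forward(self, x):                  # x: (B, windows, tokens, in_dim)
        x = merge_blocks(x, self.proj)     # hypothetical helper: pairwise block merging
        y = self.norm2(x + self.fourier(self.norm1(x)))
        return self.mlp(y)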
The classification network 203 includes a block merging module 2031, a first normalization layer 2032, a Fourier layer 2033, a second normalization layer 2034, and a multilayer perceptron 2035. The block merging module 2031 performs merging, channel expansion, and matrix transformation on the data input to the classification network 203; the first normalization layer 2032 normalizes the output of the block merging module 2031; the Fourier layer 2033 performs an inverse Fourier transform on the output of the first normalization layer 2032; the second normalization layer 2034 normalizes the concatenation of the outputs of the block merging module 2031 and the Fourier layer 2033; and the multilayer perceptron 2035 performs feature extraction on the output of the second normalization layer 2034.
For example, the block merging module 2031 in the classification network 203 performs merging, channel expansion, and matrix transformation on the second feature output by the classification network 202. If the second feature is a 16×49×256 feature map, the block merging module 2031 either merges the image blocks of the second feature pairwise to obtain a 14×14×256 image, expands it to 512 channels to obtain a 14×14×512 image, and applies a matrix transformation to obtain a 2×49×512 image; or expands the second feature to 512 channels to obtain a 16×49×512 image, merges the blocks pairwise to obtain a 14×14×512 image, and applies a matrix transformation to obtain a 2×49×512 image. The first normalization layer 2032 normalizes the 2×49×512 image output by the block merging module 2031; the Fourier layer 2033 performs an inverse Fourier transform on the normalized 2×49×512 image; the second normalization layer 2034 normalizes the concatenation of the 2×49×512 image output by the block merging module 2031 and the inverse-transformed 2×49×512 image to obtain a second normalized 2×49×512 image; and the multilayer perceptron 2035 performs feature extraction on the second normalized image to obtain a 2×49×512 feature map (the third feature).
The classification network 204 includes a block merging module 2041, a first normalization layer 2042, a Fourier layer 2043, a second normalization layer 2044, and a multilayer perceptron 2045. The block merging module 2041 performs merging, channel expansion, and matrix transformation on the data input to the classification network 204; the first normalization layer 2042 normalizes the output of the block merging module 2041; the Fourier layer 2043 performs an inverse Fourier transform on the output of the first normalization layer 2042; the second normalization layer 2044 normalizes the concatenation of the outputs of the block merging module 2041 and the Fourier layer 2043; and the multilayer perceptron 2045 performs feature extraction on the output of the second normalization layer 2044.
For example, the block merging module 2041 in the classification network 204 performs merging and channel expansion on the third feature output by the classification network 203. If the third feature is a 2×49×512 feature map, the block merging module 2041 either merges the image blocks of the third feature pairwise to obtain a 7×7×512 image, expands it to 1024 channels to obtain a 7×7×1024 image, and applies a matrix transformation to obtain a 1×49×1024 image; or expands the third feature to 1024 channels to obtain a 2×49×1024 image, merges the blocks pairwise to obtain a 7×7×1024 image, and applies a matrix transformation to obtain a 1×49×1024 image. The first normalization layer 2042 normalizes the 1×49×1024 image output by the block merging module 2041; the Fourier layer 2043 performs an inverse Fourier transform on the normalized 1×49×1024 image; the second normalization layer 2044 normalizes the concatenation of the 1×49×1024 image output by the block merging module 2041 and the inverse-transformed 1×49×1024 image to obtain a second normalized 1×49×1024 image; and the multilayer perceptron 2045 performs feature extraction on the second normalized image to obtain a 1×49×1024 feature map (the fourth feature). It can thus be seen that, by replacing the complex self-attention layers in the classification networks 202, 203, and 204 with Fourier layers, the present application not only reduces the complexity of these classification networks but also saves computation time and memory.
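Under the dimensions used in this example, the three stages could be chained as follows (an illustration only; ClassificationStage is the sketch given above):

import torch.nn as nn

stages = nn.Sequential(
    ClassificationStage(128, 256),   # 64x49x128 -> 16x49x256 (classification network 202)
    ClassificationStage(256, 512),   # 16x49x256 -> 2x49x512  (classification network 203)
    ClassificationStage(512, 1024),  # 2x49x512  -> 1x49x1024 (classification network 204)
)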
Exemplarily, the output of the at least one classification network is processed by a Softmax function to generate at least one probability value, the at least one probability value indicating the probability that the target image belongs to at least one image category. The Softmax function processes the output of at least one classification network in the neural network model 103 to obtain at least one probability value. For example, the Softmax function processes the fourth feature map (the 1×49×1024 feature map) output by the neural network model 103 shown in Figure 3 to obtain the probabilities that the fourth feature map corresponds to different picture categories (at least one probability value); these probabilities are sorted from high to low, and the category with the highest probability is determined as the picture category of the fourth feature map. For example, suppose there are three picture categories: animals, people, and landscapes. If the target picture is a natural landscape, the target picture passes through the neural network model 103 to produce the fourth feature map, and the Softmax function then outputs the probabilities of the fourth feature map for the three categories. If the probability of the animal category is 20%, the probability of the people category is 40%, and the probability of the landscape category is 90%, the probabilities are ranked from high to low (landscape > people > animal), and the fourth feature map is finally determined to belong to the landscape category (the image classification result of the target image in Figure 3).
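As a sketch of this step (logits stands for the flattened output of the last classification network, a name assumed here for illustration):

import torch

probs = torch.softmax(logits, dim=-1)   # one probability per image category
pred = probs.argmax(dim=-1)             # category with the highest probability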
Exemplarily, the neural network model 103 is trained on the first 50 classes of the ImageNet-1K dataset, with a training set of 64,817 pictures and a validation set of 2,500 pictures across the 50 classes. During training, the network prediction (the output of the neural network model 103) is fed to the Softmax function and converted into probability values for the different categories, using the formula:
$$\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
The output of the Softmax function is then used to compute the loss:
$$L = -\sum_{i} p(x_i)\,\log q(x_i)$$
where p(x_i) is the true probability distribution and q(x_i) is the predicted probability distribution. Training uses the Adam optimizer and a cosine-decay learning-rate scheduler for 300 epochs (the number of passes over the training dataset), with the pictures fed into the neural network model 103 in groups of 64 (batch size set to 64); one pass is complete once all 64,817 pictures have been trained on.
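A hedged sketch of this training configuration (NewTransformer and train_set are hypothetical placeholders for the model 103 and the 64,817-image training set, and the learning rate is an assumption; the optimizer, scheduler, epoch count, and batch size follow the text):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = NewTransformer(num_classes=50)           # hypothetical constructor for model 103
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
criterion = nn.CrossEntropyLoss()                # combines the Softmax and loss formulas above
loader = DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(300):                         # 300 passes over the training set
    for images, labels in loader:                # groups of 64 pictures per step
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                             # cosine learning-rate decay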
After the neural network model 103 is trained, its performance is tested on the validation set. Table 1 shows the image classification results of the neural network model 103 and the Swin-B method on the first 50 classes of the ImageNet-1K dataset, where Methods is the method name; ImageSize is the input image size during training; Param is the number of training parameters; Throughput (image/s) is the throughput (pictures processed per second); ImageNet is the classification accuracy on the ImageNet dataset, i.e., the rate at which the top-ranked predicted category matches the actual category; FLOPs (floating-point operations) measures the computing power required by the network model, i.e., its complexity; Step is the time per training step at a batch size of 64; and Epoch is the time per training epoch at a batch size of 64, measured on 4 GPUs (GeForce GTX 1080). Compared with the Swin-B method, the neural network model 103 of the present application has 32% fewer parameters, requires 33.7% fewer floating-point operations, takes 32.6% less time, and reaches an accuracy of 90%.
Table 1. Comparison with other methods on the first 50 classes of the ImageNet-1K dataset
[Table 1 appears as an image (PCTCN2021119682-appb-000005) in the original publication and is not reproduced here; it lists Methods, ImageSize, Param, Throughput (image/s), ImageNet accuracy, FLOPs, Step, and Epoch for the neural network model 103 and Swin-B.]
Compared with the existing Transformer neural network model, which uses a complex self-attention layer to perform multiple convolution operations on the feature map in the preset format, the present application replaces that complex self-attention layer with an inverse Fourier transform, forming a new Transformer neural network model. The new Transformer neural network model only needs to perform a single inverse Fourier transform on the feature map in the preset format, rather than multiple convolution operations, to extract the features of the target image and determine its image classification result. Compared with the existing Transformer neural network model, the new model therefore reduces both computation time and memory usage during feature extraction and image classification of the target image.
In addition, the present application also replaces the complex self-attention layer in the above classification networks with a Fourier layer, forming new classification networks that classify the target image. Compared with the existing classification networks, the new classification networks reduce both computation time and memory usage when classifying the target image.
Figure 4 shows a schematic structural diagram of an image classification apparatus provided by the present application. The dotted lines in Figure 4 indicate that the corresponding unit or module is optional. The apparatus 400 may be used to implement the methods described in the above method embodiments. The apparatus 400 may be a terminal device, a server, or a chip.
The apparatus 400 includes one or more processors 401, which can support the apparatus 400 in implementing the method of the method embodiment corresponding to Figure 2. The processor 401 may be a general-purpose processor or a special-purpose processor. For example, the processor 401 may be a central processing unit (CPU). The CPU may be used to control the apparatus 400, execute software programs, and process data of the software programs. The apparatus 400 may further include a communication unit 405 for implementing signal input (reception) and output (transmission).
For example, the apparatus 400 may be a chip, and the communication unit 405 may be an input and/or output circuit of the chip, or the communication unit 405 may be a communication interface of the chip, and the chip may serve as a component of a terminal device.
For another example, the apparatus 400 may be a terminal device, and the communication unit 405 may be a transceiver of the terminal device, or the communication unit 405 may be a transceiver circuit of the terminal device.
The apparatus 400 may include one or more memories 402 storing a program 404, which can be run by the processor 401 to generate instructions 403 so that the processor 401 performs the methods described in the above method embodiments according to the instructions 403. Optionally, data (such as the ID of a chip to be tested) may also be stored in the memory 402. Optionally, the processor 401 may also read the data stored in the memory 402; the data may be stored at the same storage address as the program 404 or at a different storage address.
The processor 401 and the memory 402 may be provided separately or integrated together, for example, integrated on a system on chip (SOC) of a terminal device.
For the specific manner in which the processor 401 executes the method of starting the burn-in test, reference may be made to the relevant descriptions in the method embodiments.
It should be understood that the steps of the above method embodiments may be completed by logic circuits in the form of hardware or instructions in the form of software in the processor 401. The processor 401 may be a CPU, a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device, for example, discrete gates, transistor logic devices, or discrete hardware components.
The present application also provides a computer program product that, when executed by the processor 401, implements the method of any method embodiment of the present application.
The computer program product may be stored in the memory 402, for example, as the program 404, which is finally converted, through processing steps such as preprocessing, compiling, assembling, and linking, into an executable object file that can be executed by the processor 401.
The present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a computer, the method of any method embodiment of the present application is implemented. The computer program may be a high-level language program or an executable object program.
The computer-readable storage medium is, for example, the memory 402. The memory 402 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DRRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes and technical effects of the apparatus and device described above may refer to the corresponding processes and technical effects in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, some features of the method embodiments described above may be ignored or not performed. The apparatus embodiments described above are merely illustrative; the division of units is only a logical functional division, and there may be other division manners in actual implementation, and multiple units or components may be combined or integrated into another system. In addition, the coupling between units or between components may be direct or indirect coupling, and such coupling includes electrical, mechanical, or other forms of connection.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (8)

  1. A method for image classification, wherein the method comprises:
    obtaining a target image;
    preprocessing the target image to generate a feature map in a preset format;
    performing an inverse Fourier transform on the feature map in the preset format to generate a first inverse transform result;
    concatenating the feature map in the preset format with the first inverse transform result to generate a first concatenation result;
    performing feature extraction on the first concatenation result to generate a first feature;
    determining an image classification result according to the first feature.
  2. The method according to claim 1, wherein determining the image classification result according to the first feature comprises:
    concatenating the first feature with the first inverse transform result to generate a second concatenation result;
    classifying the second concatenation result through at least one classification network, wherein any one of the at least one classification network comprises a block merging module, a first normalization layer, a Fourier layer, a second normalization layer, and a multilayer perceptron; the block merging module is configured to merge the data input to the classification network; the first normalization layer is configured to normalize the output of the block merging module; the Fourier layer is configured to perform an inverse Fourier transform on the output of the first normalization layer; the second normalization layer is configured to normalize the concatenation of the outputs of the block merging module and the Fourier layer; and the multilayer perceptron is configured to perform feature extraction on the output of the second normalization layer.
  3. The method according to claim 2, further comprising:
    processing the output of the at least one classification network through a softmax function to generate at least one probability value, the at least one probability value indicating the probability that the target image belongs to at least one image category.
  4. The method according to any one of claims 1 to 3, wherein the inverse Fourier transform is an inverse fast Fourier transform.
  5. The method according to any one of claims 1 to 3, wherein the formula of the inverse Fourier transform is:
    $$x_n = \frac{1}{N}\sum_{k=0}^{N-1} x_k \, e^{i 2\pi k n / N}$$
    wherein x_n is the time-domain discrete signal, x_k is the frequency-domain discrete signal, N is the number of time-domain sampling points, and k is the current sampling point.
  6. The method according to any one of claims 1 to 3, wherein preprocessing the target image comprises:
    performing block segmentation on the target image to generate a block segmentation result;
    processing the block segmentation result through a linear embedding layer to generate the feature map in the preset format.
  7. An apparatus for image classification, wherein the apparatus comprises a processor and a memory, the memory being configured to store a computer program, and the processor being configured to call and run the computer program from the memory so that the apparatus performs the method of any one of claims 1 to 6.
  8. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the processor is caused to perform the method of any one of claims 1 to 6.
PCT/CN2021/119682 2021-09-22 2021-09-22 Image classification method and apparatus WO2023044612A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/119682 WO2023044612A1 (en) 2021-09-22 2021-09-22 Image classification method and apparatus


Publications (1)

Publication Number Publication Date
WO2023044612A1 true WO2023044612A1 (en) 2023-03-30

Family ID: 85719789

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119682 WO2023044612A1 (en) 2021-09-22 2021-09-22 Image classification method and apparatus

Country Status (1)

Country Link
WO (1) WO2023044612A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091340A (en) * 2014-07-18 2014-10-08 厦门美图之家科技有限公司 Blurred image rapid detection method
CN109712119A (en) * 2018-12-13 2019-05-03 深圳先进技术研究院 A kind of magnetic resonance imaging and patch recognition methods and device
CN109964250A (en) * 2016-12-12 2019-07-02 德州仪器公司 For analyzing the method and system of the image in convolutional neural networks
US20190230380A1 (en) * 2018-01-25 2019-07-25 Fujitsu Limited Data compression apparatus and data compression method
CN111012336A (en) * 2019-12-06 2020-04-17 重庆邮电大学 Parallel convolutional network motor imagery electroencephalogram classification method based on spatio-temporal feature fusion
CN112232448A (en) * 2020-12-14 2021-01-15 北京大恒普信医疗技术有限公司 Image classification method and device, electronic equipment and storage medium
CN113361636A (en) * 2021-06-30 2021-09-07 山东建筑大学 Image classification method, system, medium and electronic device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21957756

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE