WO2024066697A1 - Image processing method and related apparatus

Image processing method and related apparatus

Info

Publication number: WO2024066697A1
Authority: WO — WIPO (PCT)
Application number: PCT/CN2023/108785
Prior art keywords: image, features, feature, processed, global
Other languages: English (en), French (fr)
Inventors: 蒋兴华, 刘皓, 李鑫, 姜德强
Original assignee: 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024066697A1


Classifications

    • G06V 10/764 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06N 3/02 — Neural networks; G06N 3/08 — Learning methods (computing arrangements based on biological models)
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 — Arrangements for image or video recognition or understanding using neural networks

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to image processing technology.
  • The present application relates in particular to image recognition technology, which can be used to classify images.
  • Image classification can include, for example, image content classification, recognition of text in images, and identification of whether images are compliant.
  • In the related art, a transformer-based solution is used to learn image features of the image to be processed, and the classification result is then determined based on the learned image features.
  • the present application provides an image processing method and related devices, which can accurately classify the image to be processed, thereby improving the classification ability and classification effect.
  • an embodiment of the present application provides an image processing method, which is executed by a computer device, and the method includes: acquiring the image to be processed; performing vectorization processing based on the image to be processed to obtain an image representation vector of the image to be processed; and performing feature mapping on the image representation vector through a network unit included in a feature mapping module in an image classification model to obtain image features of the image to be processed;
  • the classification module in the image classification model is used to perform category prediction based on the image features to obtain a classification result of the image to be processed.
  • an embodiment of the present application provides an image processing device, the device comprising an acquisition unit, a determination unit, a mapping unit and a prediction unit:
  • the acquisition unit is used to acquire the image to be processed
  • the determining unit is used to perform vectorization processing based on the image to be processed to obtain an image representation vector of the image to be processed;
  • the mapping unit is used to perform feature mapping on the image representation vector through a network unit included in a feature mapping module in an image classification model to obtain image features of the image to be processed;
  • the mapping unit is specifically used to, in the process of obtaining the image features through the network unit, perform global feature mapping on the input content through the network layer to obtain global features, and perform local feature mapping on the input content through the network layer to obtain local features, wherein the input content is obtained according to the image representation vector; perform feature fusion on the global features and the local features through the network layer to obtain fused features corresponding to the network layer; and obtain the image features based on the fused features corresponding to the network layer;
  • the prediction unit is used to perform category prediction based on the image features through the classification module in the image classification model to obtain the classification result of the image to be processed.
  • an embodiment of the present application provides a computer device, the computer device comprising a processor and a memory:
  • the memory is used to store program code and transmit the program code to the processor
  • the processor is configured to execute the method described in any one of the preceding aspects according to instructions in the program code.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium is used to store program code, and when the program code is executed by a processor, the processor executes the method described in any of the aforementioned aspects.
  • an embodiment of the present application provides a computer program product, including a computer program, which implements the method described in any of the aforementioned aspects when executed by a processor.
  • vectorization processing can be performed based on the image to be processed to obtain the image representation vector of the image to be processed.
  • the network unit included in the feature mapping module in the image classification model performs feature mapping on the image representation vector to obtain the image features of the image to be processed.
  • the input content obtained according to the image representation vector is respectively subjected to global feature mapping and local feature mapping through the network layer to obtain global features and local features, and the global features and local features are subjected to feature fusion through the network layer to obtain the fusion features corresponding to the network layer, and the image features are obtained based on the fusion features corresponding to the network layer. Since the final image features are obtained by fusion of multiple features in the same network layer, local features and global features can be learned at the same time, solving the problem caused by not paying attention to local features in the transformer solution, and improving the fusion ability between multiple features.
  • the classification result is obtained by performing category prediction based on the image features through the classification module, since the local features and global features are fused in the image features, even when facing images with similar object appearance features, the image to be processed can be accurately classified, thereby improving the classification ability and classification effect.
  • FIG1 is a framework diagram of an image processing method provided by the related art
  • FIG2 is an application scenario architecture diagram of an image processing method provided by an embodiment of the present application.
  • FIG3 is a flow chart of an image processing method provided in an embodiment of the present application.
  • FIG4 is an example diagram of a feature relationship between words in a learning language sequence provided by an embodiment of the present application.
  • FIG5 is an example diagram of cutting and dividing an image to be processed provided by an embodiment of the present application.
  • FIG6 is an example diagram of performing Flatten processing on an image block provided by an embodiment of the present application.
  • FIG7 is a diagram showing an example of a network layer structure provided in an embodiment of the present application.
  • FIG8 is an example diagram of an image to be processed including objects of different sizes provided by an embodiment of the present application.
  • FIG9 is a structural example diagram of an image classification model including multiple network units provided in an embodiment of the present application.
  • FIG10 is a diagram showing an exemplary principle of local-attention provided in an embodiment of the present application.
  • FIG11 is a flowchart of a hierarchical feature fusion solution provided by the related art.
  • FIG12 is a schematic diagram showing a convolutional neural network according to an embodiment of the present application.
  • FIG13 is a flowchart illustrating a process for determining fusion features according to an embodiment of the present application.
  • FIG14 is an example diagram of a valid feature and an invalid feature provided in an embodiment of the present application.
  • FIG15 is a flowchart illustrating an example of a process for determining a weight value provided in an embodiment of the present application.
  • FIG16 is a schematic diagram showing a fully connected layer according to an embodiment of the present application.
  • FIG17 is a schematic diagram showing a principle of a softmax function provided in an embodiment of the present application.
  • FIG18 is an example diagram of the overall framework of an image processing method provided in an embodiment of the present application.
  • FIG19 is a structural diagram of an image processing device provided in an embodiment of the present application.
  • FIG20 is a structural diagram of a terminal provided in an embodiment of the present application.
  • FIG21 is a structural diagram of a server provided in an embodiment of the present application.
  • Image classification is a problem of describing the category of image content. For example, if the image content is an elephant, the model or algorithm needs to identify it as an elephant.
  • image classification can include, for example, image content classification, recognition of text in the image, and identification of whether the image is compliant.
  • Image content classification can, for example, be identification of which category the object included in the image belongs to. Categories can include, for example, birds, balls, cars, etc.
  • Identification of text in an image can, for example, be identification of what text is included in the image. At this time, each known text can be taken as a category to identify which category the text included in the image belongs to, and then identify what the text in the image is. When identifying whether an image is compliant, the category can be, for example, compliant or non-compliant, so as to identify which category the image belongs to.
  • a transformer-based solution can be used to learn image features of the processed image, and then determine the classification result based on the learned image features.
  • the image to be processed is input into the encoder (Encoder) of the Transformer framework, which can be called Transformer Encoder.
  • Transformer Encoder can be a network layer containing multiple self-attention mechanisms.
  • the Transformer Encoder is used to learn the features of the image to be processed, and the classification result is obtained by classifying, via a classifier, based on the learned image features.
  • the classifier can be an MLP Head, which is a layer structure used for classification.
  • the categories can include birds, balls, cars, etc., and the classification results can be determined based on the probability values output on each category.
  • the transformer scheme usually only uses the global self-attention method to learn image features. This method does not focus on learning local image features. If the objects included in the image to be processed have similar appearance features to those included in other images, it is difficult to accurately classify the image to be processed in this way. That is, this method is insufficient in classifying images with similar object appearance features and the classification effect is poor.
  • the embodiment of the present application provides an image processing method, which fuses multiple features in the same network layer to obtain the final image features, so it can learn local features and global features at the same time, solve the problem caused by the transformer solution not paying attention to local features, and at the same time improve the fusion ability between multiple features, and solve the problem of unintelligent feature fusion in the layered feature fusion solution.
  • the classification result is obtained by prediction based on image features through the classification module, since the local features and global features are fused in the image features, even when facing images with similar object appearance features, the image to be processed can be accurately classified, thereby improving the classification ability and classification effect.
  • the image processing method provided in the embodiment of the present application is applicable to various image classification scenarios, including image content classification, recognition of text in images, identification of whether images are compliant, video classification (in this case, the image to be processed can be an image frame in the video, and the classification result of the video is obtained by processing the image to be processed), etc.
  • the image processing method provided in the embodiment of the present application can be executed by a computer device, which can be, for example, a server or a terminal.
  • the server can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
  • the terminal includes but is not limited to a smart phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, etc.
  • the method provided in the embodiments of the present application can be applied to various scenarios such as cloud technology, artificial intelligence, smart transportation, and assisted driving.
  • FIG 2 shows an application scenario architecture diagram of an image processing method provided in an embodiment of the present application.
  • the application scenario is introduced by taking the server executing the image processing method provided in an embodiment of the present application as an example.
  • the application scenario may include a server 200, and when the image to be processed needs to be classified, the server 200 may obtain the image to be processed.
  • the image to be processed is an image to be classified, and the image to be processed may be an image directly captured, an existing image stored in the server 200, or an image frame obtained by dividing a video into frames, etc.
  • the image to be processed may include an object, and the object may be various objects, animals, plants, text, etc.
  • An image classification model may be run in the server 200.
  • the image classification model may include a feature mapping module and a classification module.
  • the feature mapping module includes a network unit (block), and a network unit includes a network layer (layer).
  • the server 200 may perform vectorization processing based on the image to be processed through the image classification model to obtain an image representation vector of the image to be processed, and then perform feature mapping on the image representation vector through the network unit to obtain image features of the image to be processed.
  • the server 200 performs global feature mapping and local feature mapping on the input content obtained according to the image representation vector through the network layer to obtain global features and local features, and performs feature fusion on the global features and local features through the network layer to obtain the fusion features corresponding to the network layer, and obtains the image features based on the fusion features corresponding to the network layer. Since multiple features are fused in the same network layer to obtain the final image features, local features and global features can be learned at the same time, solving the problem caused by not paying attention to local features in the transformer solution, and at the same time improving the fusion capability between multiple features.
  • the server 200 can predict and obtain classification results based on the image features through the classification module. Since the image features integrate local features and global features, even when facing images with similar object appearance features, the image to be processed can be accurately classified, thereby improving the classification ability and classification effect.
  • the method provided in the embodiments of the present application mainly involves computer vision technology (Computer Vision, CV) and machine learning technology (Machine Learning, ML) in the field of artificial intelligence.
  • FIG. 3 shows a flow chart of an image processing method, the method comprising:
  • the server may obtain the image to be processed.
  • the image to be processed obtained by the server needs to be input into the image classification model so as to classify the image to be processed.
  • the image input into the image classification model can be set to a preset resolution (such as 224*224) according to different business content needs.
  • the image to be processed obtained by the server can be an image with a resolution equal to the preset resolution.
  • the resolution of the original image obtained by the server may be inconsistent with the preset resolution.
  • The server may obtain the image to be processed as follows: after obtaining the original image, the server first determines whether the resolution of the original image is consistent with the preset resolution. If they are consistent, the server may directly use the original image as the image to be processed; if they are inconsistent, the server may resize the original image so that its resolution equals the preset resolution, and then use the resized image as the image to be processed.
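As a concrete illustration of this resize step, the following is a minimal sketch (not part of the patent disclosure), assuming Pillow and the 224*224 preset resolution from the example above; the constant and function names are illustrative.

```python
from PIL import Image

PRESET_RESOLUTION = (224, 224)  # example preset resolution from the text

def obtain_image_to_process(path: str) -> Image.Image:
    """Load an original image; resize it only if its resolution differs from the preset."""
    original = Image.open(path).convert("RGB")
    if original.size != PRESET_RESOLUTION:
        original = original.resize(PRESET_RESOLUTION, Image.BILINEAR)
    return original
```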
  • S302 Perform vectorization processing based on the image to be processed to obtain an image representation vector of the image to be processed.
  • After the server obtains the image to be processed, it can input the image to be processed into the image classification model, so that the image classification model performs vectorization processing based on the image to be processed to obtain an image representation vector.
  • the server can also use a related image vector conversion model to perform vectorization processing on the image to be processed to obtain an image representation vector;
  • the image vector conversion model is a model for converting an image into a corresponding image representation vector, which is independent of the above-mentioned image classification model.
  • the embodiments of the present application do not impose any limitation on this.
  • the image classification model is a neural network model, for example, it can be a Transformer model, or it can be other models that can implement the method provided in the embodiment of the present application.
  • the embodiment of the present application does not limit the structure of the image classification model.
  • the image classification model can be pre-trained. During its training process, training samples are first obtained. The processing of the training samples is similar to the processing of the image to be processed in the embodiment of the present application, except that the model is optimized and adjusted based on the prediction results until an image classification model that meets the requirements is obtained.
  • the Transformer model is a language model based on the self-attention mechanism, the core of which is the self-attention mechanism, which can learn the intrinsic dependency between data.
  • Figure 4 shows a schematic diagram of learning the feature relationship between words in a language sequence.
  • The language sequence can be "The animal didn't cross the street because it was too tired.", for which the feature relationship between words is learned. The feature relationship can be the intrinsic dependency between words (i.e., data) learned by the Transformer model using the self-attention mechanism.
  • the image (such as the image to be processed) can be divided into several image blocks (patches), and the patches are used as units in the image block sequence to learn the feature relationship between the units in the image block sequence.
  • the image to be processed in Figure 5 is divided into 3*3 patches, and the 3*3 patches are arranged in order to obtain an image block sequence, and then the feature relationship between the patches in the image block sequence is learned.
  • a method of performing vectorization processing on the image to be processed to obtain an image representation vector of the image to be processed can be: cutting the image to be processed into blocks according to the block size (patch_size) to obtain multiple image blocks, performing data structure mapping on the multiple image blocks to obtain one-dimensional structural data of the image to be processed, and then performing vectorization processing on the one-dimensional structural data to obtain the image representation vector.
  • patch_size is the size of each image block, and patch_size can be preset according to actual needs. Usually, patch_size must divide the preset resolution evenly. For example, if the preset resolution is 224*224, the patch size can be 8, 14, 16, etc. If patch_size is 16, the image to be processed is divided into 14*14 patches in total. In a possible implementation, the embodiment of the present application cuts the image to be processed into 56*56 patches, with a patch_size of 4.
  • the data structure mapping method may include multiple methods, such as flattening processing and linear projection processing.
  • Flattening can be performed once for each patch after cutting, so that it is mapped from two-dimensional structure data to one-dimensional structure data suitable for image classification model processing.
  • Figure 6 takes the image to be processed being cut into blocks to obtain 3*3 patches as an example, see 601 in Figure 6, wherein each patch can be identified by a number, which are 0, 1, 2, ..., 7, 8 respectively, and each patch is flattened once, so that the 3*3 patches are arranged in order into a sequence to obtain one-dimensional structure data, see 602 in Figure 6.
  • For example, when the number of patches is configured to be 56*56, the size of a single patch is 4*4, and the input image is a 3-channel color image, the two-dimensional structure data of each patch is [3,4,4], which is mapped to one-dimensional structure data of length 128 after Flatten processing.
  • the length of the one-dimensional structure data can be set according to the business.
  • Linear projection (Linear Projection) processing may refer to mapping multi-dimensional structure data of C*H*W (number of channels * height * width) into one-dimensional structure data for subsequent learning.
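A minimal sketch of the cutting-and-flattening step described above (illustrative PyTorch, not the patent's code). Note that with patch_size 4 each [3,4,4] patch flattens to 48 raw values; mapping to the 128-length data mentioned in the text presumably involves a subsequent learned projection such as nn.Linear(48, 128), which is an assumption here.

```python
import torch

def cut_and_flatten(image: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Cut a [C, H, W] image into patches and flatten each patch to one dimension.

    For a 3*224*224 image with patch_size 4 this yields 56*56 = 3136 patches,
    each flattened from [3, 4, 4] to a 48-dimensional vector.
    """
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # [C, H/p, W/p, p, p] -> [num_patches, C * p * p]
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    return patches

image = torch.randn(3, 224, 224)
print(cut_and_flatten(image).shape)  # torch.Size([3136, 48])
```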
  • the position and order of characters in a sentence are very important. They are not only part of the grammatical structure of a sentence, but also an important concept for expressing semantics. If a character is in a different position or order in a sentence, the meaning of the entire sentence may be deviated. Therefore, when the Transformer model is applied to a language model, it is a very important strategy to determine the position encoding that can reflect the position of a word in a sentence.
  • the Transformer model when it is applied to the image field, that is, when it is applied to the method provided in the embodiment of the present application, it is still necessary to refer to the position encoding for image classification, that is, the position encoding of the image block is still an important strategy for image classification tasks.
  • the position vectors corresponding to each of the multiple image blocks can also be obtained.
  • When the one-dimensional structure data is vectorized to obtain the image representation vector, the one-dimensional structure data can be vectorized to obtain the block vector corresponding to each image block, and the image representation vector is then obtained based on the block vector corresponding to each image block (obtained through the image block embedding (patch embedding) module) and the position vector (obtained through the position embedding (pos embedding) module).
  • Since the image representation vector is obtained according to the position vector, the position information of each image block in the image to be processed can be used in subsequent processing, so that the information on which the subsequent classification is based is richer, thereby improving the classification ability and classification accuracy.
  • the image classification model includes a feature mapping module and a classification module
  • the feature mapping module may include a network unit (block)
  • a network unit may include a network layer (layer).
  • a block may include at least one layer, and the neural network needs to perform feature mapping repeatedly from shallow to deep.
  • Each feature mapping can be called a layer, as shown in Figure 7, which shows a neural network including 4 layers.
  • the feature mapping module may include at least one block.
  • the objects in the image to be processed may be large or small.
  • the objects in the images to be processed shown in (a), (b), and (c) in FIG8 are birds, and the sizes of the birds in the images to be processed shown in (a), (b), and (c) are different.
  • The commonly used feature mapping module may include multiple blocks, so that recognition learning for the image to be processed can be carried out at different scales, learning features of different scales, where each block corresponds to a scale. For example, if the resolution of the image to be processed is 224*224, features can be learned at the 56*56 scale, 28*28 scale, and 14*14 scale respectively.
  • the feature mapping module of the image classification model can include multiple blocks, and a single block contains multiple layers.
  • the structure of a typical image classification model is shown in Figure 9: This image classification model contains four network units, and the image resolutions processed by each block are 1/4, 1/8, 1/16, and 1/32 respectively. Each block can contain different numbers of layers according to its own network characteristics.
  • the server may perform feature mapping on the image representation vector through a feature mapping module to obtain image features of the image to be processed. Specifically, the server may perform feature mapping on the image representation vector through a network unit included in the feature mapping module to obtain image features of the image to be processed.
  • a feature mapping module may include multiple network units, each of which is connected to a down-sampling module, wherein each network unit is used to perform feature mapping to obtain a feature map at a corresponding scale, and the down-sampling module is used to reduce the scale of the feature map output by the connected network unit, thereby obtaining a feature map of another scale, so that the next network unit can learn features at another scale to obtain a feature map.
  • the number of network units can be determined according to actual business needs, and the embodiment of the present application does not limit the number of network units.
  • the first network unit can perform feature mapping on the image representation vector to obtain a first feature map of the image to be processed, and then the first feature map can be downsampled by the downsampling module connected to the first network unit to obtain a feature map of the first scale.
  • The second network unit performs feature mapping on the feature map of the first scale to obtain a second feature map of the image to be processed, and the downsampling module connected to the second network unit then downsamples the second feature map in the same way to obtain a feature map of the second scale; the image features of the image to be processed are then obtained according to the feature map of the second scale.
  • the feature mapping module also includes other network units, after obtaining the feature map of the second scale, the above method can be used in sequence until the last network unit and its connected downsampling module complete the processing to obtain the final image features.
  • the feature map can also be called a feature mapping map.
  • the data obtained after processing/mapping the data (such as the image representation vector of the embodiment of the present application) by a certain method is generally called a feature map.
  • a neural network is an algorithm system that maps high-dimensional data to low-dimensional data through multiple feature mappings.
  • a color image of size 1024*1024 (such as the image to be processed) has an original dimension of 3*1024*1024. If 100 categories are set, then after layers of feature mapping, it will eventually become 100 dimensions, where the value of each dimension corresponds to the probability value of a category, and the size of the probability value reflects the probability of the image to be processed being judged as that category.
  • the feature mapping module consists of 1 to n blocks (typically 4 blocks, namely block1, block2, block3, and block4), each of which contains different numbers of layers with the same structure.
  • a downsampling module is required after each block to reduce the size of the feature map by one half.
  • the size transformation in the process of obtaining image features is as follows: the size of the image to be processed is 3*224*224, the size after cutting and dividing into patches is 3*56*56*4*4, and the size after patch embedding and pos embedding is 56*56*128.
  • After block1 is processed, the size of the feature map after downsampling by the downsampling module is 28*28*256; after block2, it is 14*14*512; after block3, it is 7*7*1024; and after block4 is processed, the final feature size is 1*1024.
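The size progression above can be reproduced with a downsampling module that halves the spatial size and doubles the channel count. The patent does not specify the downsampling operator, so the 2*2 strided convolution below is an assumption (patch merging is another common choice); this is a sketch, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve the spatial size of a [B, C, H, W] feature map and double its channels."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, in_dim * 2, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reduce(x)

# Progression from the text: 56*56*128 -> 28*28*256 -> 14*14*512 -> 7*7*1024
x = torch.randn(1, 128, 56, 56)
for dim in (128, 256, 512):
    # ...each block's layers would run at this scale before downsampling...
    x = Downsample(dim)(x)
print(x.shape)  # torch.Size([1, 1024, 7, 7]); global pooling then yields the 1*1024 feature
```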
  • the embodiment of the present application can learn image features at multiple scales by using multiple network units, so that objects of different scales in the processed image can be accurately identified, thereby improving classification and recognition capabilities.
  • the network layer performs global feature mapping on the input content to obtain global features, and the network layer performs local feature mapping on the input content to obtain local features, and the input content is obtained according to the image representation vector; the network layer performs feature fusion on the global features and the local features to obtain the fused features corresponding to the network layer; and the image features are obtained based on the fused features corresponding to the network layer.
  • the way of processing the input content in the layer is the core content of the embodiment of the present application.
  • In the same network layer, the input content can be processed in multiple ways to obtain multiple features; the multiple features are fused to obtain fused features, and the final image features are then obtained based on the fused features corresponding to the network layer.
  • the input content can be mapped globally through the network layer to obtain global features, and the input content can be mapped locally through the network layer to obtain local features, and then the global features and local features can be fused to obtain the fused features corresponding to the network layer, and then the image features can be obtained based on the fused features corresponding to the network layer.
  • The input content is obtained based on the image representation vector. If the network layer is the first network layer, its input content may be the image representation vector itself; if the network layer is a subsequent network layer, since the image representation vector has already been processed by the network layers before it, the input content of the network layer may be the result obtained by the previous network layer, for example, a feature map obtained by the previous network layer.
  • Global feature mapping can be implemented by any method that can obtain global features.
  • For example, global feature mapping can be implemented by a global-attention mechanism, such as the self-attention mechanism (self-attention). Each unit in self-attention learns the feature relationship with all other units, so self-attention is a global attention mechanism.
  • each image block in Figure 5 can be used as a unit, so that for each image block, the feature relationship between the image block and all other image blocks is learned.
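As a sketch of global self-attention over patch tokens (illustrative PyTorch; the embedding size and head count are assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

# Global attention: every patch token attends to all other tokens.
embed_dim, num_heads = 128, 4
global_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tokens = torch.randn(1, 56 * 56, embed_dim)                # one token per image block
global_features, _ = global_attn(tokens, tokens, tokens)   # [1, 3136, 128]
```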
  • Local feature mapping can be implemented by any method that can obtain local features, for example, local feature mapping can be implemented by a local-attention mechanism.
  • When the number of units included in the sequence is particularly large, the sequence can be grouped, self-attention is used within each group, and the model parameters are shared between the groups. This method only learns the feature relationships within a certain local area, and the mechanism is called local-attention.
  • the entire image (such as the image to be processed) is divided into four blocks, which can be seen in Figure 10 as 1001, 1002, 1003, and 1004. Each block is divided into multiple image blocks. For example, the block shown in 1001 is divided into 4*4 image blocks. The other three blocks are similar.
  • Self-attention is used for multiple image blocks in each block.
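A sketch of the grouping step behind local-attention (illustrative PyTorch; the helper name is hypothetical, and the window size of 4 matches the 4*4 image blocks in the figure):

```python
import torch

def window_partition(tokens: torch.Tensor, h: int, w: int, win: int) -> torch.Tensor:
    """Group a [B, H*W, C] token sequence into non-overlapping win*win windows.

    Self-attention then runs inside each window independently, with
    parameters shared across windows (local-attention).
    """
    b, _, c = tokens.shape
    x = tokens.reshape(b, h // win, win, w // win, win, c)
    x = x.permute(0, 1, 3, 2, 4, 5)        # [B, nH, nW, win, win, C]
    return x.reshape(-1, win * win, c)     # each row: one window of tokens

windows = window_partition(torch.randn(1, 56 * 56, 128), 56, 56, 4)
print(windows.shape)  # torch.Size([196, 16, 128]) -> attention runs per 4*4 window
```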
  • the hierarchical feature fusion scheme uses two methods to learn image features: global-attention and local-attention.
  • Each network layer specializes in learning one kind of feature, and different layers alternate between the two. For example, network layer 1 uses the global attention mechanism to learn global features, and network layer 2 uses the local attention mechanism to learn local features. Due to the alternating use of local-attention and global-attention, the fusion ability between the two kinds of image features is weak; in addition, the semantic scopes relied on at different spatial positions differ: some features are only local features, while others are global features.
  • the hierarchical feature fusion scheme learns global features or local features each time. As a result, the network layer that learns local features will ignore global information, while the network layer that learns global features will ignore local information.
  • the embodiments of the present application can obtain multiple features such as global features and local features in the same network layer, and fuse multiple features in the same layer to solve the problem that the transformer does not focus on local features. At the same time, it solves the problems of unintelligent feature fusion, weak feature fusion capability, and incomplete learning features in the layered feature fusion scheme.
  • the various features learned by the embodiment of the present application at the same network layer may include not only local features and global features, but also other features.
  • the embodiment of the present application does not limit the number and type of the various features learned by the same network layer.
  • the various features learned by the same network layer may also include convolutional features. Specifically, the convolutional features can be obtained by performing convolutional feature mapping on the image representation vector through the network layer, and then the global features, local features and convolutional features can be fused through the network layer to obtain the fused features corresponding to the network layer.
  • Convolutional feature mapping can be realized by a convolutional neural network (CNN), a type of feedforward neural network that includes convolution or related computations and has a deep structure; its core is the convolution operator.
  • The features obtained by the convolution operator processing the image through the convolution kernel are called convolution features (i.e., CNN features).
  • CNN features are a kind of features for learning local dependencies.
  • In Figure 12, 1201 is the image to be processed; by applying the convolution kernel shown in 1202, the convolution feature shown in 1203 can be obtained. From Figure 12, it can be seen that each feature unit (token) has a relationship with the 8 units around it.
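A minimal sketch of such a convolutional feature mapping (illustrative PyTorch; the channel count is an assumption). The 3*3 kernel is what relates each token to its 8 neighbours:

```python
import torch
import torch.nn as nn

# A 3*3 convolution kernel relates each feature unit (token) to the 8 units around it.
conv = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1)

feature_map = torch.randn(1, 128, 56, 56)   # [B, C, H, W]
cnn_features = conv(feature_map)            # same spatial size; local dependencies only
```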
  • An example diagram of the processing flow for determining fusion features can be seen in Figure 13: convolutional features are obtained through a convolutional neural network, global features are obtained through a global attention mechanism, and local features are obtained through a local attention mechanism; the convolutional features, global features, and local features are then fused to obtain the fusion features.
  • the feature fusion method can be to perform weighted summation of multiple features.
  • the weight values corresponding to the multiple features can be calculated based on the features, or can be pre-set. In one possible case, the sum of the weight values of the multiple features can be 1.
  • One way to fuse the global features and local features through the network layer to obtain the fused features corresponding to the network layer is as follows: determine the weight values corresponding to the global features and the local features respectively, where the sum of the weight value of the global features and the weight value of the local features is 1, and then perform a weighted summation of the global features and the local features according to the weight values to obtain the fused features.
  • When convolution features are also learned, the weight values corresponding to the convolution features, the global features, and the local features are determined, the sum of these weight values is 1, and the three features are weighted and summed according to the weight values to obtain the fused features, as sketched below.
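The weighted summation can be sketched as follows (illustrative PyTorch; the shapes and the example weight vector, e.g. [0.2, 0.5, 0.3], are assumptions):

```python
import torch

def fuse(features, weights):
    """Weighted summation of same-shaped features; the weight values sum to 1."""
    return sum(w * f for w, f in zip(weights, features))

conv_f, global_f, local_f = (torch.randn(1, 3136, 128) for _ in range(3))
fused = fuse([conv_f, global_f, local_f], torch.tensor([0.2, 0.5, 0.3]))  # soft fusion
```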
  • the feature fusion methods may vary based on the different distribution of weight values of multiple features.
  • One is the soft fusion method, which uses a vector whose components sum to 1 (each component is the weight value corresponding to one feature) to perform a weighted summation of the multiple features. For example, if there are three features, the vector of weight values can be expressed as [0.2, 0.5, 0.3]; if there are two features, it can be expressed as [0.4, 0.6].
  • some features may be invalid or harmful features, as shown in Figure 14, which shows what are valid features and invalid features.
  • the classification target of the image to be processed is the dog in the image.
  • the image features that are helpful for identifying the dog are basically within the circular dotted frame, and the features outside the circular dotted frame are basically useless or harmful (which will cause ambiguity in the model).
  • the rectangular dotted frame in Figure 14 is also an area that the model believes needs to be paid attention to after calculation, but it is obvious that this area is not helpful for identifying the dog, so the features corresponding to this area are also invalid, and it is necessary to avoid passing it to the next network layer.
  • Another way of feature fusion can be to set the weight value of a feature to 1 and the other weight values to 0. Taking multiple features including global features and local features as an example, one of the weight values of the global features and the weight values of the local features is 1, and the other weight values are 0.
  • The above feature fusion method can be called hard fusion. Hard fusion is also a weighted summation; the difference from soft fusion is that the weight vector is in one-hot form, that is, only one component is 1 and the rest are 0, such as [0,1,0].
  • the weight values corresponding to the global features and the local features may be determined by adding the global features and the local features to obtain the added features, and then performing probability estimation based on the added features through a probability estimation module to obtain the weight values corresponding to the global features and the local features.
  • the added features can be obtained by adding the convolution features, global features and local features, and then the probability estimation module can be used to perform probability estimation based on the added features to obtain the weight values corresponding to the convolution features, global features and local features.
  • the addition can be performed in units of each image block (patch), and the added features corresponding to each patch are obtained after each patch is added, thereby forming the final added features.
  • The fully connected layer can be used to reduce the feature dimension to 3, and the probability estimation module can then perform probability estimation to obtain the weight value corresponding to each of the convolution features, global features, and local features.
  • The fully connected layer is generally used at the end of a classification network to map the feature dimension to the number of categories, followed by the probability estimation module for probability estimation. For example, if the feature dimension of the feature mapping output is 1024 and there are 100 categories to classify, the fully connected layer maps the 1024-length feature to a 100-length feature, and the probability estimation module then estimates the probability distribution of the image to be processed over the 100 categories, where the category with the largest probability value is the category judged by the network. As shown in Figure 16, in the fully connected layer the black parts represent the corresponding features (such as the head, feet, body, tail, legs, etc.).
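Putting the last few bullets together, here is a hedged sketch of the weight-estimation path: add the three features per patch, reduce the feature dimension to 3 with a fully connected layer, then estimate probabilities. The module name, dimensions, and the softmax choice (the soft-fusion variant discussed next) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionWeightEstimator(nn.Module):
    """Estimate one weight per feature type from the sum of the three features."""

    def __init__(self, embed_dim: int = 128, num_features: int = 3):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_features)   # reduce feature dimension to 3

    def forward(self, conv_f, global_f, local_f):
        added = conv_f + global_f + local_f   # addition performed per patch: [B, N, embed_dim]
        logits = self.fc(added)               # [B, N, 3]
        return F.softmax(logits, dim=-1)      # soft fusion: per-patch weights summing to 1

# Usage: weights[..., 0:1]*conv_f + weights[..., 1:2]*global_f + weights[..., 2:3]*local_f
```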
  • the probability estimation module will be different depending on the fusion method used.
  • the probability estimation module can be implemented by a softmax function, that is, the probability estimation is performed by a softmax function to obtain a soft probability distribution with a sum of 1, and each probability value can be used as the weight value of the corresponding feature.
  • the softmax function is a normalized exponential function, which is generally used to estimate the probability distribution of categories after the fully connected layer.
  • the softmax function is also used to estimate the probability of different features.
  • the fusion method of features using the softmax function is called soft fusion.
  • the principle example diagram of the softmax function can be seen in Figure 17.
  • y_i can be calculated using the following formula: y_i = e^{z_i} / (e^{z_1} + e^{z_2} + e^{z_3}), where e^{z_1}, e^{z_2}, and e^{z_3} are intermediate processing parameters of the softmax function.
  • the probability estimation module can be implemented by the gumbel-softmax function. That is, the probability estimation is performed by the gumbel-softmax function, and a one-hot probability distribution is obtained, that is, only one probability value is 1, and the other probability values are 0. Each probability value can be used as the weight value of the corresponding feature.
  • Like softmax, the probability values estimated by gumbel-softmax form a probability distribution whose sum is 1. The difference is that the distribution estimated by gumbel-softmax is in one-hot form, that is, only one probability value is 1 and the rest are all 0. It can be understood as an upgraded version of softmax: its weights still sum to 1, but all of the weight is concentrated on a single probability value.
  • the probability values obtained in Figure 15 are 0, 1, and 0 respectively. According to their arrangement order and the arrangement order of multiple features input in Figure 15, it can be determined that the weight value of the convolution feature and the weight value of the local feature are 0, and the weight value of the global feature is 1.
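The two probability-estimation variants differ only in the function applied to the logits; a sketch using PyTorch's built-in gumbel_softmax (the shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 3136, 3)                        # output of the weight estimator
soft_w = F.softmax(logits, dim=-1)                      # soft fusion, e.g. [0.2, 0.5, 0.3]
hard_w = F.gumbel_softmax(logits, tau=1.0, hard=True)   # hard fusion, one-hot, e.g. [0, 1, 0]
```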
  • the server can obtain the classification result by performing category prediction through the classification module in the image classification model based on the finally obtained image features.
  • the image classification model can also include a fully connected layer.
  • Specifically, a fully connected calculation is performed on the image features through the fully connected layer to map the image features to a length equal to the number of categories, and the classification module then performs prediction based on the mapped image features to obtain the classification result. The length may be determined according to the specific business; 1000 is a commonly used number of categories, so 1*1000 may be used as the mapped length.
  • the classification module may be implemented based on the softmax function. Through the softmax calculation, the probability value in each category is obtained, and then the category with the largest probability value is determined as the classification result.
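A sketch of this classification head (illustrative PyTorch; the 1024-length feature and 1000 categories follow the examples in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_categories = 1000
fully_connected = nn.Linear(1024, num_categories)   # map the 1*1024 feature to 1*1000

image_features = torch.randn(1, 1024)
probs = F.softmax(fully_connected(image_features), dim=-1)   # probability per category
classification_result = probs.argmax(dim=-1)                 # category with the largest probability
```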
  • the method provided in the embodiment of the present application is superior to related technologies on the commonly used image classification evaluation data set in the industry, and has stable improvements in model parameters of different scales.
  • the method provided in the embodiment of the present application can also be used in text recognition (Optical Character Recognition, OCR) products.
  • OCR Optical Character Recognition
  • the method provided in the embodiment of the present application has stable improvements in Chinese and English, printed text and handwritten text.
  • A comparison of the effects of the method provided in the embodiment of the present application and the related technology is shown in Table 1:
  • the 2nd to 5th columns from left to right are the index values corresponding to the relevant technologies under the four indicators
  • the 6th to 9th columns are the index values corresponding to the methods provided in the embodiments of the present application under the above four indicators. It can be seen from the index values corresponding to the two schemes under the same indicators that the classification effect of the method provided in the embodiments of the present application is improved compared with the relevant technologies.
  • vectorization processing can be performed based on the image to be processed to obtain the image representation vector of the image to be processed. Then, the image representation vector is feature mapped through the network unit included in the feature mapping module in the image classification model to obtain the image features of the image to be processed.
  • the input content obtained according to the image representation vector is respectively subjected to global feature mapping and local feature mapping through the network layer to obtain global features and local features, and the global features and local features are feature fused through the network layer to obtain the fusion features corresponding to the network layer, and the image features are obtained based on the fusion features corresponding to the network layer. Since the final image features are obtained by fusing multiple features in the same network layer, local features and global features can be learned at the same time, solving the problem caused by not paying attention to local features in the transformer solution, and improving the fusion ability between multiple features.
  • the classification result is obtained by performing category prediction based on the image features through the classification module, since the local features and global features are fused in the image features, even when facing images with similar object appearance features, the image to be processed can be accurately classified, thereby improving the classification ability and classification effect.
  • the image classification model includes an image block embedding module, a position embedding module, a feature mapping module, a fully connected layer and a classification module.
  • the feature mapping module includes 4 network units, such as network unit 1, network unit 2, network unit 3 and network unit 4, each of which is connected to a downsampling module.
  • the classification module can be implemented by the softmax function, as shown in Figure 18.
  • When the image to be processed is obtained, it can be cut and divided into multiple image blocks, and data structure mapping is performed on the multiple patches, so as to map the two-dimensional structure data into one-dimensional structure data suitable for image classification model processing. Then, the image block embedding module performs patch embedding on the one-dimensional structure data to obtain block vectors, the position embedding module performs pos embedding to obtain position vectors, and the final image representation vector is obtained based on the position vectors and the block vectors.
  • the image representation vector is input into the feature mapping module, and the final image feature is obtained by processing through network unit 1, network unit 2, network unit 3, network unit 4, and the corresponding downsampling modules.
  • the method of performing feature mapping in the same network layer of a network unit can be referred to FIG13 and the corresponding introduction, which will not be repeated here.
  • The image features are passed through the fully connected layer to map them to a length equal to the number of categories, and the softmax function then estimates a probability value for each category based on the mapped features.
  • The probability values can be as shown in FIG18: 0.1, 0.1, 0.7, and 0.1 respectively; the classification result is then obtained based on the probability values, for example, the category with the largest probability value can be taken as the classification result.
  • the embodiment of the present application further provides an image processing device 1900.
  • the image processing device 1900 includes an acquisition unit 1901, a determination unit 1902, a mapping unit 1903 and a prediction unit 1904:
  • the acquisition unit 1901 is used to acquire the image to be processed
  • the determining unit 1902 is used to perform vectorization processing based on the image to be processed to obtain an image representation vector of the image to be processed;
  • the mapping unit 1903 is used to perform feature mapping on the image representation vector through a network unit included in a feature mapping module in an image classification model to obtain image features of the image to be processed;
  • the mapping unit 1903 is specifically configured to, in the process of obtaining the image feature through the network unit, perform global feature mapping on the input content through the network layer to obtain the global feature, and perform local feature mapping on the input content through the network layer to obtain the local feature, wherein the input content is obtained according to the image representation vector; perform feature fusion on the global feature and the local feature through the network layer to obtain the fusion feature corresponding to the network layer; and obtain the image feature based on the fusion feature corresponding to the network layer;
  • the prediction unit 1904 is used to perform category prediction based on the image features through the classification module in the image classification model to obtain the classification result of the image to be processed.
  • In one possible implementation, the mapping unit 1903 is specifically configured to: determine the weight values corresponding to the global feature and the local feature respectively, and perform a weighted summation of the global feature and the local feature according to the weight values to obtain the fused feature.
  • In one possible implementation, one of the weight value of the global feature and the weight value of the local feature is 1, and the other is 0.
  • In one possible implementation, the mapping unit 1903 is specifically configured to: add the global features and the local features to obtain added features, and perform probability estimation based on the added features to obtain the weight values corresponding to the global features and the local features respectively.
  • In one possible implementation, the mapping unit 1903 is further configured to: perform convolutional feature mapping on the input content through the network layer to obtain convolutional features, and fuse the global features, the local features, and the convolutional features through the network layer to obtain the fused features corresponding to the network layer.
  • In one possible implementation, the feature mapping module includes a plurality of network units, each of which is connected to a downsampling module, and the plurality of network units include a first network unit and a second network unit; the mapping unit 1903 is specifically configured to: perform feature mapping on the image representation vector through the first network unit to obtain a first feature map, downsample the first feature map to obtain a feature map of a first scale, perform feature mapping on the feature map of the first scale through the second network unit to obtain a second feature map, downsample the second feature map to obtain a feature map of a second scale, and obtain the image features of the image to be processed according to the feature map of the second scale.
  • In one possible implementation, the determining unit 1902 is specifically configured to: cut the image to be processed into blocks according to the block size to obtain multiple image blocks, perform data structure mapping on the multiple image blocks to obtain one-dimensional structure data of the image to be processed, and perform vectorization processing on the one-dimensional structure data to obtain the image representation vector.
  • In one possible implementation, the acquiring unit 1901 is further configured to: obtain the position vectors corresponding to each of the multiple image blocks; and the determining unit 1902 is specifically configured to: perform vectorization processing on the one-dimensional structure data to obtain a block vector corresponding to each image block, and obtain the image representation vector based on the block vector and the position vector corresponding to each image block.
  • In one possible implementation, the image classification model further includes a fully connected layer, and the prediction unit 1904 is specifically configured to: perform a fully connected calculation on the image features through the fully connected layer to map them to a length equal to the number of categories, and perform category prediction based on the image features of that length to obtain the classification result.
  • vectorization processing can be performed based on the image to be processed to obtain an image representation vector of the image to be processed.
  • the image representation vector is feature mapped by the network unit included in the feature mapping module in the image classification model to obtain the image features of the image to be processed.
  • the input content obtained according to the image representation vector is respectively subjected to global feature mapping and local feature mapping through the network layer to obtain the global features and local features.
  • Features are obtained by fusion of global features and local features through the network layer to obtain the fusion features corresponding to the network layer, and image features are obtained based on the fusion features corresponding to the network layer. Since the final image features are obtained by fusion of multiple features in the same network layer, local features and global features can be learned at the same time, solving the problem caused by the lack of attention to local features in the transformer solution, and improving the fusion ability between multiple features. In this way, when the classification module is used to predict the classification results based on the image features, since the local features and global features are fused in the image features, even when facing images with similar object appearance features, the image to be processed can be accurately classified, thereby improving the classification ability and classification effect.
  • the embodiment of the present application further provides a computer device, which can execute the image processing method.
  • the computer device can be, for example, a terminal, taking a smart phone as an example:
  • FIG20 is a block diagram showing a partial structure of a smartphone provided in an embodiment of the present application.
  • the smartphone includes: a radio frequency (full name in English: Radio Frequency, English abbreviation: RF) circuit 2010, a memory 2020, an input unit 2030, a display unit 2040, a sensor 2050, an audio circuit 2060, a wireless fidelity (English abbreviation: WiFi) module 2070, a processor 2080, and a power supply 2090 and other components.
  • the input unit 2030 may include a touch panel 2031 and other input devices 2032
  • the display unit 2040 may include a display panel 2041
  • the audio circuit 2060 may include a speaker 2061 and a microphone 2062.
  • the smartphone structure shown in FIG20 does not constitute a limitation on the smartphone, and may include more or fewer components than shown in the figure, or combine certain components, or arrange the components differently.
  • the memory 2020 can be used to store software programs and modules.
  • the processor 2080 executes various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 2020.
  • the memory 2020 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area may store data created according to the use of the smartphone (such as audio data, a phone book, etc.), etc.
  • the memory 2020 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the processor 2080 is the control center of the smartphone, which uses various interfaces and lines to connect various parts of the entire smartphone, and executes various functions of the smartphone and processes data by running or executing software programs and/or modules stored in the memory 2020, and calling data stored in the memory 2020.
  • the processor 2080 may include one or more processing units; preferably, the processor 2080 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, and the modem processor mainly processes wireless communications. It is understandable that the above-mentioned modem processor may not be integrated into the processor 2080.
  • the processor 2080 in the smartphone may perform the following steps:
  • the input content is subjected to global feature mapping through the network layer to obtain global features
  • the image features are subjected to global feature mapping through the network layer.
  • the classification module in the image classification model is used to perform category prediction based on the image characteristics to obtain the classification result of the image to be processed.
  • the computer device provided in the embodiment of the present application can also be a server, as shown in FIG. 21, which is a structural diagram of a server 2100 provided in the embodiment of the present application.
  • the server 2100 may have relatively large differences due to different configurations or performances, and may include one or more processors, such as a central processing unit (CPU) 2122, and a memory 2132, one or more storage media 2130 (such as one or more mass storage devices) storing application programs 2142 or data 2144.
  • the memory 2132 and the storage medium 2130 can be temporary storage or permanent storage.
  • the program stored in the storage medium 2130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the server.
  • the central processing unit 2122 can be configured to communicate with the storage medium 2130 and execute a series of instruction operations in the storage medium 2130 on the server 2100.
  • the server 2100 may further include one or more power supplies 2126, one or more wired or wireless network interfaces 2150, one or more input and output interfaces 2158, and/or one or more operating systems 2141, such as Windows Server TM , Mac OS X TM , Unix TM , Linux TM , FreeBSD TM , etc.
  • operating systems 2141 such as Windows Server TM , Mac OS X TM , Unix TM , Linux TM , FreeBSD TM , etc.
  • the CPU 2122 in the server 2100 may perform the following steps:
  • the classification module in the image classification model is used to perform category prediction based on the image features to obtain a classification result of the image to be processed.
  • a computer-readable storage medium is provided, wherein the computer-readable storage medium is used to store program codes, and the program codes are used to execute the image processing methods described in the above-mentioned embodiments.
  • a computer program product comprising a computer program, the computer program being stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the method provided in various optional implementations of the above-mentioned embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium, including several instructions for a computer device (which can be a computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), disk or optical disk, and other media that can store computer programs.


Abstract

This application discloses an image processing method and related apparatus, applicable to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. Vectorization is performed based on an image to be processed to obtain an image representation vector of the image. Feature mapping is performed on the image representation vector by network units in an image classification model to obtain image features. In the process of obtaining the image features, in one and the same network layer of a network unit, global feature mapping and local feature mapping are performed by the network layer on input content derived from the image representation vector to obtain global features and local features; the network layer fuses the global and local features into fused features corresponding to that layer, and the image features are obtained based on the fused features. A classification module then obtains the classification result by prediction based on the image features. This application can classify the image to be processed accurately, improving classification capability and classification effect.

Description

Image processing method and related apparatus
This application claims priority to Chinese Patent Application No. 2022111737459, entitled "Image processing method and related apparatus" and filed with the China National Intellectual Property Administration on September 26, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to image processing technology.
Background
As technology develops, artificial intelligence will be applied in more and more fields and play an increasingly important role. An important branch of artificial intelligence is image recognition technology, with which images can be classified. Image classification may include, for example, classifying image content, recognizing text in images, and checking whether images are compliant.
At present, transformer-based solutions are used to learn image features of an image to be processed, and the classification result is then determined based on the learned image features.
However, if an object included in the image to be processed is similar in appearance to objects included in other images, it is difficult to classify the image accurately in this way; that is, such solutions lack the capability to classify images whose objects have similar appearance features, and the classification effect is poor.
Summary
To solve the above technical problem, this application provides an image processing method and related apparatus capable of classifying the image to be processed accurately, thereby improving classification capability and classification effect.
The embodiments of this application disclose the following technical solutions:
In one aspect, an embodiment of this application provides an image processing method, executed by a computer device, the method including:
acquiring an image to be processed;
performing vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
performing feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, performing global feature mapping on input content by the network layer to obtain global features, and performing local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; performing feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtaining the image features based on the fused features corresponding to the network layer; and
performing category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
In one aspect, an embodiment of this application provides an image processing apparatus, the apparatus including an acquisition unit, a determination unit, a mapping unit and a prediction unit:
the acquisition unit is configured to acquire an image to be processed;
the determination unit is configured to perform vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
the mapping unit is configured to perform feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
the mapping unit is specifically configured to, in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, perform global feature mapping on input content by the network layer to obtain global features, and perform local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; perform feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtain the image features based on the fused features corresponding to the network layer; and
the prediction unit is configured to perform category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
In one aspect, an embodiment of this application provides a computer device, the computer device including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to execute the method of any one of the foregoing aspects according to instructions in the program code.
In one aspect, an embodiment of this application provides a computer-readable storage medium configured to store program code which, when executed by a processor, causes the processor to execute the method of any one of the foregoing aspects.
In one aspect, an embodiment of this application provides a computer program product including a computer program which, when executed by a processor, implements the method of any one of the foregoing aspects.
As can be seen from the above technical solutions, when an image to be processed needs to be classified, vectorization can be performed based on the image to obtain its image representation vector. Feature mapping is then performed on the image representation vector by the network units included in the feature mapping module of the image classification model to obtain the image features of the image. In this process, in one and the same network layer of a network unit, global feature mapping and local feature mapping are respectively performed by the network layer on input content derived from the image representation vector, yielding global features and local features; the network layer fuses the global and local features to obtain the fused features corresponding to that layer, and the image features are obtained based on those fused features. Because multiple kinds of features are fused within the same network layer to obtain the final image features, local and global features can be learned simultaneously, which solves the problem caused by the transformer solution's neglect of local features and improves the fusion capability between multiple kinds of features. Thus, when the classification module performs category prediction based on the image features, since both local and global features are fused into the image features, the image to be processed can be classified accurately even when objects in different images have similar appearance features, improving classification capability and classification effect.
Brief Description of the Drawings
FIG. 1 is a framework diagram of an image processing method provided in the related art;
FIG. 2 is an architecture diagram of an application scenario of an image processing method provided in an embodiment of this application;
FIG. 3 is a flowchart of an image processing method provided in an embodiment of this application;
FIG. 4 is an example diagram of learning feature relationships between words in a language sequence provided in an embodiment of this application;
FIG. 5 is an example diagram of splitting an image to be processed into blocks provided in an embodiment of this application;
FIG. 6 is an example diagram of performing Flatten processing on image blocks provided in an embodiment of this application;
FIG. 7 is an example structural diagram of network layers provided in an embodiment of this application;
FIG. 8 is an example diagram of images to be processed including objects of different sizes provided in an embodiment of this application;
FIG. 9 is an example structural diagram of an image classification model including multiple network units provided in an embodiment of this application;
FIG. 10 is an example diagram of the principle of local-attention provided in an embodiment of this application;
FIG. 11 is an example processing flow diagram of a layered feature fusion solution provided in the related art;
FIG. 12 is an example diagram of the principle of a convolutional neural network provided in an embodiment of this application;
FIG. 13 is an example processing flow diagram of determining fused features provided in an embodiment of this application;
FIG. 14 is an example diagram of valid features and invalid features provided in an embodiment of this application;
FIG. 15 is an example processing flow diagram of determining weight values provided in an embodiment of this application;
FIG. 16 is an example diagram of the principle of a fully connected layer provided in an embodiment of this application;
FIG. 17 is an example diagram of the principle of the softmax function provided in an embodiment of this application;
FIG. 18 is an example overall framework diagram of an image processing method provided in an embodiment of this application;
FIG. 19 is a structural diagram of an image processing apparatus provided in an embodiment of this application;
FIG. 20 is a structural diagram of a terminal provided in an embodiment of this application;
FIG. 21 is a structural diagram of a server provided in an embodiment of this application.
Detailed Description
Embodiments of this application are described below with reference to the accompanying drawings.
Image classification is the problem of describing the category of image content; for example, if the image content is an elephant, the model or algorithm needs to recognize it as an elephant. Broadly speaking, image classification may include classifying image content, recognizing text in images, checking whether images are compliant, and so on. Classifying image content may be, for example, recognizing which category an object in an image belongs to, where categories may include bird, ball, car, etc. Recognizing text in an image may be identifying which characters the image contains; in that case each known character can be treated as a category, so that recognizing which category the characters in the image belong to identifies the text. When checking whether an image is compliant, the categories may be, for example, compliant and non-compliant, and the task is to recognize which category the image belongs to.
To classify images, transformer-based solutions are currently used to learn image features of the image to be processed, and the classification result is then determined based on the learned features. As shown in FIG. 1, the image to be processed is input into the encoder of the Transformer framework, which may be called the Transformer Encoder; the Transformer Encoder may be network layers containing multiple self-attention mechanisms. Feature learning is performed on the image by the Transformer Encoder, and based on the learned image features a classifier produces the classification result. In FIG. 1 the classifier may be an MLP Head, a layer structure used for classification; categories may include bird, ball, car, etc., and the classification result may be determined from the probability values output for each category.
However, transformer solutions usually learn image features only with a global self-attention method, which does not emphasize learning local image features. If an object in the image to be processed is similar in appearance to objects in other images, it is difficult to classify the image accurately in this way; that is, such solutions lack the capability to classify images whose objects have similar appearance features, and the classification effect is poor.
To solve the above technical problem, an embodiment of this application provides an image processing method that fuses multiple kinds of features within the same network layer to obtain the final image features, so that local and global features can be learned simultaneously. This solves the problem caused by the transformer solution's neglect of local features, improves the fusion capability between multiple kinds of features, and solves the non-intelligent feature fusion of the layered feature fusion solution. Thus, when the classification module obtains the classification result by prediction based on the image features, since local and global features are fused into the image features, the image to be processed can be classified accurately even when objects have similar appearance features, improving classification capability and effect.
It should be noted that the image processing method provided in the embodiments of this application is applicable to various image classification scenarios, including classifying image content, recognizing text in images, checking whether images are compliant, and video classification (where the image to be processed may be an image frame of a video, and the classification result of the video is obtained by processing that frame), etc.
The image processing method provided in the embodiments of this application may be executed by a computer device, which may be, for example, a server or a terminal. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. Terminals include but are not limited to smartphones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, etc.
It should be noted that the method provided in the embodiments of this application can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving.
Referring to FIG. 2, FIG. 2 shows an architecture diagram of an application scenario of the image processing method provided in an embodiment of this application, described taking a server executing the method as an example.
The application scenario may include a server 200. When an image to be processed needs to be classified, the server 200 may acquire the image. The image to be processed is an image that needs to be classified; it may be a directly captured image, an existing image stored by the server 200, an image frame obtained by splitting a video into frames, and so on. The image to be processed may include an object, which may be any of various things, animals, plants, characters, etc.
The server 200 may run an image classification model, which may include a feature mapping module and a classification module; the feature mapping module includes network units (blocks), and a network unit includes network layers (layers). The server 200 may thus perform vectorization based on the image to be processed through the image classification model to obtain the image representation vector of the image, and then perform feature mapping on the image representation vector by the network units to obtain the image features of the image.
In the process of obtaining the image features by the network units, in one and the same network layer of a network unit, the server 200 performs, by that network layer, global feature mapping and local feature mapping respectively on the input content derived from the image representation vector, obtaining global features and local features; the network layer fuses them to obtain the fused features corresponding to that layer, and the image features are obtained based on those fused features. Because multiple kinds of features are fused within the same network layer to obtain the final image features, local and global features can be learned simultaneously, solving the problem caused by the transformer solution's neglect of local features and improving the fusion capability between multiple kinds of features.
Afterwards, the server 200 may obtain the classification result by prediction based on the image features through the classification module. Since local and global features are fused into the image features, the image to be processed can be classified accurately even when objects have similar appearance features, improving classification capability and classification effect.
It should be noted that the method provided in the embodiments of this application mainly involves Computer Vision (CV) technology and Machine Learning (ML) technology in the field of artificial intelligence.
Next, taking a server executing the image processing method as an example, the image processing method provided in the embodiments of this application is described in detail with reference to the accompanying drawings. Referring to FIG. 3, FIG. 3 shows a flowchart of an image processing method, the method including:
S301: Acquire an image to be processed.
When an image, e.g. an image to be processed, needs to be classified, the server may acquire that image.
It can be understood that, in the embodiments of this application, the image to be processed acquired by the server needs to be input into the image classification model so that it can be classified. Normally, an image input into the image classification model may be set to a preset resolution (e.g. 224*224) according to the business. In this case, the image acquired by the server may be an image whose resolution equals the preset resolution.
However, in some cases the resolution of the original image acquired by the server may differ from the preset resolution. In that case the server may acquire the image to be processed as follows: after acquiring the original image, the server may first determine whether its resolution matches the preset resolution; if so, the server may use the original image directly as the image to be processed; if not, the server may resize the original image so that its resolution equals the preset resolution, and then use the resized original image as the image to be processed.
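By way of a non-limiting illustration, this acquisition step might be sketched as follows (assuming Python with PIL; the preset resolution of 224*224 is taken from the example above, and the function name is purely illustrative):

```python
from PIL import Image

PRESET_RESOLUTION = (224, 224)  # assumed preset resolution from the example above

def load_image_to_process(path: str) -> Image.Image:
    """Load an original image and resize it to the preset resolution if needed."""
    image = Image.open(path).convert("RGB")
    if image.size != PRESET_RESOLUTION:
        # resolution differs from the preset one, so resize the original image
        image = image.resize(PRESET_RESOLUTION, Image.BILINEAR)
    return image
```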
S302: Perform vectorization based on the image to be processed to obtain an image representation vector of the image to be processed.
After acquiring the image to be processed, the server may input it into the image classification model, which performs vectorization based on the image to obtain the image representation vector. Alternatively, the server may use a related image-to-vector conversion model to vectorize the image and obtain the image representation vector; such a model converts an image into a corresponding image representation vector and is independent of the image classification model. The embodiments of this application impose no limitation in this respect.
The image classification model is a neural network model; it may be, for example, a Transformer model, or another model capable of implementing the method provided in the embodiments of this application; the embodiments of this application do not limit the structure of the image classification model. The image classification model may be obtained by training in advance. During training, training samples are first acquired; they are processed in a manner similar to the processing of the image to be processed in the embodiments of this application, except that the model may need to be optimized and adjusted based on the prediction results until an image classification model that meets the requirements is obtained.
The Transformer model is a language model based on the self-attention mechanism; its core is self-attention, which can learn the intrinsic dependency relationships within data. As shown in FIG. 4, which illustrates learning the feature relationships between words in a language sequence, the language sequence may be "The animal didn't cross the street because it was too tired."; for each word in the sequence, the feature relationships between words are learned, and these relationships may be the intrinsic dependencies between words (i.e. data) learned by the Transformer model using the self-attention mechanism.
When applying the Transformer model to the image field, to suit the model's core characteristics, an image (e.g. the image to be processed) can be split into several image blocks (patches), and each patch is then treated as a unit of the image block sequence, so that the feature relationships between units of the sequence are learned. Referring to FIG. 5, the image to be processed in FIG. 5 is split into 3*3 patches, which are arranged in order into an image block sequence, and the feature relationships between the patches in that sequence are then learned.
Based on the above principle of applying the Transformer model to the image field, vectorization based on the image to be processed to obtain the image representation vector may be performed as follows: the image to be processed is split into multiple image blocks according to a block size (patch_size); data structure mapping is performed on the image blocks to obtain one-dimensional structured data of the image to be processed; and vectorization is performed on the one-dimensional structured data to obtain the image representation vector.
It should be noted that patch_size is the size of each image block and may be preset according to actual needs. Normally, patch_size should divide the preset resolution evenly; for example, with a preset resolution of 224*224 the patch size may be 8, 14, 16, etc. If patch_size is 16, the image to be processed is split into 14*14 patches in total. In one possible implementation of the embodiments of this application, the image to be processed may be split into 56*56 patches with a patch_size of 4.
It can be understood that after splitting the image to be processed, multiple patches are obtained, and together they constitute two-dimensional structured data. In one possible implementation, some image classification models, especially Transformer models, are better suited to processing one-dimensional structured data, so data structure mapping may be performed on the patches to map the two-dimensional structured data into one-dimensional structured data suitable for the image classification model.
Data structure mapping may be done in multiple ways, for example by Flatten processing or by Linear Projection. Flatten may apply one Flatten operation to each patch after splitting, mapping it from two-dimensional structured data into one-dimensional structured data suitable for the image classification model. Referring to FIG. 6, which takes an image split into 3*3 patches as an example (see 601 in FIG. 6, where each patch may be labelled with a number 0, 1, 2, ..., 7, 8), one Flatten operation is applied to each patch so that the 3*3 patches are arranged in order into a sequence, yielding the one-dimensional structured data shown at 602 in FIG. 6.
In the embodiments of this application, with the patch count configured as 56*56, a single patch of size 4*4, and a 3-channel color input image, the two-dimensional structured data of a patch may be [3,4,4]; after Flatten processing it is mapped to the one-dimensional structured data [128], whose length may be set as required by the business.
Linear Projection processing may refer to mapping C*H*W (channels * height * width) multi-dimensional structured data into one-dimensional structured data for subsequent learning.
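By way of a non-limiting sketch, the splitting / Flatten / Linear Projection steps above might be written as follows (assuming Python with PyTorch; the class name and the use of a single linear layer as the projection are illustrative assumptions, with 56*56 patches of size 4 and an embedding length of 128 taken from the example above):

```python
import torch
import torch.nn as nn

class PatchFlatten(nn.Module):
    """Split an image into patches, flatten each patch, and linearly project it."""

    def __init__(self, patch_size: int = 4, in_channels: int = 3, embed_dim: int = 128):
        super().__init__()
        self.patch_size = patch_size
        # each flattened patch has in_channels * patch_size * patch_size values
        self.proj = nn.Linear(in_channels * patch_size * patch_size, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch_size
        # [B, C, H, W] -> [B, C, H/p, p, W/p, p] -> [B, H/p * W/p, C*p*p]
        x = x.reshape(b, c, h // p, p, w // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, (h // p) * (w // p), c * p * p)
        return self.proj(x)  # one 128-length vector per patch

img = torch.randn(1, 3, 224, 224)
tokens = PatchFlatten()(img)
assert tokens.shape == (1, 56 * 56, 128)
```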
It should be noted that, for any language, the positions and order of characters in a sentence are very important: they are not only part of the grammatical structure of a sentence but also important concepts for expressing its semantics. If a character appears in a different position or order in a sentence, the meaning of the whole sentence may shift. Therefore, when the Transformer model is applied to language models, determining a position encoding that reflects a word's position in the sentence is a very important strategy. On this basis, when the Transformer model is applied to the image field, i.e. to the method provided in the embodiments of this application, position encoding should still be taken into account for image classification; that is, the position encoding of image blocks remains an important strategy for the image classification task.
To this end, in one possible implementation, the position vectors corresponding to each of the image blocks may also be acquired. Then, when vectorizing the one-dimensional structured data to obtain the image representation vector, the one-dimensional structured data may be vectorized to obtain the block vector corresponding to each image block (obtained by patch embedding through a patch embedding module), and the image representation vector is then obtained based on the block vector and the position vector of each image block (the position vector being obtained by pos embedding through a pos embedding module).
By taking into account the positions of the image blocks within the image to be processed and deriving the image representation vector from the position vectors, the position information of the image blocks in the image can be obtained during subsequent feature mapping, so that the information on which subsequent classification relies is richer, improving classification capability and accuracy.
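Continuing the sketch above, patch embedding and pos embedding might be combined as follows (a learnable position table is one common choice and is an assumption here, since the embodiments do not prescribe a particular encoding):

```python
class PatchAndPosEmbedding(nn.Module):
    """Combine patch embedding with a learnable position embedding."""

    def __init__(self, num_patches: int = 56 * 56, embed_dim: int = 128):
        super().__init__()
        self.patch_embed = PatchFlatten(embed_dim=embed_dim)  # from the sketch above
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # image representation vector = block vectors + position vectors
        return self.patch_embed(x) + self.pos_embed
```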
It should be understood that, in the embodiments of this application, the image classification model includes a feature mapping module and a classification module; the feature mapping module may include network units (blocks), and one network unit may include network layers (layers).
In the embodiments of this application, one block may include at least one layer. A neural network needs to perform feature mapping repeatedly, from shallow to deep; each feature mapping may be called a layer. Referring to FIG. 7, FIG. 7 shows a neural network including 4 layers.
The feature mapping module may include at least one block. In some cases, the objects in the image to be processed may be large or small; referring to FIG. 8, the objects in the images shown in (a), (b) and (c) of FIG. 8 are birds, and the birds in (a), (b) and (c) differ in size. To adapt to objects of different sizes and avoid the information loss caused by learning at a single scale, and also because human observation and recognition of an image involves a multi-scale process from whole to part or from part to whole, the feature mapping module commonly used in the embodiments of this application may include multiple blocks, so that recognition and learning of the image to be processed can be performed at different scales and features of different scales can be learned, with each block corresponding to one scale. For example, if the resolution of the image to be processed is 224*224, features can be learned at the 56*56, 28*28 and 14*14 scales respectively.
Based on the above introduction of blocks and layers, to improve the classification capability of the image classification model, in one possible implementation the feature mapping module of the image classification model may include multiple blocks, each containing multiple layers. A typical image classification model structure is shown in FIG. 9: this model contains four network units, and the image resolutions processed by the blocks are 1/4, 1/8, 1/16 and 1/32 respectively. Each block may contain a different number of layers according to its own network characteristics.
S303: Perform feature mapping on the image representation vector by a network unit included in the feature mapping module of the image classification model, to obtain image features of the image to be processed.
The server may perform feature mapping on the image representation vector through the feature mapping module to obtain the image features of the image to be processed; specifically, feature mapping may be performed on the image representation vector by the network units included in the feature mapping module to obtain the image features.
In one possible implementation, to learn image features at different scales, the feature mapping module may include multiple network units, each followed by a down-sample module. Each network unit performs feature mapping to obtain a feature map at the corresponding scale, and the down-sample module reduces the scale of the feature map output by the connected network unit to obtain a feature map at another scale, so that the next network unit can learn features at that scale and produce a feature map. The number of network units may be determined by actual business needs; the embodiments of this application do not limit it. Next, taking a feature mapping module including two network units (e.g. a first network unit and a second network unit) as an example, the process of learning image features at different scales through multiple network units is described.
In this case, feature mapping may be performed on the image representation vector by the first network unit to obtain a first feature map of the image to be processed, and down-sampling is then performed on the first feature map by the down-sample module connected to the first network unit, yielding a feature map of a first scale. Next, feature mapping is performed on the first-scale feature map by the second network unit to obtain a second feature map of the image to be processed, and down-sampling is performed on the second feature map by the down-sample module connected to the second network unit, yielding a feature map of a second scale, from which the image features of the image to be processed are obtained. When the feature mapping module includes further network units, after the second-scale feature map is obtained the above procedure is repeated in turn until the last network unit and its connected down-sample module finish processing, yielding the final image features.
A feature map may also be called a feature mapping map; data obtained after processing/mapping data (e.g. the image representation vector of the embodiments of this application) in some way is generally called a feature map. A neural network is an algorithmic system that maps high-dimensional data into low-dimensional data through repeated feature mapping. A color image of size 1024*1024 (e.g. the image to be processed) has an original dimensionality of 3*1024*1024; if 100 categories are set, then after layer-by-layer feature mapping it finally becomes 100-dimensional, where the value of each dimension corresponds to the probability value of one category, reflecting the probability that the image is judged to belong to that category.
Specifically, the feature mapping module consists of 1 to n blocks (typically 4, namely block1, block2, block3 and block4), each of which contains a varying number of identically structured layers. To learn image features at different scales, each block is followed by a down-sample module that halves the size of the feature map. Taking 4 blocks as an example, the size transformations in the process of obtaining the image features are as follows: the size of the image to be processed is 3*224*224; after splitting into patches the size is 3*56*56*4*4; after patch embedding and pos embedding the size is 56*56*128; after block1 and its down-sample module the feature map size is 28*28*256; after block2 and its down-sample module, 14*14*512; after block3 and its down-sample module, 7*7*1024; and after block4 and its down-sample module, 1*1024.
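Continuing the sketches above, the size progression just described can be checked in a few lines (the strided-convolution down-sample and the nn.Identity placeholders standing in for the blocks are illustrative assumptions, not the disclosed structure):

```python
class Downsample(nn.Module):
    """Halve the spatial size of a feature map and double its channels."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reduce(x)

# Size progression from the example above (the blocks themselves keep the
# size; they are represented by nn.Identity placeholders for brevity):
x = torch.randn(1, 128, 56, 56)                  # after patch/pos embedding
for in_dim, out_dim in [(128, 256), (256, 512), (512, 1024)]:
    x = nn.Identity()(x)                         # block1 / block2 / block3
    x = Downsample(in_dim, out_dim)(x)           # 56->28->14->7
x = nn.Identity()(x)                             # block4
x = x.mean(dim=(2, 3))                           # 7*7*1024 -> 1*1024
assert x.shape == (1, 1024)
```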
By using multiple network units, the embodiments of this application can learn image features at multiple scales, so that objects of different sizes in the image to be processed can be recognized accurately, improving classification and recognition capability.
S304: In the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, perform global feature mapping on input content by the network layer to obtain global features, and perform local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; perform feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtain the image features based on the fused features corresponding to the network layer.
In the embodiments of this application, the way a layer processes its input content is the core of the embodiments. Within the same network layer, multiple kinds of features may be extracted from the input content in multiple ways, these features are then fused into fused features, and the final image features are obtained based on the fused features corresponding to that layer. For example, global feature mapping may be performed on the input content by the network layer to obtain global features, and local feature mapping may be performed on the input content by the network layer to obtain local features; the global and local features are then fused to obtain the fused features corresponding to the network layer, and the image features are obtained based on those fused features.
The input content may be derived from the image representation vector. When the network layer is the first layer of the first network unit, its input content may be the image representation vector itself; when it is any other layer, since the image representation vector has already been processed by the layers preceding it, its input content may be the result produced by those preceding layers, for example the feature map they output.
Global feature mapping may be implemented by any method capable of producing global features; for example, it may be implemented by a global-attention mechanism, which may be self-attention. In self-attention, every unit learns feature relationships with all other units, so self-attention can be regarded as a global attention mechanism. Referring to FIG. 5, each image block in FIG. 5 can serve as a unit, so that for each image block, feature relationships between that block and all the other blocks are learned.
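A minimal single-head sketch of such a global self-attention branch follows (continuing the PyTorch sketches above; multi-head attention and normalization are omitted for clarity):

```python
import torch.nn.functional as F

class GlobalSelfAttention(nn.Module):
    """Plain (global) self-attention: every token attends to all tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, C] where N is the number of image blocks (patches)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # each patch aggregates features from all patches
```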
Local feature mapping may be implemented by any method capable of producing local features; for example, it may be implemented by a local-attention mechanism. In some cases the number of units in the sequence is particularly large; to reduce computing resource consumption, the sequence is divided into groups, self-attention is applied within each group, and the model parameters are shared among the groups. This mechanism of learning feature relationships only within a local region is called local-attention. As shown in FIG. 10, the whole image (e.g. the image to be processed) is divided into four blocks, shown at 1001, 1002, 1003 and 1004 in FIG. 10, and each block is in turn divided into multiple image blocks; for example the block at 1001 is divided into 4*4 image blocks, and the other three blocks are similar. Self-attention is applied to the multiple image blocks within each block.
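A matching sketch of the local branch restricts attention to non-overlapping groups with shared parameters (the window size of 4 follows the 4*4 example of FIG. 10 and is an illustrative choice):

```python
class LocalWindowAttention(nn.Module):
    """Local attention: self-attention restricted to non-overlapping windows."""

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.window = window
        self.attn = GlobalSelfAttention(dim)  # parameters shared across groups

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape  # n == h * w patches
        s = self.window
        # group the h*w patch grid into (h/s)*(w/s) windows of s*s patches
        x = x.reshape(b, h // s, s, w // s, s, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, s * s, c)          # one row per window
        x = self.attn(x)                     # attention inside each window only
        x = x.reshape(b, h // s, w // s, s, s, c).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, n, c)
```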
It should be noted that, compared with the layered feature fusion solution in the related art: the layered solution learns image features in two ways, global-attention and local-attention, with every network layer specializing in one kind of feature and different layers alternating between them. As shown in FIG. 11, network layer 1 learns global features with a global attention mechanism and network layer 2 learns local features with a local attention mechanism. Because local-attention and global-attention are used alternately, the fusion capability between the two kinds of image features is weak. Moreover, the semantic range relied upon differs across spatial positions: some features are only local while others are global. In the layered solution each round of feature learning is entirely global or entirely local, with the result that layers learning local features ignore global information while layers learning global features ignore local information.
The embodiments of this application, by contrast, can obtain multiple kinds of features, e.g. global and local features, within the same network layer and fuse them in one and the same layer, solving the transformer's neglect of local features as well as the layered fusion solution's problems of non-intelligent fusion, weak fusion capability, and incomplete feature learning.
It should be noted that, besides local and global features, the multiple kinds of features learned in the same network layer in the embodiments of this application may also include other features; the embodiments of this application do not limit the number or kinds of features learned in the same layer. In one possible implementation, the features learned in the same network layer may also include convolutional features. Specifically, convolutional feature mapping may be performed on the image representation vector by the network layer to obtain convolutional features, and the network layer then fuses the global features, the local features and the convolutional features to obtain the fused features corresponding to that layer.
Convolutional feature mapping may be implemented by a Convolutional Neural Network (CNN). Convolutional neural networks are a class of feedforward neural networks with a deep structure that involve convolution or related computations. Their core is the convolution operator: the features obtained by processing an image with a convolution kernel are called convolutional features (i.e. CNN features), which learn local dependency relationships. Referring to FIG. 12, 1201 is the image to be processed; processing it with the convolution kernel shown at 1202 yields the convolutional features shown at 1203. As can be seen in FIG. 12, each feature unit (token) relates to the 8 units surrounding it.
Based on the above, taking the case where the three kinds of features above are obtained in the same network layer as an example, the processing flow for determining the fused features may be as shown in FIG. 13. In FIG. 13, for the image to be processed, convolutional features are obtained through a convolutional neural network, global features through a global attention mechanism, and local features through a local attention mechanism; the convolutional, global and local features are then fused into the fused features.
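Putting the sketches above together, one layer of the kind shown in FIG. 13 might look as follows (the depthwise 3x3 convolution for the convolutional branch and the per-patch weight estimator, which anticipates the soft fusion described next, are illustrative assumptions; residual connections and normalization are omitted):

```python
class MixedFeatureLayer(nn.Module):
    """One network layer learning conv, global and local features and fusing them."""

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.global_branch = GlobalSelfAttention(dim)
        self.local_branch = LocalWindowAttention(dim, window)
        # depthwise 3x3 convolution as the convolutional-feature branch
        self.conv_branch = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.weight_fc = nn.Linear(dim, 3)  # feature dimension -> 3 weights

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape
        g_feat = self.global_branch(x)
        l_feat = self.local_branch(x, h, w)
        c_feat = self.conv_branch(x.transpose(1, 2).reshape(b, c, h, w))
        c_feat = c_feat.reshape(b, c, n).transpose(1, 2)
        added = g_feat + l_feat + c_feat                # per-patch added features
        weights = F.softmax(self.weight_fc(added), dim=-1)   # [B, N, 3], sums to 1
        stacked = torch.stack([g_feat, l_feat, c_feat], dim=-1)  # [B, N, C, 3]
        return (stacked * weights.unsqueeze(2)).sum(-1)       # weighted summation
```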
It can be understood that the network layer may fuse features into fused features in multiple ways. In one possible implementation, feature fusion may be a weighted summation of the multiple kinds of features. Normally, the weight value corresponding to each kind of feature may be computed from the features, or may be preset. In one possible case, the weight values of the multiple kinds of features may sum to 1.
Taking the case where the features obtained in the same network layer include global features and local features as an example, the network layer may fuse the global and local features into the fused features corresponding to the layer as follows: the weight values corresponding to the global features and the local features are determined, the sum of the two weight values being 1, and the global and local features are then weightedly summed according to the weight values to obtain the fused features.
When the multiple kinds of features also include convolutional features, the weight values corresponding to the convolutional, global and local features are determined, the sum of the three weight values being 1, and the convolutional, global and local features are weightedly summed according to the weight values to obtain the fused features.
It should be noted that feature fusion may differ depending on the distribution of the weight values of the multiple features. One way is soft fusion, i.e. weighted summation of the multiple features using a vector whose components sum to 1 (each component serving as the weight value of one feature); for example, with three features the weight vector may be [0.2,0.5,0.3], and with two features it may be [0.4,0.6].
However, in some cases, among the multiple features obtained by the same network layer, some may be invalid or even harmful. Referring to FIG. 14, which illustrates valid and invalid features: the classification target of the image is the dog in the image, and the image features that help recognize the dog lie essentially within the dashed circle, while features outside it are essentially useless or harmful (they can make the model ambiguous). The dashed rectangle in FIG. 14 is also a region that the model, after computation, considers worth attending to, but this region obviously does not help recognize the dog, so its features are also invalid and should be prevented from being passed to the next network layer.
To avoid carrying invalid or harmful features into the next network layer, i.e. to avoid passing some invalid or harmful features onward, such features can be discarded completely so that they are not passed to the next layer. On this basis, another way of feature fusion is to set the weight value of one feature to 1 and the remaining weight values to 0. Taking the case where the features include global and local features as an example, one of the weight values of the global features and the local features is 1 and the rest are 0. This way of fusion may be called hard fusion. Hard fusion is also a weighted summation; the difference from soft fusion is that the weight vector is in one-hot form, i.e. exactly one component is 1 and the rest are 0, e.g. [0,1,0].
In one possible implementation, the weight values corresponding to the global features and the local features may be determined as follows: the global and local features are added to obtain added features, and probability estimation is then performed based on the added features by a probability estimation module to obtain the weight values corresponding to the global features and the local features respectively.
When the multiple kinds of features also include convolutional features, the convolutional, global and local features may be added to obtain the added features, and probability estimation is then performed based on the added features by the probability estimation module to obtain the weight values corresponding to the convolutional, global and local features respectively. For details see FIG. 15: when adding the convolutional, global and local features, the addition may be performed per image block (patch); the addition for each patch yields the added features corresponding to that patch, which together constitute the final added features. A fully connected layer may then be used to reduce the feature dimension to 3, and the probability estimation module performs probability estimation, yielding the weight values corresponding to the convolutional, global and local features respectively.
A fully connected layer is generally used at the end of a classification network to map the feature dimension to the number of categories, followed by a probability estimation module for probability estimation. For example, if the feature dimension finally output by feature mapping is 1024 and there are 100 categories to classify, the fully connected layer can map the 1024-length features into 100-length features, and the probability estimation module then estimates the probability distribution of the image over the 100 categories, where the category with the largest probability value is the category judged by the network. Referring to FIG. 16, the fully connected layer is traversed, the black parts represent the corresponding found features (e.g. the cat's head, feet, body, tail and legs in FIG. 16), the features in the figure are combined and output to the output layer, and classification then concludes that this is a cat. It can be understood that using a fully connected layer in weight value determination follows a similar principle, which is not detailed again here.
It can be understood that the probability estimation module differs according to the fusion method used. With soft fusion, the probability estimation module may be implemented by the softmax function: probability estimation by the softmax function yields a soft probability distribution summing to 1, and each probability value can serve as the weight value of the corresponding feature. The softmax function is a normalized exponential function, generally used after a fully connected layer to estimate the category probability distribution. In the embodiments of this application, when fusing the various features, the softmax function is likewise used to perform probability estimation on the different features; fusing features with the softmax function is called soft fusion. For an illustration of the principle of the softmax function see FIG. 17: with inputs z1, z2, z3, internal processing outputs y1, y2, y3 with y1=0.88, y2=0.12, y3=0, where 1>yi>0 for i=1, 2, 3, and y1, y2 and y3 sum to 1.
Specifically, yi may be calculated with the following formula: yi = e^(zi) / (e^(z1) + e^(z2) + e^(z3)), i = 1, 2, 3, where e^(z1), e^(z2) and e^(z3) may be intermediate processing parameters of the softmax function.
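A quick check of the FIG. 17 example (the inputs z1, z2, z3 below are assumed values chosen to approximately reproduce the stated outputs; the figure itself does not state them):

```python
import math

z = [4.0, 2.0, -6.0]                      # assumed inputs z1, z2, z3
exp_z = [math.exp(v) for v in z]          # intermediate parameters e^z1, e^z2, e^z3
y = [e / sum(exp_z) for e in exp_z]
print([round(v, 2) for v in y])           # [0.88, 0.12, 0.0], summing to 1
```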
With hard fusion, the probability estimation module may be implemented by the gumbel-softmax function: probability estimation by the gumbel-softmax function yields a one-hot probability distribution, i.e. exactly one probability value is 1 and the rest are 0, and each probability value can serve as the weight value of the corresponding feature.
The probability values estimated by gumbel-softmax form a probability distribution summing to 1, and the distribution is in one-hot form, i.e. one probability value is 1 and all the others are 0. It can be understood as an upgraded version of softmax: its weighted sum is still 1, only with all the energy concentrated on a single probability value.
Referring to FIG. 15, the probability values obtained in FIG. 15 are 0, 1 and 0 respectively; according to their order and the order of the multiple input features in FIG. 15, it can be determined that the weight values of the convolutional features and the local features are 0 and the weight value of the global features is 1.
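Swapping the softmax in the fusion sketch above for gumbel-softmax gives the hard variant; in PyTorch this might look as follows (the logits are illustrative, and the selected component varies across calls because gumbel-softmax samples random noise):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, 3.1, 0.4]])  # illustrative per-feature scores
# hard=True yields a one-hot weight vector (e.g. [0., 1., 0.]) while staying
# differentiable for training via the straight-through estimator
hard_weights = F.gumbel_softmax(logits, tau=1.0, hard=True)
print(hard_weights, hard_weights.sum())   # one component is 1, the rest are 0
```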
S305: Perform category prediction based on the image features by the classification module of the image classification model, to obtain the classification result of the image to be processed.
The server may perform category prediction through the classification module of the image classification model based on the finally obtained image features to obtain the classification result. In one possible implementation, the image classification model may further include a fully connected layer; in that case the classification result may be obtained by performing a fully connected computation on the image features through the fully connected layer to map the image features to the classification-count length, and then predicting through the classification module based on the image features of the classification-count length to obtain the classification result.
The classification-count length may depend on the specific business; normally 1000 is a common classification count (i.e. number of categories), so 1*1000 may be used as the classification-count length. The classification module may be implemented based on the softmax function: the softmax computation yields a probability value for each category, and the category with the largest probability value is determined as the classification result.
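A minimal sketch of this classification head follows (the 1024-dimensional image features and 1000 categories are taken from the examples above):

```python
class ClassificationHead(nn.Module):
    """Map image features to class probabilities and predict the category."""

    def __init__(self, feat_dim: int = 1024, num_classes: int = 1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)  # 1024 -> classification-count length

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.fc(feats), dim=-1)   # probability per category
        return probs.argmax(dim=-1)                 # category with the max probability
```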
The method provided in the embodiments of this application outperforms the related art on image classification evaluation datasets commonly used in the industry, with stable improvement across model parameter scales. The method can also be used in Optical Character Recognition (OCR) products, where it shows stable improvement on Chinese and English text, both printed and handwritten. A comparison of the effect of the method provided in the embodiments of this application against the related art is shown in Table 1:
Table 1
In Table 1, columns 2 to 5 from the left are the metric values of the related art under four metrics, and columns 6 to 9 are the metric values of the method provided in the embodiments of this application under the same four metrics. From the metric values of the two solutions under the same metric, it can be seen that the method provided in the embodiments of this application improves the classification effect compared with the related art.
As can be seen from the above technical solutions, when an image to be processed needs to be classified, vectorization can be performed based on the image to obtain its image representation vector. Feature mapping is then performed on the image representation vector by the network units included in the feature mapping module of the image classification model to obtain the image features of the image. In this process, in one and the same network layer of a network unit, global feature mapping and local feature mapping are respectively performed by the network layer on input content derived from the image representation vector, yielding global and local features, which the network layer fuses into the fused features corresponding to that layer; the image features are obtained based on those fused features. Because multiple kinds of features are fused within the same network layer to obtain the final image features, local and global features can be learned simultaneously, solving the problem caused by the transformer solution's neglect of local features and improving the fusion capability between multiple kinds of features. Thus, when the classification module performs category prediction based on the image features to obtain the classification result, since local and global features are fused into the image features, the image to be processed can be classified accurately even for images whose objects have similar appearance features, improving classification capability and classification effect.
Based on the foregoing introduction to the method provided in the embodiments of this application, the method is now described taking an image classification model of a specific structure as an example. The image classification model includes a patch embedding module, a pos embedding module, a feature mapping module, a fully connected layer and a classification module. The feature mapping module includes 4 network units, e.g. network unit 1, network unit 2, network unit 3 and network unit 4, each followed by a down-sample module; the classification module may be implemented by the softmax function, as shown in FIG. 18.
Based on the image classification model shown in FIG. 18, when the image to be processed is acquired, it can be split into multiple image blocks, and data structure mapping is performed on the patches to map the two-dimensional structured data into one-dimensional structured data suitable for the image classification model. Patch embedding is then performed on the one-dimensional structured data by the patch embedding module to obtain the block vectors, pos embedding is performed by the pos embedding module to obtain the position vectors, and the final image representation vector is obtained based on the position vectors and block vectors. The image representation vector is input into the feature mapping module and processed by network units 1, 2, 3 and 4 and their respective down-sample modules to obtain the final image features. It should be noted that the way feature mapping is performed within one and the same network layer of a network unit can be seen in FIG. 13 and the corresponding introduction, which is not repeated here. The image features then undergo a fully connected computation through the fully connected layer, mapping them to the classification-count length; based on the image features of the classification-count length, the softmax function estimates the probability values, which may be, as shown in FIG. 18, 0.1, 0.1, 0.7 and 0.1 respectively, and the classification result is then obtained from the probability values, for example by taking the category with the largest probability value as the classification result.
It should be noted that, on the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Based on the image processing method provided in the embodiment corresponding to FIG. 3, an embodiment of this application further provides an image processing apparatus 1900. Referring to FIG. 19, the image processing apparatus 1900 includes an acquisition unit 1901, a determination unit 1902, a mapping unit 1903 and a prediction unit 1904:
the acquisition unit 1901 is configured to acquire an image to be processed;
the determination unit 1902 is configured to perform vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
the mapping unit 1903 is configured to perform feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
the mapping unit 1903 is specifically configured to, in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, perform global feature mapping on input content by the network layer to obtain global features, and perform local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; perform feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtain the image features based on the fused features corresponding to the network layer; and
the prediction unit 1904 is configured to perform category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
In one possible implementation, the mapping unit 1903 is specifically configured to:
determine the weight values corresponding to the global features and the local features respectively, the sum of the weight value of the global features and the weight value of the local features being 1; and
perform weighted summation on the global features and the local features according to the weight values to obtain the fused features.
In one possible implementation, one of the weight value of the global features and the weight value of the local features is 1 and the remaining weight values are 0.
In one possible implementation, the mapping unit 1903 is specifically configured to:
add the global features and the local features to obtain added features; and
perform probability estimation based on the added features by a probability estimation module to obtain the weight values corresponding to the global features and the local features respectively.
In one possible implementation, the mapping unit 1903 is further configured to:
perform convolutional feature mapping on the image representation vector by the network layer to obtain convolutional features; and
perform feature fusion on the global features, the local features and the convolutional features by the network layer to obtain the fused features corresponding to the network layer.
In one possible implementation, the feature mapping module includes multiple network units, each followed by a down-sample module, the multiple network units including a first network unit and a second network unit, and the mapping unit 1903 is specifically configured to:
perform feature mapping on the image representation vector by the first network unit to obtain a first feature map of the image to be processed;
perform down-sampling on the first feature map by the down-sample module connected to the first network unit to obtain a feature map of a first scale;
perform feature mapping on the feature map of the first scale by the second network unit to obtain a second feature map of the image to be processed;
perform down-sampling on the second feature map by the down-sample module connected to the second network unit to obtain a feature map of a second scale; and
obtain the image features of the image to be processed according to the feature map of the second scale.
In one possible implementation, the determination unit 1902 is specifically configured to:
split the image to be processed into multiple image blocks according to a block size;
perform data structure mapping on the multiple image blocks to obtain one-dimensional structured data of the image to be processed; and
perform vectorization on the one-dimensional structured data to obtain the image representation vector.
In one possible implementation, the acquisition unit 1901 is further configured to:
acquire the position vectors corresponding to each of the multiple image blocks; and
the determination unit 1902 is specifically configured to perform vectorization on the one-dimensional structured data to obtain the block vector corresponding to each image block, and to obtain the image representation vector based on the block vector and position vector corresponding to each image block.
In one possible implementation, the image classification model further includes a fully connected layer, and the prediction unit 1904 is specifically configured to:
perform a fully connected computation on the image features by the fully connected layer to map the image features to a classification-count length; and
perform category prediction by the classification module based on the image features of the classification-count length to obtain the classification result.
As can be seen from the above technical solutions, when an image to be processed needs to be classified, vectorization can be performed based on the image to obtain its image representation vector. Feature mapping is then performed on the image representation vector by the network units included in the feature mapping module of the image classification model to obtain the image features. In this process, in one and the same network layer of a network unit, global feature mapping and local feature mapping are respectively performed by the network layer on input content derived from the image representation vector, yielding global and local features, which the network layer fuses into the fused features corresponding to that layer; the image features are obtained based on those fused features. Because multiple kinds of features are fused within the same network layer to obtain the final image features, local and global features can be learned simultaneously, solving the problem caused by the transformer solution's neglect of local features and improving the fusion capability between multiple kinds of features. Thus, when the classification module obtains the classification result by prediction based on the image features, since local and global features are fused into the image features, the image to be processed can be classified accurately even for images whose objects have similar appearance features, improving classification capability and classification effect.
An embodiment of this application further provides a computer device that can execute the image processing method. The computer device may be, for example, a terminal; taking the terminal being a smartphone as an example:
FIG. 20 is a block diagram of part of the structure of a smartphone provided in an embodiment of this application. Referring to FIG. 20, the smartphone includes components such as a Radio Frequency (RF) circuit 2010, a memory 2020, an input unit 2030, a display unit 2040, a sensor 2050, an audio circuit 2060, a Wireless Fidelity (WiFi) module 2070, a processor 2080, and a power supply 2090. The input unit 2030 may include a touch panel 2031 and other input devices 2032; the display unit 2040 may include a display panel 2041; the audio circuit 2060 may include a speaker 2061 and a microphone 2062. It can be understood that the smartphone structure shown in FIG. 20 does not constitute a limitation on the smartphone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 2020 may be used to store software programs and modules, and the processor 2080 executes the various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 2020. The memory 2020 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, applications required for at least one function (such as a sound playback function, an image playback function, etc.), etc., and the data storage area may store data created according to the use of the smartphone (such as audio data, a phone book, etc.), etc. In addition, the memory 2020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 2080 is the control center of the smartphone; it connects the various parts of the entire smartphone using various interfaces and lines, and executes the various functions of the smartphone and processes data by running or executing the software programs and/or modules stored in the memory 2020 and calling the data stored in the memory 2020. Optionally, the processor 2080 may include one or more processing units; preferably, the processor 2080 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, etc., and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 2080.
In this embodiment, the processor 2080 in the smartphone may perform the following steps:
acquiring an image to be processed;
performing vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
performing feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, performing global feature mapping on input content by the network layer to obtain global features, and performing local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; performing feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtaining the image features based on the fused features corresponding to the network layer; and
performing category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
The computer device provided in the embodiments of this application may also be a server. Referring to FIG. 21, which is a structural diagram of a server 2100 provided in an embodiment of this application, the server 2100 may vary considerably in configuration or performance and may include one or more processors such as Central Processing Units (CPUs) 2122, a memory 2132, and one or more storage media 2130 (e.g. one or more mass storage devices) storing applications 2142 or data 2144. The memory 2132 and the storage medium 2130 may be transient or persistent storage. The program stored in the storage medium 2130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Furthermore, the central processing unit 2122 may be configured to communicate with the storage medium 2130 and execute, on the server 2100, the series of instruction operations in the storage medium 2130.
The server 2100 may further include one or more power supplies 2126, one or more wired or wireless network interfaces 2150, one or more input/output interfaces 2158, and/or one or more operating systems 2141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the central processing unit 2122 in the server 2100 may perform the following steps:
acquiring an image to be processed;
performing vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
performing feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, performing global feature mapping on input content by the network layer to obtain global features, and performing local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; performing feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtaining the image features based on the fused features corresponding to the network layer; and
performing category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
According to one aspect of this application, a computer-readable storage medium is provided, the computer-readable storage medium being configured to store program code for executing the image processing method described in the foregoing embodiments.
According to one aspect of this application, a computer program product is provided, the computer program product including a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, causing the computer device to execute the method provided in the various optional implementations of the above embodiments.
The descriptions of the flows or structures corresponding to the above drawings each have their own emphasis; for any part not detailed in one flow or structure, refer to the related descriptions of other flows or structures.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification of this application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of this application described here can be implemented, for example, in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device including a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to such process, method, product or device.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and there may be other ways of division in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of this application rather than limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (15)

  1. An image processing method, executed by a computer device, the method comprising:
    acquiring an image to be processed;
    performing vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
    performing feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
    in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, performing global feature mapping on input content by the network layer to obtain global features, and performing local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; performing feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtaining the image features based on the fused features corresponding to the network layer; and
    performing category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
  2. The method according to claim 1, wherein the performing feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer comprises:
    determining weight values corresponding to the global features and the local features respectively, a sum of the weight value of the global features and the weight value of the local features being 1; and
    performing weighted summation on the global features and the local features according to the weight values to obtain the fused features.
  3. The method according to claim 2, wherein one of the weight value of the global features and the weight value of the local features is 1 and the remaining weight values are 0.
  4. The method according to claim 2 or 3, wherein the determining weight values corresponding to the global features and the local features respectively comprises:
    adding the global features and the local features to obtain added features; and
    performing probability estimation based on the added features by a probability estimation module to obtain the weight values corresponding to the global features and the local features respectively.
  5. The method according to claim 1, further comprising:
    performing convolutional feature mapping on the image representation vector by the network layer to obtain convolutional features;
    wherein the performing feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer comprises:
    performing feature fusion on the global features, the local features and the convolutional features by the network layer to obtain the fused features corresponding to the network layer.
  6. The method according to any one of claims 1 to 5, wherein the feature mapping module comprises multiple network units each followed by a down-sample module, the multiple network units comprising a first network unit and a second network unit, and the performing feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model to obtain image features of the image to be processed comprises:
    performing feature mapping on the image representation vector by the first network unit to obtain a first feature map of the image to be processed;
    performing down-sampling on the first feature map by the down-sample module connected to the first network unit to obtain a feature map of a first scale;
    performing feature mapping on the feature map of the first scale by the second network unit to obtain a second feature map of the image to be processed;
    performing down-sampling on the second feature map by the down-sample module connected to the second network unit to obtain a feature map of a second scale; and
    obtaining the image features of the image to be processed according to the feature map of the second scale.
  7. The method according to any one of claims 1 to 6, wherein the performing vectorization based on the image to be processed to obtain an image representation vector of the image to be processed comprises:
    splitting the image to be processed into multiple image blocks according to a block size;
    performing data structure mapping on the multiple image blocks to obtain one-dimensional structured data of the image to be processed; and
    performing vectorization on the one-dimensional structured data to obtain the image representation vector.
  8. The method according to claim 7, further comprising:
    acquiring position vectors corresponding to each of the multiple image blocks;
    wherein the performing vectorization on the one-dimensional structured data to obtain the image representation vector comprises:
    performing vectorization on the one-dimensional structured data to obtain a block vector corresponding to each image block; and
    obtaining the image representation vector based on the block vector and the position vector corresponding to each image block.
  9. The method according to any one of claims 1 to 7, wherein the image classification model further comprises a fully connected layer, and the performing category prediction based on the image features by a classification module of the image classification model to obtain a classification result of the image to be processed comprises:
    performing a fully connected computation on the image features by the fully connected layer to map the image features to a classification-count length; and
    performing category prediction by the classification module based on the image features of the classification-count length to obtain the classification result.
  10. An image processing apparatus, the apparatus comprising an acquisition unit, a determination unit, a mapping unit and a prediction unit:
    the acquisition unit being configured to acquire an image to be processed;
    the determination unit being configured to perform vectorization based on the image to be processed to obtain an image representation vector of the image to be processed;
    the mapping unit being configured to perform feature mapping on the image representation vector by a network unit included in a feature mapping module of an image classification model, to obtain image features of the image to be processed;
    the mapping unit being specifically configured to, in the process of obtaining the image features by the network unit, in one and the same network layer of the network unit, perform global feature mapping on input content by the network layer to obtain global features, and perform local feature mapping on the input content by the network layer to obtain local features, the input content being derived from the image representation vector; perform feature fusion on the global features and the local features by the network layer to obtain fused features corresponding to the network layer; and obtain the image features based on the fused features corresponding to the network layer; and
    the prediction unit being configured to perform category prediction based on the image features by a classification module of the image classification model, to obtain a classification result of the image to be processed.
  11. The apparatus according to claim 10, wherein the mapping unit is specifically configured to:
    determine weight values corresponding to the global features and the local features respectively, a sum of the weight value of the global features and the weight value of the local features being 1; and
    perform weighted summation on the global features and the local features according to the weight values to obtain the fused features.
  12. The apparatus according to claim 11, wherein the mapping unit is specifically configured to:
    add the global features and the local features to obtain added features; and
    perform probability estimation based on the added features by a probability estimation module to obtain the weight values corresponding to the global features and the local features respectively.
  13. A computer device, the computer device comprising a processor and a memory:
    the memory being configured to store program code and transmit the program code to the processor; and
    the processor being configured to execute the method according to any one of claims 1 to 9 according to instructions in the program code.
  14. A computer-readable storage medium, the computer-readable storage medium being configured to store program code which, when executed by a processor, causes the processor to execute the method according to any one of claims 1 to 9.
  15. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
PCT/CN2023/108785 2022-09-26 2023-07-24 Image processing method and related apparatus WO2024066697A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211173745.9A CN117011569A (zh) 2022-09-26 2022-09-26 Image processing method and related apparatus
CN202211173745.9 2022-09-26

Publications (1)

Publication Number Publication Date
WO2024066697A1 true WO2024066697A1 (zh) 2024-04-04

Family

ID=88566108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/108785 WO2024066697A1 (zh) 2022-09-26 2023-07-24 Image processing method and related apparatus

Country Status (2)

Country Link
CN (1) CN117011569A (zh)
WO (1) WO2024066697A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359696A (zh) * 2018-10-29 2019-02-19 重庆中科云丛科技有限公司 Vehicle model recognition method, system and storage medium
US10552664B2 (en) * 2017-11-24 2020-02-04 International Business Machines Corporation Image feature classification and localization using discriminative representations for robotic surgical control
CN112699855A (zh) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Artificial intelligence-based image scene recognition method and apparatus, and electronic device
CN112784856A (zh) * 2021-01-29 2021-05-11 长沙理工大学 Channel attention feature extraction and recognition method for chest X-ray images


Also Published As

Publication number Publication date
CN117011569A (zh) 2023-11-07

Similar Documents

Publication Publication Date Title
CN113039563A (zh) Learning to generate synthetic datasets for training neural networks
CN111079570A (zh) Human body keypoint recognition method and apparatus, and electronic device
JP7395617B2 (ja) Three-dimensional mesh model reconstruction method and apparatus, device, and storage medium
CN113762309B (zh) Object matching method, apparatus and device
CN110222572A (zh) Tracking method and apparatus, electronic device, and storage medium
CN112668608B (zh) Image recognition method and apparatus, electronic device, and storage medium
CN115393854B (zh) Visual alignment processing method, terminal, and storage medium
CN112200041A (zh) Video action recognition method and apparatus, storage medium, and electronic device
CN115223020B (zh) Image processing method, apparatus and device, storage medium, and computer program product
CN114861842B (zh) Few-shot object detection method and apparatus, and electronic device
CN114219971A (zh) Data processing method, device, and computer-readable storage medium
CN114549369B (zh) Data restoration method and apparatus, computer, and readable storage medium
CN112348056A (zh) Point cloud data classification method, apparatus, device, and readable storage medium
CN116740422A (zh) Remote sensing image classification method and apparatus based on multi-modal attention fusion
WO2022100607A1 (zh) Neural network structure determination method and apparatus
CN117115900B (zh) Image segmentation method, apparatus, device, and storage medium
CN117058723A (zh) Palmprint recognition method and apparatus, and storage medium
CN111914809A (zh) Target object localization method, image processing method and apparatus, and computer device
WO2024066697A1 (zh) Image processing method and related apparatus
CN115272794A (zh) Model training method, computer device, and storage medium
WO2022226744A1 (en) Texture completion
CN114973424A (zh) Feature extraction model training and hand action recognition method, apparatus, and electronic device
CN116266259A (zh) Structured output method for image text, apparatus, electronic device, and storage medium
CN116563840B (zh) Scene text detection and recognition method based on weakly supervised cross-modal contrastive learning
JP7479507B2 (ja) Image processing method and apparatus, computer device, and computer program

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2023864104

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023864104

Country of ref document: EP

Effective date: 20240319

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864104

Country of ref document: EP

Kind code of ref document: A1