WO2023179420A1 - Image processing method and apparatus, electronic device, and storage medium - Google Patents

Image processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023179420A1
WO2023179420A1 (PCT/CN2023/081573; CN2023081573W)
Authority
WO
WIPO (PCT)
Prior art keywords
window
image
data
information exchange
images
Prior art date
Application number
PCT/CN2023/081573
Other languages
English (en)
French (fr)
Inventor
李卫
王星
夏鑫
吴捷
肖学锋
郑敏
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司
Publication of WO2023179420A1

Classifications

    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N 3/08 Learning methods for neural networks
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 9/00 Image coding
    • G06V 10/764 Image or video recognition using classification, e.g. of video objects
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Image or video recognition using neural networks
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present disclosure relate to the field of computer vision technology, for example, to an image processing method, device, electronic device, and storage medium.
  • the Transformer model is a deep neural network based on the self-attention mechanism. Its self-attention is not limited to local interactions: it can mine long-range dependencies, perform parallel computation, and learn an appropriate inductive bias for different task objectives. In recent years, owing to the outstanding performance of the Transformer model in natural language processing, vision researchers have begun to study visual Transformers.
  • two Transformer models are usually used to realize information exchange within the window and information exchange between windows respectively.
  • the shortcomings of this approach in the related technologies include at least the following: a network containing two Transformer models has many network layers, a complex structure, a large computational load, high inference latency, and deployment that is heavily constrained by platform resources.
  • Embodiments of the present disclosure provide an image processing method, apparatus, electronic device and storage medium, which can perform image processing based on a lightweight Transformer model. While ensuring the processing effect, it reduces the amount of computation, achieves efficient and fast visual image processing, and is easy to deploy on different platforms.
  • an embodiment of the present disclosure provides an image processing method, including:
  • combining multiple window images of an image to be processed with window feature data to be learned corresponding to the multiple window images respectively, to obtain multiple combined data;
  • performing self-attention transformation in the pixel dimension on each combined data to obtain an intra-window information exchange image and learned window feature data;
  • determining influence weights between the multiple intra-window information exchange images based on the multiple learned window feature data;
  • performing self-attention transformation in the window dimension on the multiple intra-window information exchange images according to the influence weights, to obtain an inter-window information fusion image.
  • embodiments of the present disclosure also provide an image processing device, including:
  • a combination module configured to combine multiple window images of the image to be processed with window feature data to be learned corresponding to the multiple window images, to obtain multiple combined data;
  • a first Transformer model configured to perform self-attention transformation in the pixel dimension on each combined data, to obtain an intra-window information exchange image and learned window feature data;
  • a weight determination module configured to determine the influence weights between the multiple intra-window information exchange images based on the multiple learned window feature data, wherein the multiple learned window feature data and the multiple intra-window information exchange images are respectively obtained from the multiple combined data;
  • a second Transformer model configured to perform self-attention transformation in the window dimension on the multiple intra-window information exchange images according to the influence weights, to obtain an inter-window information fusion image.
  • embodiments of the present disclosure also provide an electronic device, where the electronic device includes:
  • one or more processors;
  • a storage device for storing one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the image processing method described in any one of the embodiments of the present disclosure.
  • embodiments of the disclosure further provide a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the image processing method described in any embodiment of the disclosure.
  • Figure 1 is a schematic flowchart of an image processing method provided by Embodiment 1 of the present disclosure
  • Figure 2 is a structural block diagram of a Transformer model in an image processing method provided by Embodiment 1 of the present disclosure
  • Figure 3 is a schematic diagram of grouping and merging windows in an image processing method provided by Embodiment 1 of the present disclosure
  • FIG. 4 is a schematic flowchart of an image processing method provided by Embodiment 2 of the present disclosure.
  • Figure 5 is a structural block diagram of an image processing network in an image processing method provided in Embodiment 2 of the present disclosure
  • Figure 6 is a schematic structural diagram of an image processing device provided in Embodiment 3 of the present disclosure.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by Embodiment 4 of the present disclosure.
  • the term “include” and its variations are open-ended, ie, “including but not limited to.”
  • the term “based on” means “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a schematic flowchart of an image processing method provided by Embodiment 1 of the present disclosure.
  • the embodiments of the present disclosure are suitable for situations where the image to be processed is exchanged for information within a window and between windows through a window-based Transformer model.
  • the method may be performed by an image processing device, which may be implemented in the form of software and/or hardware, and may be configured in an electronic device, such as a computer.
  • the image processing method provided by this embodiment may include: performing the following steps through the Transformer model:
  • the Transformer model applied to image processing can also be called a visual Transformer model, and can include Transformer-like models and models derived from the Transformer model.
  • in the field of visual image processing, to reduce the computational complexity of the self-attention model, a window-based self-attention mechanism has been proposed: the image is divided into several non-overlapping local window images according to a fixed window size, and the attention computation is limited to each local window image. By itself, this mechanism cannot establish information exchange between windows.
  • the embodiment of the present disclosure can set corresponding window feature data to be learned for each window image after obtaining multiple window images of the image to be processed.
  • the reason why it is called the window feature data "to be learned" is because the window feature data set at this time is data set based on empirical values or experimental values, etc., and cannot fully represent the information contained in the window.
  • the number of channels of the window feature data can be equal to the number of channels of the window image, and the height × width of the window feature data can be 1 × 1, or another size.
  • the combined data can be considered to be the information of one pixel embedded on the basis of the original window image.
  • each combined data can perform self-attention transformation in the local pixel dimension separately. It can be considered that the self-attention transformation in the pixel dimension of multiple combined data is executed in parallel and does not affect each other.
  • the self-attention transformation mechanism can be used to determine the pixel weights of different pixels in the window image. Among them, the greater the pixel weight, the greater the impact of the pixel on the execution results of subsequent visual tasks. Afterwards, the pixels in the combined data can be multiplied by the corresponding weights to achieve self-attention transformation in the pixel dimension.
  • the data corresponding to the original window image can constitute the information exchange image within the window, and the data corresponding to the original window feature data to be learned can be called the learned window feature data.
  • the learned window feature data can correspond to a global feature representation of the local window, and channel information within the local window image can also be exchanged.
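  The partitioning-and-combination step described above (each window image plus its 1 × 1 × C window feature data forming one "combined data") can be sketched in NumPy. This is a minimal illustration; all function and variable names here are ours, not the patent's:

  ```python
  import numpy as np

  def combine_windows(image, win_tokens, win=3):
      # Split an H x W x C image into non-overlapping win x win windows and
      # append one window token (1 x 1 x C, flattened to 1 x C) to each,
      # mirroring the "combined data" described above.
      H, W, C = image.shape
      nh, nw = H // win, W // win
      combined = []
      for i in range(nh):
          for j in range(nw):
              window = image[i * win:(i + 1) * win, j * win:(j + 1) * win, :]
              pixels = window.reshape(-1, C)                # win*win pixel tokens
              token = win_tokens[i * nw + j].reshape(1, C)  # one extra "pixel"
              combined.append(np.concatenate([pixels, token], axis=0))
      return combined  # each entry has shape (win*win + 1, C)

  rng = np.random.default_rng(0)
  image = rng.random((6, 6, 4))          # the 6 x 6 x C example from Figure 2
  win_tokens = rng.random((4, 1, 1, 4))  # one token per window
  combined = combine_windows(image, win_tokens)
  ```

  For the 6 × 6 image with 3 × 3 windows, this yields four combined data of 10 tokens each (9 pixels plus one win_token), matching the counts given for Figure 2.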
  • S130 Determine the influence weights between information exchange images in multiple windows based on multiple learned window feature data.
  • the multiple learned window feature data and the multiple intra-window information exchange images are respectively obtained from the multiple combined data.
  • the self-attention transformation mechanism can be used to model the attention relationship between different windows based on the multiple learned window feature data, that is, to determine the influence weights between the multiple intra-window information exchange images.
  • the embodiments of the present disclosure model the attention relationship between windows based on the window feature data learned during the pixel-dimension self-attention transformation, which saves the deployment and computation of a second model required in the traditional solution and enables modeling the attention relationship between windows with little overhead.
  • the influence weights between the information exchange images within the window may be asymmetric.
  • the influence weight of window A on window B may be 0.7, while the influence weight of window B on window A may be 0.4.
  • for a given window, the influence weights of the other windows on that window can be determined from the multiple influence weights.
  • the process of window-dimension self-attention transformation for each intra-window information exchange image may include: according to the influence weights of the other windows on this window, integrating the information of the other windows' intra-window information exchange images into this window's image to obtain the corresponding inter-window information fusion image. In this way, inter-window information fusion images can be obtained for all windows, achieving global information exchange between windows.
  • two consecutive self-attention transformations can be performed on the window image based on a single Transformer model, which are, in sequence, a self-attention transformation in the pixel dimension and a self-attention transformation in the window dimension.
  • Information exchange within local windows can be achieved through self-attention transformation in the pixel dimension
  • information exchange between global windows can be achieved through self-attention transformation in the window dimension.
  • the embodiments of the present disclosure can achieve the same processing effect based on a single Transformer model, greatly simplifying the network structure and making the Transformer model more lightweight. This reduces the amount of computation, achieves efficient and fast visual image processing, and facilitates deployment on different platforms.
  • FIG. 2 is a structural block diagram of a Transformer model in an image processing method provided by Embodiment 1 of the present disclosure.
  • the Transformer model can include, in sequence, a layer normalization (Layer Normalization, LN) layer, a separable self-attention module (Separable Attention, Sep-Attn), an LN layer, and a multi-layer perceptron (Multi-Layer Perceptron, MLP), among other network layers; non-adjacent layers can have skip connections (as shown in Figure 2, there can be arrowed line connections between non-adjacent layers, and the connections can be marked with a plus sign inside a circle to indicate data fusion).
  • the input of the Transformer model can include the image to be processed, and the Transformer model can set multiple window feature data to be learned (represented by win_tokens in the figure), and each window feature data to be learned corresponds to each window image of the image to be processed.
  • the size of the image to be processed in Figure 2 can be 6 ⁇ 6 ⁇ C, which can be divided into 4 window images, and the size of each window image can be 3 ⁇ 3 ⁇ C.
  • C represents the number of channels.
  • the window image in the image to be processed can correspond to the window feature data (i.e. win_tokens) at the corresponding position in win_tokens.
  • the window image at the upper right corner of the image to be processed can correspond to the win_token in the upper right corner of win_tokens, and each win_token can have a size of 1 × 1 × C.
  • Each window image and the corresponding window feature data to be learned can enter the Sep-Attn module after being normalized by the LN layer.
  • in the Sep-Attn module, the pixel-dimension self-attention transformation is represented by Depthwise self-attention in the figure, and the window-dimension self-attention transformation is represented by Pointwise self-attention in the figure.
  • subjecting the combined data to pixel-dimension self-attention transformation includes: for each combined data, performing matrix transformations on the combined data to obtain a first query vector, a first key vector and a first value vector corresponding to the combined data; determining a first attention map according to the first query vector and the first key vector; and multiplying the first attention map with the first value vector.
  • each combined data is composed of the window image and its corresponding win_token.
  • the window data in the figure contains 9 pixels of data, and win_token is equivalent to 1 pixel of data, that is, the combined data is equivalent to 10 pixels of data.
  • the four combined data can perform self-attention transformation in the pixel dimension respectively.
  • the figure only shows the pixel-dimension self-attention transformation process for one of the combined data; the transformation process for the other combined data is the same and is represented by "..." in the figure.
  • the pixel-dimensional self-attention transformation process shown in Figure 2 may include: performing matrix transformation on the combined data based on the self-attention mechanism (i.e. W Q1 , W K1 , W V1 in the figure) to obtain the corresponding first query vector Q1, the first key vector K1 and the first value vector V1.
  • by multiplying the first query vector Q1 with the first key vector K1, the first attention map can be determined.
  • each element of the first attention map is multiplied with the corresponding pixel in the first value vector V1 to change the pixel's weight, and the combined data after the pixel-dimension self-attention transformation can be obtained.
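  The Q1/K1/V1 step just described can be sketched as follows. The softmax normalization and scaled dot product are standard self-attention conventions assumed here; the text does not spell out the normalization:

  ```python
  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  def pixel_self_attention(combined, W_q1, W_k1, W_v1):
      # Pixel-dimension self-attention over one combined window: N tokens
      # (win*win pixels plus one win_token) attend to each other.
      Q1, K1, V1 = combined @ W_q1, combined @ W_k1, combined @ W_v1
      d = Q1.shape[-1]
      attn = softmax(Q1 @ K1.T / np.sqrt(d))  # first attention map, N x N
      out = attn @ V1                         # tokens re-weighted by attention
      return out[:-1], out[-1]                # intra-window image, learned win_token

  rng = np.random.default_rng(0)
  combined = rng.random((10, 4))              # 9 pixels + 1 win_token, C = 4
  W = [rng.random((4, 4)) for _ in range(3)]  # illustrative W_Q1, W_K1, W_V1
  intra_image, learned_token = pixel_self_attention(combined, *W)
  ```

  The last row of the output corresponds to the original win_token position and becomes the learned window feature data; the remaining rows form the intra-window information exchange image.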
  • the data corresponding to the original window image can constitute the information exchange image within the window, and the data corresponding to the original window feature data to be learned can be called the learned window feature data.
  • in an embodiment, determining the influence weights between the multiple intra-window information exchange images based on the multiple learned window feature data includes: splicing the multiple learned window feature data to obtain spliced data; performing matrix transformation on the spliced data after layer normalization and activation processing to obtain a second query vector and a second key vector; determining a second attention map based on the second query vector and the second key vector; and determining the influence weights between the multiple intra-window information exchange images based on the second attention map.
  • win_tokens completed by learning can be spliced into spliced data (in the figure, win_tokens are used again, but the win_tokens at this time can be used to represent the global information of the corresponding window).
  • win_tokens can be layer normalized and activated through the LN layer and activation layer (Activation, Act) to adjust win_tokens to make it more reasonable.
  • win_tokens can be matrix transformed based on the self-attention mechanism (ie, W Q2 , W K2 in the figure) to obtain the second query vector Q2 and the second key vector K2.
  • the second attention map can be determined by multiplying the second query vector Q2 and the second key vector K2.
  • the multiple win_tokens in Figure 2 include 4 win_tokens, and each win_token can represent the information of the corresponding window.
  • the height ⁇ width of the corresponding determined second attention map can be 4 ⁇ 4, and each element in the second attention map can represent the influence weight of the window where the column is located on the window where the row is located. It can be considered that the influence weights between information exchange images in multiple windows can be found based on the second attention map.
  • performing window-dimension self-attention transformation on the multiple intra-window information exchange images according to the influence weights includes: for each intra-window information exchange image, determining, according to the influence weights, the target influence weights with which the other windows' information exchange images affect this window's information exchange image; multiplying the other windows' information exchange images by the corresponding target influence weights; and integrating the multiplication results into this window's information exchange image.
  • the figure contains four information exchange images in the window.
  • the intra-window information exchange image currently being processed can be called this window's information exchange image, and the intra-window information exchange images other than it can be called the other windows' information exchange images.
  • Figure 2 only shows the process of window-dimensional self-attention transformation of the information exchange image in one of the windows.
  • the transformation process of the information exchange image in other windows is the same, and is represented by a dotted arrow in the figure.
  • the window-dimension self-attention transformation process shown in Figure 2 may include: selecting from the second attention map the row corresponding to this window's information exchange image, where each element in that row represents the target influence weight of the window corresponding to its column on the current window.
  • each element in the row can be used as a target influence weight; that is, the target influence weights include the influence weights of the other windows on this window, and can also include the influence weight of this window on itself.
  • the information exchange images in other windows are multiplied by the corresponding target influence weights, and the multiplication results are integrated into the information exchange images in this window, which may include: combining the information exchange images in multiple windows with the corresponding target influences The weights are multiplied and added.
  • the information exchange image in each window undergoes the above-mentioned window dimension transformation, and the information fusion image between all windows can be obtained.
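  The window-dimension fusion described above (multiply each intra-window image by its target influence weight and sum) can be sketched as a weighted combination over the window axis; the names are illustrative:

  ```python
  import numpy as np

  def fuse_windows(intra_images, attn_map):
      # Window-dimension transform: each window's output is the influence-
      # weighted sum of all intra-window information exchange images
      # (including this window's own image, weighted by its self-influence).
      stacked = np.stack(intra_images)             # (num_win, pixels, C)
      return np.einsum('ij,jpc->ipc', attn_map, stacked)

  rng = np.random.default_rng(0)
  intra_images = [rng.random((9, 4)) for _ in range(4)]  # four 3x3 windows
  attn_map = rng.random((4, 4))
  attn_map /= attn_map.sum(axis=1, keepdims=True)  # rows as influence weights
  fused = fuse_windows(intra_images, attn_map)     # inter-window fusion images
  ```

  Row i of `attn_map` holds window i's target influence weights, so `fused[i]` is window i's inter-window information fusion image.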
  • in an embodiment, after obtaining the multiple combined data, the method further includes: grouping and merging the multiple combined data to obtain merged window data; correspondingly, performing pixel-dimension self-attention transformation on the multiple combined data includes: performing pixel-dimension self-attention transformation on the merged window data.
  • FIG. 3 is a schematic diagram of grouping and merging windows in an image processing method provided by Embodiment 1 of the present disclosure.
  • if 16 window images are included before grouping, these 16 window images can also be grouped and merged.
  • every 4 adjacent window images can be divided into one group and merged into a large window, yielding 4 large window images.
  • each large window image can be combined with the corresponding window feature data to be learned, so that the merged window data can be obtained.
  • the multiple merged window data can be subjected to pixel-dimension self-attention transformation to obtain the intra-merged-window information exchange images and the learned window feature data of the multiple merged windows.
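  The grouping-and-merging of Figure 3 (16 small windows merged into 4 large windows) can be sketched as follows; window and group sizes are the figure's example values:

  ```python
  import numpy as np

  def group_merge(image, win=3, group=2):
      # Partition the image into win x win windows, then merge each
      # group x group block of adjacent windows into one large window
      # (Figure 3: a 4 x 4 grid of small windows -> 4 large windows).
      H, W, C = image.shape
      big = win * group
      return [image[i:i + big, j:j + big, :]
              for i in range(0, H, big)
              for j in range(0, W, big)]

  rng = np.random.default_rng(0)
  image = rng.random((12, 12, 8))   # 16 windows of 3 x 3
  large_windows = group_merge(image)
  ```

  Each large window can then be combined with its window feature data and processed exactly like an ordinary combined data.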
  • the pixels where the dots are located in the window/large window image can communicate with other pixels in the window/large window.
  • the influence weights between the information exchange images in the multiple merged windows can be determined based on the learned window feature data of the merged windows. Referring to Figure 3, if the feature data of the window where the dot is located is used as the window feature data corresponding to the current window, the influence weights of the other windows on the current window can be determined.
  • the self-attention transformation of the window dimension can be performed on the information exchange images in multiple merged windows to obtain the information fusion image between merged windows.
  • after the above transformations, the corresponding self-attention-transformed pixels can be obtained.
  • the technical solution of the embodiments of the present disclosure combines multiple window images of the image to be processed with the window feature data to be learned corresponding to the multiple window images to obtain multiple combined data; performs pixel-dimension self-attention transformation on each combined data to obtain intra-window information exchange images and learned window feature data; determines, based on the multiple learned window feature data, the influence weights between the multiple intra-window information exchange images; and performs window-dimension self-attention transformation on the multiple intra-window information exchange images according to the influence weights, to obtain inter-window information fusion images.
  • during the pixel-dimension self-attention transformation, the window feature data can learn the information of all pixels in the window, so that the learned window feature data can fully represent the information of the corresponding window.
  • based on the learned window feature data, the influence weights between the intra-window information exchange images can be determined, so the attention relationship between windows can be modeled with very little computational overhead.
  • the information exchange images in each window can be integrated into the information exchange images in other windows, so that global information exchange can be achieved.
  • the technical solution of the embodiments of the present disclosure can achieve the same processing effect based on a single Transformer model, greatly simplifying the network structure and making the Transformer model more lightweight. This reduces the amount of computation, achieves efficient and fast visual image processing, and facilitates deployment on different platforms.
  • the embodiments of the present disclosure can be combined with various options in the image processing methods provided in the above embodiments.
  • the image processing method provided in this embodiment can be applied to at least one of the following image processing networks: an image classification network, an image segmentation network and an image detection network, and can accomplish various visual tasks such as image classification, semantic segmentation and object detection. While reaching an advanced level in the field, it also makes the model more lightweight.
  • the image processing network can adopt a multi-stage hierarchical structure, and each stage can include a downsampling layer and the Transformer model provided by the embodiment of the present disclosure, so that the target feature image of the original image can be extracted.
  • the Transformer model can also be used to configure the corresponding window feature data to be learned for each window image of the currently processed original image based on the window feature data corresponding to the historically processed image, thereby achieving iterative learning optimization of the window feature data.
  • FIG. 4 is a schematic flowchart of an image processing method provided by Embodiment 2 of the present disclosure. As shown in Figure 4, the image processing method provided by this embodiment is applied to at least one of the following image processing networks: an image classification network, an image segmentation network and an image detection network, and may include:
  • the original image may be an image that requires image classification, semantic segmentation or target detection.
  • the original image may be an image collected in real time or a pre-stored image. Through a relevant down-sampling method, a feature image with higher-level semantics can be obtained from the original image, and this feature image can be used as the image to be processed.
  • S420 Use the sliding window to slide on the image to be processed to obtain multiple window images that do not overlap each other.
  • the Transformer model can set the window feature data to be learned corresponding to each window image in the current image to be processed based on the learned window feature data corresponding to previously processed images, so that the window feature data can be continuously updated.
  • the optimized and updated window feature data are more reasonable, which can improve the efficiency of the pixel-dimension self-attention transformation.
  • S440 Combine multiple window images of the image to be processed and window feature data to be learned corresponding to the multiple window images to obtain multiple combined data.
  • S460 Determine the influence weights between the information exchange images in the multiple windows corresponding to the multiple combined data respectively based on the multiple learned window feature data respectively corresponding to the multiple combined data.
  • the information fusion images between windows can be spliced according to the positions of the corresponding original window images to obtain a spliced image.
  • the spliced image can continue to be processed by other network layers of the Transformer model (such as LN layer, MLP layer), and the processing result output by the Transformer model can be called the attention-transformed image.
  • S490 Downsample the attention-transformed image to obtain a new image to be processed until the attention-transformed image becomes the target feature image.
  • the attention-transformed image can be regarded as the result of performing a self-attention transformation on the down-sampled feature image.
  • performing a self-attention transformation on the feature image helps improve the execution accuracy of visual tasks.
  • the down-sampling operation can be continued to obtain a higher-level feature image, and the higher-level feature image can be used as a new image to be processed to perform steps S420-S480 again.
  • this cycle continues until the feature image of the required level is determined.
  • the feature image of the required level can be attention-transformed, and the corresponding attention-transformed image is used as the target feature image.
  • FIG. 5 is a structural block diagram of an image processing network in an image processing method provided in Embodiment 2 of the present disclosure.
  • the image processing network can adopt a multi-stage hierarchical structure.
  • the figure shows four hierarchical structures.
  • the hierarchical structure of each stage may include a downsampling layer (represented by Patch Merging in the figure) and the Transformer model provided by the embodiment of the present disclosure (represented by Sep-ViT Block in the figure).
  • the original image size is H ⁇ W ⁇ 3;
  • the sizes of the down-sampled images in the first, second, third and fourth hierarchical structures are as shown in the figure. The image output by the fourth hierarchical structure will continue to undergo subsequent processing to obtain the target feature image, and the corresponding visual tasks are performed based on the target feature image.
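The per-stage sizes are only shown in FIG. 5, so the sketch below assumes the common hierarchical-ViT convention of a 4x spatial reduction in the first stage and 2x in each later stage; the channel widths passed in are hypothetical placeholders, not values stated in the patent:

```python
def stage_shapes(H, W, channels, reductions=(4, 2, 2, 2)):
    """Spatial sizes after each Patch Merging stage.

    Assumes a 4x reduction in the first stage and 2x in each later
    stage (a common hierarchical-ViT choice; the exact sizes in this
    patent are given only in FIG. 5). `channels` lists an assumed
    channel width for each stage.
    """
    shapes = []
    for r, c in zip(reductions, channels):
        H, W = H // r, W // r
        shapes.append((H, W, c))
    return shapes

# An H x W x 3 input of size 224 x 224 under these assumptions:
shapes = stage_shapes(224, 224, channels=[96, 192, 384, 768])
```

Under these assumptions the four stages operate at progressively coarser resolutions, with the fourth stage's output fed to the subsequent processing that yields the target feature image.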
  • the method can reach a relatively advanced level in the field on various visual tasks such as image classification, semantic segmentation and target detection, while making the model more lightweight.
  • the image processing network can adopt a multi-stage hierarchical structure, and each stage can include a downsampling layer and the Transformer model provided by the embodiment of the present disclosure, so that the target feature image of the original image can be extracted.
  • the Transformer model can also be used to configure the corresponding window feature data to be learned for each window image of the currently processed original image based on the window feature data corresponding to the historically processed image, thereby achieving iterative learning optimization of the window feature data.
  • FIG. 6 is a schematic structural diagram of an image processing device provided in Embodiment 3 of the present disclosure.
  • the image processing device provided in this embodiment is suitable for situations where the image to be processed is exchanged within a window and between windows through a window-based Transformer model.
  • the image processing device provided by the embodiment of the present disclosure may include:
  • the combination module 610 is configured to combine multiple window images of the image to be processed and the window feature data to be learned corresponding to each window image to obtain multiple combined data;
  • the first self-attention module 620 is configured to perform a pixel-dimensional self-attention transformation on the combined data for each combined data to obtain the information exchange image within the window and the window feature data after learning is completed;
  • the weight determination module 630 is configured to determine the influence weights between the multiple intra-window information exchange images based on the multiple learned window feature data, wherein the multiple learned window feature data are respectively obtained from the multiple combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple combined data;
  • the second self-attention module 640 is configured to perform window-dimensional self-attention transformation on the information exchange images in multiple windows according to the influence weight, so as to obtain an information fusion image between windows.
  • the first self-attention module 620 may be configured to perform a pixel-dimensional self-attention transformation on each combined data in the following manner:
  • For each combined data, perform a matrix transformation on the combined data to obtain a first query vector, a first key vector and a first value vector corresponding to the combined data;
  • a first attention map is determined based on the first query vector and the first key vector, and the first attention map and the first value vector are multiplied.
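The pixel-dimension (Depthwise) self-attention described above can be sketched in NumPy as follows. The softmax normalization and 1/sqrt(d) scaling are standard attention details assumed here rather than stated in the text, and the weight matrices are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_dim_attention(combined, WQ1, WK1, WV1):
    """Pixel-dimension self-attention over one combined data.

    `combined` holds N = ws*ws + 1 positions (the window pixels plus
    the win_token). Matrix transforms WQ1/WK1/WV1 produce the first
    query/key/value vectors; the first attention map is formed from Q1
    and K1 and multiplied with V1 to reweight the positions.
    """
    Q1, K1, V1 = combined @ WQ1, combined @ WK1, combined @ WV1
    attn = softmax(Q1 @ K1.T / np.sqrt(Q1.shape[-1]))  # first attention map
    out = attn @ V1
    exchange_image = out[:-1]  # intra-window information exchange image
    learned_token = out[-1]    # learned window feature data
    return exchange_image, learned_token

C = 4
combined = rng.standard_normal((10, C))  # 9 pixels + 1 win_token, as in FIG. 2
W = [rng.standard_normal((C, C)) for _ in range(3)]
img_part, tok = pixel_dim_attention(combined, *W)
```

The last row of the output plays the role of the learned window feature data; the remaining rows form the intra-window information exchange image.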
  • the weight determination module 630 can be configured to determine the influence weights between the multiple intra-window information exchange images based on the multiple learned window feature data in the following manner:
  • the multiple learned window feature data are spliced to obtain spliced data; the spliced data, after layer normalization and activation processing, is subjected to a matrix transformation to obtain a second query vector and a second key vector;
  • the second attention map is determined according to the second query vector and the second key vector, and the influence weights between the multiple intra-window information exchange images are determined according to the second attention map.
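The second-attention-map computation can be sketched as follows. Layer normalization and the activation applied before the matrix transforms (per the text) are elided for brevity, the softmax normalization is an assumption, and all weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_influence_weights(win_tokens, WQ2, WK2):
    """Second attention map from the spliced learned win_tokens.

    The learned window feature data are spliced into one matrix,
    matrix transforms give the second query/key vectors, and the
    resulting second attention map yields the influence weights
    between the intra-window information exchange images.
    """
    Q2, K2 = win_tokens @ WQ2, win_tokens @ WK2
    attn = softmax(Q2 @ K2.T / np.sqrt(Q2.shape[-1]))  # second attention map
    return attn  # attn[i, j]: influence weight of window j on window i

C = 4
tokens = rng.standard_normal((4, C))  # 4 learned win_tokens, as in FIG. 2
weights = window_influence_weights(
    tokens, rng.standard_normal((C, C)), rng.standard_normal((C, C))
)
```

With 4 windows the map is 4 x 4, and each element gives the weight of the column's window on the row's window, matching the description of FIG. 2.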
  • the second self-attention module 640 can be configured to perform the window-dimension self-attention transformation on the multiple intra-window information exchange images according to the influence weights in the following manner: for each intra-window information exchange image, determining the target influence weights of the other windows' information exchange images on this window's information exchange image according to the influence weights; multiplying the other windows' information exchange images by the corresponding target influence weights, and incorporating the multiplication results into this window's information exchange image.
  • the Transformer model can also include:
  • the merging module is configured to group and merge the multiple combined data after the multiple combined data are obtained, to obtain multiple merged window data;
  • the first self-attention module 620 may be configured to: for each merged window data, perform a pixel-dimensional self-attention transformation on the merged window data.
  • the image processing device can be applied to at least one of the following image processing networks: an image classification network, an image segmentation network, and an image detection network.
  • the image processing device may also include:
  • the downsampling layer is configured to downsample the input original image to obtain the image to be processed, before the multiple window images of the image to be processed and the window feature data to be learned corresponding to the multiple window images are combined through the Transformer model.
  • the sliding window is used to slide on the image to be processed to obtain multiple window images that do not overlap with each other;
  • the window feature data to be learned corresponding to the plurality of window images are determined.
  • the Transformer model is also used to determine the attention-transformed image based on the information fusion image between windows after obtaining the information fusion image between windows through the Transformer model;
  • the image after attention transformation is downsampled to obtain a new image to be processed until the image after attention transformation becomes the target feature image.
  • the image processing device provided by the embodiments of the present disclosure can execute the image processing method provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method.
  • Terminal devices in embodiments of the present disclosure may include mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDA), tablet computers (PAD), portable multimedia players (PMP) and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital televisions (TV) and desktop computers.
  • the electronic device shown in FIG. 7 is only an example.
  • the electronic device 700 may include a processing device (such as a central processing unit, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703.
  • in the RAM 703, various programs and data required for the operation of the electronic device 700 are also stored.
  • the processing device 701, ROM 702 and RAM 703 are connected to each other via a bus 704.
  • An input/output (I/O) interface 705 is also connected to bus 704.
  • input devices 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 707 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 708 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 709.
  • Communication device 709 may allow electronic device 700 to communicate wirelessly or wiredly with other devices to exchange data.
  • Although FIG. 7 illustrates an electronic device 700 having various means, it should be understood that implementing or providing all of the illustrated means is not required; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program including program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 709, or installed from the storage device 708, or installed from the ROM 702.
  • when the computer program is executed by the processing device 701, the above-mentioned functions in the image processing method of the embodiments of the present disclosure are performed.
  • the electronic device provided by the embodiments of the present disclosure belongs to the same disclosed concept as the image processing method provided by the above embodiments.
  • technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
  • Embodiments of the present disclosure provide a computer storage medium on which a computer program is stored.
  • when the program is executed by a processor, the image processing method provided by the above embodiments is implemented.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory (FLASH), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium can be transmitted using any appropriate medium, including but not limited to: electric wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.
  • the client and server can communicate using any currently known or future developed network protocol, such as the HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communications (e.g., a communications network) in any form or medium.
  • Examples of communication networks include local area networks (LAN), wide area networks (WAN), internetworks (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device performs the following steps through the Transformer model: combining multiple window images of the image to be processed with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data; for each combined data, performing a pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data; determining influence weights between the multiple obtained intra-window information exchange images based on the multiple learned window feature data; and performing a window-dimension self-attention transformation on the multiple intra-window information exchange images respectively according to the influence weights, to obtain inter-window information fusion images.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or can be implemented by a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware, where the name of a unit or module does not, under certain circumstances, constitute a limitation on the unit or module itself. Exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGA), Application Specific Integrated Circuits (ASIC), Application Specific Standard Parts (ASSP), Systems on Chip (SOC), Complex Programmable Logic Devices (CPLD), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • Machine-readable storage media may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • Example 1 provides an image processing method, which includes performing the following steps through a Transformer model: combining multiple window images of an image to be processed with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data;
  • for each combined data, the combined data is subjected to a pixel-dimension self-attention transformation to obtain an intra-window information exchange image and learned window feature data; influence weights between the multiple intra-window information exchange images are determined according to the multiple learned window feature data;
  • a window-dimension self-attention transformation is performed on the multiple intra-window information exchange images respectively according to the influence weights, to obtain inter-window information fusion images.
  • Example 2 provides an image processing method, further including:
  • performing a pixel-dimensional self-attention transformation on the combined data includes:
  • For each combined data, perform a matrix transformation on the combined data to obtain a first query vector, a first key vector and a first value vector corresponding to the combined data;
  • a first attention map is determined based on the first query vector and the first key vector, and the first attention map and the first value vector are multiplied.
  • Example 3 provides an image processing method, further including:
  • determining the influence weights between the multiple intra-window information exchange images based on the multiple learned window feature data includes: splicing the multiple learned window feature data to obtain spliced data; performing layer normalization and activation processing on the spliced data, followed by a matrix transformation, to obtain a second query vector and a second key vector;
  • a second attention map is determined according to the second query vector and the second key vector, and the influence weights between the multiple intra-window information exchange images are determined according to the second attention map.
  • Example 4 provides an image processing method, further including:
  • the window-dimensional self-attention transformation is performed on multiple information exchange images within the window according to the influence weight, including:
  • For each intra-window information exchange image, determine the target influence weights of the information exchange images in other windows on the information exchange image in this window according to the influence weights;
  • the information exchange images in other windows are multiplied by the corresponding target influence weights, and each multiplication result is integrated into the information exchange image in this window.
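The window-dimension fusion described in Example 4 can be sketched as a weighted sum over windows. The shapes and the uniform weights below are illustrative; including the window's own weight in the sum follows the FIG. 2 discussion elsewhere in the document:

```python
import numpy as np

def fuse_windows(exchange_images, influence_weights):
    """Window-dimension self-attention fusion.

    Each window's information exchange image is rebuilt as a weighted
    sum: every window image (including the window itself) is multiplied
    by its target influence weight for this window, and the products
    are accumulated into this window's image.
    """
    n = exchange_images.shape[0]
    fused = np.zeros_like(exchange_images)
    for i in range(n):          # this window
        for j in range(n):      # contribution of window j
            fused[i] += influence_weights[i, j] * exchange_images[j]
    return fused

imgs = np.stack([np.full((3, 3), float(k)) for k in range(4)])
W = np.full((4, 4), 0.25)       # uniform illustrative influence weights
fused = fuse_windows(imgs, W)
```

With uniform weights every fused window becomes the average of all four window images, realizing the global information exchange the method describes.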
  • Example 5 provides an image processing method, further including:
  • the method further includes: grouping and merging the multiple combined data to obtain multiple merged window data;
  • correspondingly, the step of performing a pixel-dimension self-attention transformation on each combined data includes: for each merged window data, performing a pixel-dimension self-attention transformation on the merged window data.
  • Example 6 provides an image processing method, further including:
  • the method is applied to at least one of the following image processing networks: image classification network, image segmentation network and image detection network.
  • Example 7 provides an image processing method, further including:
  • before combining the multiple window images of the image to be processed with the window feature data to be learned corresponding to each window image through the Transformer model, the method further includes:
  • a sliding window is used to slide on the image to be processed to obtain the multiple window images that do not overlap each other;
  • the window feature data to be learned corresponding to the plurality of window images are determined.
  • Example 8 provides an image processing method, further comprising:
  • the method further includes:
  • the attention-transformed image is determined based on the information fusion image between windows;
  • the image after attention transformation is down-sampled to obtain a new image to be processed until the image after attention transformation becomes the target feature image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the present disclosure disclose an image processing method and apparatus, an electronic device, and a storage medium. The method includes performing the following steps through a Transformer model: combining multiple window images of an image to be processed with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data; for each combined data, performing a pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data; determining influence weights between the multiple intra-window information exchange images according to the multiple learned window feature data; and performing a window-dimension self-attention transformation on the multiple intra-window information exchange images respectively according to the influence weights, to obtain inter-window information fusion images.

Description

An image processing method and apparatus, electronic device, and storage medium
This disclosure claims priority to the Chinese patent application No. 202210301372.2 filed with the Chinese Patent Office on March 24, 2022, the entire contents of which are incorporated into this disclosure by reference.
Technical Field
The embodiments of the present disclosure relate to the field of computer vision technology, for example, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
The Transformer model is a deep neural network based on a self-attention mechanism. Its self-attention mechanism is not restricted to local interactions: it can mine long-distance dependencies while computing in parallel, and can learn suitable inductive biases according to different task objectives. In recent years, the outstanding performance of Transformer models in natural language processing has led vision researchers to begin cross-domain research on vision Transformers.
In related methods of image processing with window-based Transformer models, two successive Transformer models are usually used to realize intra-window information exchange and inter-window information exchange respectively. The shortcomings of this related-art method include at least: a network containing two Transformer models has many network layers and a complex structure, and suffers from a large amount of computation, high inference latency, and deployment heavily constrained by platform resources.
Summary
The embodiments of the present disclosure provide an image processing method and apparatus, an electronic device, and a storage medium, which can perform image processing based on a lightweight Transformer model. While guaranteeing the processing effect, the amount of computation can be reduced, efficient and fast visual image processing can be achieved, and deployment on different platforms is facilitated.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
performing the following steps through a Transformer model:
combining multiple window images of an image to be processed with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data;
for each combined data, performing a pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and the learned window feature data;
determining influence weights between the multiple intra-window information exchange images according to the multiple learned window feature data, wherein the multiple learned window feature data are respectively obtained from the multiple combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple combined data;
performing a window-dimension self-attention transformation on the multiple intra-window information exchange images respectively according to the influence weights, to obtain inter-window information fusion images.
In a second aspect, an embodiment of the present disclosure further provides an image processing apparatus, including:
a Transformer model, configured to perform the functions of the following modules:
a combination module, configured to combine multiple window images of an image to be processed with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data;
a first Transformer model, configured to, for each combined data, perform a pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and the learned window feature data;
a weight determination module, configured to determine influence weights between the multiple intra-window information exchange images according to the multiple learned window feature data, wherein the multiple learned window feature data are respectively obtained from the multiple combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple combined data;
a second Transformer model, configured to perform a window-dimension self-attention transformation on the multiple intra-window information exchange images respectively according to the influence weights, to obtain inter-window information fusion images.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a storage apparatus configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the image processing method according to any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the image processing method according to any embodiment of the present disclosure.
Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flowchart of an image processing method provided by Embodiment 1 of the present disclosure;
FIG. 2 is a structural block diagram of a Transformer model in an image processing method provided by Embodiment 1 of the present disclosure;
FIG. 3 is a schematic diagram of grouping and merging windows in an image processing method provided by Embodiment 1 of the present disclosure;
FIG. 4 is a schematic flowchart of an image processing method provided by Embodiment 2 of the present disclosure;
FIG. 5 is a structural block diagram of an image processing network in an image processing method provided by Embodiment 2 of the present disclosure;
FIG. 6 is a schematic structural diagram of an image processing apparatus provided by Embodiment 3 of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device provided by Embodiment 4 of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit performing the illustrated steps.
The term "include" and its variants as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order of the functions performed by these apparatuses, modules or units, or their interdependence.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
Embodiment 1
FIG. 1 is a schematic flowchart of an image processing method provided by Embodiment 1 of the present disclosure. The embodiments of the present disclosure are applicable to situations where intra-window information exchange and inter-window information exchange are performed on an image to be processed through a window-based Transformer model. The method may be performed by an image processing apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device, for example, in a computer.
As shown in FIG. 1, the image processing method provided by this embodiment may include performing the following steps through a Transformer model:
S110. Combine multiple window images of an image to be processed with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data.
A Transformer model applied to image processing may also be called a vision Transformer model, and may include Transformer-like models as well as models derived from the Transformer model.
In the field of visual image processing, in order to mitigate the computational complexity of self-attention models, a window-based self-attention mechanism has been proposed: the image is divided into several non-overlapping local window images according to a fixed window size, and the computation of attention is restricted to the local window images. This mechanism cannot establish information exchange between windows.
Building on the above mechanism, in the embodiments of the present disclosure, after obtaining the multiple window images of the image to be processed, corresponding window feature data to be learned can be set for each window image. The window feature data is called "to be learned" because the window feature data set at this time is set according to empirical or experimental values, and cannot fully characterize the information contained in the window.
Combining the multiple window images with their respective window feature data to be learned can be regarded as embedding the window feature data into the corresponding window image to be learned. The number of channels of the window feature data may be equal to the number of channels of the window image, and the height x width of the window feature data may be 1 x 1 or another size. Exemplarily, when the height x width of the window feature data is 1 x 1, the combined data can be regarded as the original window image with the information of one pixel embedded.
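The combination of a window image with its 1 x 1 x C window feature data (step S110) can be sketched as appending the token as one extra "pixel". The function and variable names below are illustrative, not from the patent:

```python
import numpy as np

def combine_with_token(window, win_token):
    """Embed a window feature token into its window image.

    The window image (ws x ws x C) is flattened to ws*ws pixel vectors
    and the 1 x 1 x C token to be learned is appended as one extra
    position, giving combined data of ws*ws + 1 positions.
    """
    ws, _, C = window.shape
    pixels = window.reshape(-1, C)    # ws*ws rows of C channels
    token = win_token.reshape(1, C)   # the win_token as one row
    return np.concatenate([pixels, token], axis=0)

window = np.ones((3, 3, 4))           # a 3 x 3 window, C = 4
token = np.zeros((1, 1, 4))           # its win_token to be learned
combined = combine_with_token(window, token)
```

For a 3 x 3 window this yields combined data equivalent to 10 pixels, matching the description of FIG. 2.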
S120. For each combined data, perform a pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data.
In the embodiments of the present disclosure, each combined data can undergo a local pixel-dimension self-attention transformation separately; the pixel-dimension self-attention transformations of the multiple combined data can be regarded as executed in parallel without affecting each other. For each combined data, the self-attention mechanism can be used to determine the pixel weights of different pixels in the window image, where a larger pixel weight can be regarded as indicating a greater influence of that pixel on the execution result of the subsequent visual task. Then, the pixels in the combined data can be multiplied by the corresponding weights respectively, so as to realize the pixel-dimension self-attention transformation.
In the combined data after the pixel-dimension self-attention transformation, the data corresponding to the original window image can constitute the intra-window information exchange image, and the data corresponding to the original window feature data to be learned can be called the learned window feature data. By embedding the window feature data to be learned into the corresponding window image and using the self-attention mechanism to exchange self-attention information between the window feature data and all pixel information in its local window, the window feature data can learn a global feature representation of the corresponding local window, and the channel information within the local window image can also be exchanged mutually.
S130. Determine influence weights between the multiple intra-window information exchange images according to the multiple learned window feature data, wherein the multiple learned window feature data are respectively obtained from the multiple combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple combined data.
Since each learned window feature data can fully characterize the information in the corresponding window, the self-attention mechanism can be used to model the attention relationships between different windows according to the multiple learned window feature data, that is, to determine the influence weights between the multiple intra-window information exchange images.
After the traditional window-based self-attention mechanism, information exchange between windows needs to be established by setting up a new model. Compared with the traditional approach, the embodiments of the present disclosure can model the attention relationships between windows according to the window feature data learned during the pixel-dimension self-attention transformation, which saves the deployment and computation of another model in the traditional scheme and can model the attention relationships between windows with very little overhead.
S140. Perform a window-dimension self-attention transformation on the multiple intra-window information exchange images respectively according to the influence weights, to obtain inter-window information fusion images.
The influence weights between intra-window information exchange images can be asymmetric; for example, the influence weight of window A on window B may be 0.7, while the influence weight of window B on window A may be 0.4. For each intra-window information exchange image, the influence weights of the other windows on that window can be determined from the multiple influence weights.
Then, the process of performing a window-dimension self-attention transformation on each intra-window information exchange image may include: according to the influence weights of the other windows on this window, incorporating the information of the other windows' intra-window information exchange images into this window's information exchange image, so as to obtain the corresponding inter-window information fusion image. In this way, the information fusion image between this window and the other windows can be obtained, realizing global information exchange between windows.
In the embodiments of the present disclosure, two consecutive self-attention transformations can be performed on the window images based on a single Transformer model, namely, a pixel-dimension self-attention transformation followed by a window-dimension self-attention transformation. Information exchange within local windows can be realized through the pixel-dimension self-attention transformation, and information exchange between global windows can be realized through the window-dimension self-attention transformation. Compared with the related art, which realizes intra-window information exchange and inter-window information exchange based on two Transformer models respectively, the embodiments of the present disclosure can achieve the same processing effect based on a single Transformer model, greatly simplifying the network structure and making the Transformer model more lightweight. This can reduce the amount of computation, realize efficient and fast visual image processing, and facilitate deployment on different platforms.
Exemplarily, FIG. 2 is a structural block diagram of a Transformer model in an image processing method provided by Embodiment 1 of the present disclosure. Referring to FIG. 2, the Transformer model may sequentially contain network layers such as a layer normalization (Layer Normalization, LN) layer, a separable self-attention module (Separable Attention, Sep-Attn), another LN layer and a multi-layer perceptron (Multi-Layer Perceptron, MLP), and non-adjacent layers may be connected (in FIG. 2, non-adjacent layers may be connected by arrowed lines, and a plus sign in a circle at the junction may denote data fusion).
The input of the Transformer model may include the image to be processed, and the Transformer model may set multiple window feature data to be learned (denoted by win_tokens in the figure), with each window feature data to be learned corresponding to a window image of the image to be processed. For example, in FIG. 2 the size of the image to be processed may be 6 x 6 x C, and it may be divided into 4 window images, each of size 3 x 3 x C, where C denotes the number of channels. A window image in the image to be processed may correspond to the window feature data (i.e., a win_token) at the corresponding position in win_tokens; for example, the window image at the upper-right corner of the image to be processed may correspond to the win_token at the upper-right corner of win_tokens, and the size of each win_token may be 1 x 1 x C.
Each window image and its corresponding window feature data to be learned, after normalization by the LN layer, may enter the Sep-Attn module. Based on the Sep-Attn module, steps such as the pixel-dimension self-attention transformation (denoted by Depthwise self-attention in the figure) and the window-dimension self-attention transformation (denoted by Pointwise self-attention in the figure) in the embodiments of the present disclosure may be performed.
In some optional implementations, performing a pixel-dimension self-attention transformation on each combined data includes: for each combined data, performing a matrix transformation on the combined data to obtain a first query vector, a first key vector and a first value vector corresponding to the combined data; determining a first attention map according to the first query vector and the first key vector, and multiplying the first attention map by the first value vector.
Referring again to FIG. 2, the figure contains 4 combined data, each of which is formed by combining a window image and its corresponding win_token. In the figure, the window data contains data of 9 pixels, and the win_token is equivalent to data of 1 pixel, i.e., the combined data is equivalent to data of 10 pixels. The 4 combined data can each undergo a pixel-dimension self-attention transformation; the figure only shows the pixel-dimension self-attention transformation process of one of the combined data, and the transformation processes of the other combined data are analogous, denoted by "..." in the figure.
The pixel-dimension self-attention transformation process shown in FIG. 2 may include: performing matrix transformations on the combined data based on the self-attention mechanism (i.e., WQ1, WK1, WV1 in the figure) to obtain the corresponding first query vector Q1, first key vector K1 and first value vector V1. Multiplying the first query vector Q1 by the first key vector K1 (denoted in the figure by a multiplication sign in a circle; the same notation is used for multiplication below) can determine the first attention map. The first attention map is multiplied by the corresponding pixels in the first value vector V1 to change the pixel weights, yielding the combined data after the pixel-dimension self-attention transformation.
In the combined data after the pixel-dimension self-attention transformation, the data corresponding to the original window image can constitute the intra-window information exchange image, and the data corresponding to the original window feature data to be learned can be called the learned window feature data.
In some optional implementations, determining influence weights between the multiple intra-window information exchange images according to the multiple learned window feature data includes: splicing the multiple learned window feature data to obtain spliced data; performing layer normalization and activation processing on the spliced data, followed by matrix transformations, to obtain a second query vector and a second key vector; determining a second attention map according to the second query vector and the second key vector, and determining the influence weights between the multiple intra-window information exchange images according to the second attention map.
Referring again to FIG. 2, the multiple learned win_tokens can be spliced into spliced data (again denoted by win_tokens in the figure, but at this point the win_tokens can characterize the global information of the corresponding windows). The win_tokens can undergo layer normalization and activation processing through an LN layer and an activation layer (Activation, Act) to adjust the win_tokens and make them more reasonable. After layer normalization and activation processing, matrix transformations based on the self-attention mechanism (i.e., WQ2, WK2 in the figure) can be performed to obtain the second query vector Q2 and the second key vector K2. Multiplying the second query vector Q2 by the second key vector K2 can determine the second attention map.
Exemplarily, the win_tokens in FIG. 2 contain 4 win_tokens, and each win_token can characterize the information of the corresponding window. The height x width of the correspondingly determined second attention map may be 4 x 4, and each element in the second attention map may represent the influence weight of the window of its column on the window of its row. It can be considered that the influence weights between the multiple intra-window information exchange images can be looked up from the second attention map.
In some optional implementations, performing a window-dimension self-attention transformation on the multiple intra-window information exchange images respectively according to the influence weights includes: for each intra-window information exchange image, determining, according to the influence weights, the target influence weights of the other windows' information exchange images on this window's information exchange image; multiplying the other windows' information exchange images by the corresponding target influence weights, and incorporating the multiplication results into this window's information exchange image.
Referring again to FIG. 2, the figure contains 4 intra-window information exchange images. For each intra-window information exchange image, the image itself can be called this window's information exchange image, and the intra-window information exchange images other than itself can be called the other windows' information exchange images. FIG. 2 only shows the window-dimension self-attention transformation process of one of the intra-window information exchange images; the transformation processes of the other intra-window information exchange images are analogous, denoted by dashed arrows in the figure.
The window-dimension self-attention transformation process shown in FIG. 2 may include: selecting from the second attention map the row elements corresponding to this window's information exchange image, where each element in the row can represent the target influence weight of the window of its column on the current window.
To facilitate fusing the information of the other windows' information exchange images with this window's information exchange image, every element in the row can be taken as a target influence weight; that is, in addition to the influence weights of the other windows on this window, the target influence weights can also include the influence weight of this window on itself. In this case, multiplying the other windows' information exchange images by the corresponding target influence weights and incorporating the multiplication results into this window's information exchange image may include: multiplying the multiple intra-window information exchange images by the corresponding target influence weights and summing them. By performing the above window-dimension transformation on each intra-window information exchange image, all inter-window information fusion images can be obtained.
In some optional implementations, after obtaining the multiple combined data, the method further includes: grouping and merging the multiple combined data to obtain merged window data; correspondingly, performing a pixel-dimension self-attention transformation on each of the multiple combined data includes: performing a pixel-dimension self-attention transformation on the merged window data.
Referring again to FIG. 2, if each combined data is regarded as the data of one channel, the combined data of adjacent channels can be merged to obtain merged window data. In addition, FIG. 3 is a schematic diagram of grouping and merging windows in an image processing method provided by Embodiment 1 of the present disclosure. Referring to FIG. 3, assuming there are 16 window images before grouping, these 16 window images can also be grouped and merged; for example, every 4 adjacent window images can be grouped and merged into one large window, yielding 4 large-window images. Each large-window image can then be combined with the corresponding window feature data to be learned, to obtain merged window data.
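The grouping and merging of FIG. 3 (16 windows merged into 4 large windows) can be sketched as follows. The row-major window ordering and the 2 x 2 neighborhood layout are assumptions about the figure, and the function name is illustrative:

```python
import numpy as np

def merge_windows(windows, grid, group=2):
    """Group and merge adjacent windows into large windows.

    `windows` holds grid*grid window images in row-major order; every
    `group` x `group` neighborhood is merged into one large window, so
    16 windows become 4 large windows. Assumes grid is divisible by
    `group`.
    """
    ws = windows.shape[1]
    C = windows.shape[-1]
    g = grid // group
    # restore the full window grid as one image, then regroup neighborhoods
    grid_img = (
        windows.reshape(grid, grid, ws, ws, C)
        .transpose(0, 2, 1, 3, 4)
        .reshape(grid * ws, grid * ws, C)
    )
    big = ws * group
    merged = (
        grid_img.reshape(g, big, g, big, C)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, big, big, C)
    )
    return merged

wins = np.arange(16 * 2 * 2 * 1).reshape(16, 2, 2, 1).astype(float)
big_wins = merge_windows(wins, grid=4, group=2)
```

Each large window then combines with its own window feature data to be learned, so the subsequent pixel-dimension self-attention can capture longer-distance dependencies across the original windows.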
Merged window data can be obtained by either of the above two merging methods. Then, the multiple merged window data can each undergo a pixel-dimension self-attention transformation to obtain multiple merged-window information exchange images and multiple learned window feature data of the merged windows. Referring to FIG. 3, in the pixel-dimension self-attention transformation process, the pixel at the dot in the window/large-window image can exchange information with the other pixels in the window/large window.
Furthermore, the influence weights between the multiple merged-window information exchange images can be determined according to the learned window feature data of the merged windows. Referring to FIG. 3, if the window feature data at the dot is taken as the window feature data corresponding to the current window, the target influence weights of the other windows on the current window can be determined.
Finally, a window-dimension self-attention transformation can be performed on the multiple merged-window information exchange images according to the influence weights, to obtain inter-merged-window information fusion images. In FIG. 3, the pixel at the dot in the window/large-window image, after the pixel-dimension self-attention transformation and the window-dimension self-attention transformation, yields the corresponding self-attention-transformed pixel. By grouping and merging the combined data, the self-attention-transformed pixels can capture long-distance visual dependencies across multiple windows, which can improve model performance to a certain extent.
In the technical solution of the embodiments of the present disclosure, multiple window images of an image to be processed are combined with window feature data to be learned respectively corresponding to the multiple window images, to obtain multiple combined data; for each combined data, a pixel-dimension self-attention transformation is performed on the combined data to obtain an intra-window information exchange image and learned window feature data; according to the multiple learned window feature data respectively corresponding to the multiple combined data, the influence weights between the multiple intra-window information exchange images respectively corresponding to the multiple combined data are determined; and according to the influence weights, a window-dimension self-attention transformation is performed on the multiple intra-window information exchange images respectively, to obtain inter-window information fusion images.
By performing the pixel-dimension self-attention transformation, information exchange within a window can be realized. Moreover, by embedding the window feature data to be learned into the window image and transforming them together during the pixel-dimension self-attention transformation, the window feature data can learn the information of all pixels in the window, so that the learned window feature data can fully characterize the information of the corresponding window. Furthermore, through the window feature data, the influence weights between the intra-window information exchange images can be modeled, so that the attention relationships between windows can be modeled with very little computational overhead. Finally, according to the influence weights between the intra-window information exchange images, the information exchange images of other windows can be incorporated into each window's information exchange image, thereby realizing global information exchange.
Compared with the related art, which realizes intra-window information exchange and inter-window information exchange based on two Transformer models respectively, the technical solution of the embodiments of the present disclosure can achieve the same processing effect based on a single Transformer model, greatly simplifying the network structure and making the Transformer model more lightweight. This can reduce the amount of computation, realize efficient and fast visual image processing, and facilitate deployment on different platforms.
实施例二
本公开实施例与上述实施例中所提供的图像处理方法中各个可选方案可以结合。本实施例所提供的图像处理方法,可应用于下述至少一种图像处理网络:图像分类网络、图像分割网络和图像检测网络,能够实现在图像分类、语义分割和目标检测等各种视觉任务上达到领域内较为先进水平的同时,使模型更加轻量化。
并且,图像处理网络可以采用多阶段的分层结构,每个阶段可以包括一个下采样层和本公开实施例提供的Transformer模型,从而可以提取原始图像的目标特征图像。此外,还可以通过Transformer模型根据历史中已处理图像对应的窗口特征数据,对当前处理的原始图像的各窗口图像配置对应的待学习的窗口特征数据,从而可以实现窗口特征数据的迭代学习优化。
图4为本公开实施例二所提供的一种图像处理方法的流程示意图。如图4 所示,本实施例提供的图像处理方法,应用于下述至少一种图像处理网络:图像分类网络、图像分割网络和图像检测网络,可以包括:
S410、对输入的原始图像进行下采样,得到待处理图像。
本实施例中,原始图像可以为需要进行图像分类、语义分割或目标检测的图像,该原始图像可以为实时采集的图像,也可以为预先存储的图像。通过相关的下采样方式,可以得到原始图像更高语义的特征图像,可以将该特征图像作为待处理图像。
通过Transformer模型执行下述S420-S480步骤:
S420、利用滑动窗口在待处理图像上滑动,得到互不重叠的多个窗口图像。
S430、根据与已处理图像对应的学习完成的窗口特征数据,确定与每个窗口图像对应的待学习的窗口特征数据。
其中,Transformer模型可以根据与已处理图像对应的学习完成的窗口特征数据,来设置当前需要处理的待处理图像中每个窗口图像对应的待学习的窗口特征数据,从而可以使窗口特征数据不断学习更新。优化更新后的窗口特征数据更具备合理性,从而可以加快像素维度的自注意力变换效率。
S440: Combine the multiple window images of the image to be processed with the to-be-learned window feature data respectively corresponding to the multiple window images, to obtain multiple pieces of combined data.
S450: For each piece of combined data, perform pixel-dimension self-attention transformation on the combined data to obtain the intra-window information exchange image corresponding to the combined data and the learned window feature data corresponding to the combined data.
S460: Determine the influence weights among the multiple intra-window information exchange images respectively corresponding to the multiple pieces of combined data, according to the multiple pieces of learned window feature data respectively corresponding to the multiple pieces of combined data.
S470: Perform window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights, to obtain an inter-window information fusion image.
S480: Determine an attention-transformed image according to the inter-window information fusion image.
Referring again to FIG. 2, after the inter-window information fusion image is obtained, the inter-window information fusion images may be stitched according to the positions of the corresponding original window images to obtain a stitched image. The stitched image may then be processed by the other network layers of the Transformer model (e.g., an LN layer and an MLP layer), and the processing result output by the Transformer model may be called the attention-transformed image.
S490: Downsample the attention-transformed image to obtain a new image to be processed, until the attention-transformed image is the target feature image.
The attention-transformed image can be regarded as the downsampled feature image after self-attention transformation. Performing self-attention transformation on the feature image helps improve the execution accuracy of vision tasks.
When executing a vision task, it is usually necessary to determine multi-level feature images so as to combine the prediction results of feature images at different levels. Therefore, in this embodiment, after the attention-transformed image is obtained, a further downsampling operation may be performed to obtain a higher-level feature image, and this higher-level feature image may be taken as a new image to be processed for executing steps S420-S480 again. This loop continues until the feature image of the required level is determined; the feature image of the required level then undergoes self-attention transformation, and the corresponding attention-transformed image is taken as the target feature image.
Illustratively, FIG. 5 is a structural block diagram of an image processing network in the image processing method provided in Embodiment 2 of the present disclosure. Referring to FIG. 5, the image processing network may adopt a multi-stage hierarchical structure; four stages are shown in the figure. The hierarchical structure of each stage may include a downsampling layer (denoted Patch Merging in the figure) and the Transformer model provided by the embodiments of the present disclosure (denoted Sep-ViT Block in the figure). Each time the original image (denoted Image in the figure) passes through one stage, its image size changes. For example, the original image size is H×W×3, and the downsampled image in each of the first through fourth stages has a progressively smaller spatial size (the specific sizes per stage are given in FIG. 5). The image output by the fourth stage undergoes subsequent processing to obtain the target feature image, and the corresponding vision task is executed according to the target feature image.
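The stage-by-stage size reduction described above can be sketched as a loop of patch-merging downsamples, with the Sep-ViT attention block stubbed out as a placeholder (the halving-per-stage schedule and channel growth are assumptions for illustration; the actual per-stage sizes are those given in FIG. 5):

```python
import numpy as np

def run_stages(image, num_stages=4):
    """Hedged sketch of the hierarchical pipeline: each stage downsamples
    (Patch Merging) and would then apply the Sep-ViT block; the attention
    step is left as a comment since it is sketched elsewhere."""
    x = image
    sizes = []
    for _ in range(num_stages):
        H, W, C = x.shape
        x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
        x = x.reshape(H // 2, W // 2, 4 * C)   # patch merging
        # ... Sep-ViT block (window attention) would transform x here ...
        sizes.append(x.shape)
    return x, sizes

out, sizes = run_stages(np.random.rand(64, 64, 3))
```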
The technical solution of the embodiments of the present disclosure may be applied to at least one of the following image processing networks: an image classification network, an image segmentation network, and an image detection network, and can achieve a fairly advanced level in the field on various vision tasks such as image classification, semantic segmentation, and object detection while making the model more lightweight.
Moreover, the image processing network may adopt a multi-stage hierarchical structure, where each stage may include a downsampling layer and the Transformer model provided by the embodiments of the present disclosure, so that a target feature image of the original image can be extracted. In addition, the Transformer model may configure the to-be-learned window feature data for each window image of the currently processed original image according to the window feature data corresponding to previously processed images, thereby realizing iterative learning and optimization of the window feature data.
The image processing method provided in this embodiment of the present disclosure belongs to the same disclosed concept as the image processing method provided in the above embodiment. For technical details not exhaustively described in this embodiment, reference may be made to the above embodiment, and the same technical features have the same beneficial effects in this embodiment and the above embodiment.
Embodiment 3
FIG. 6 is a schematic structural diagram of an image processing apparatus provided in Embodiment 3 of the present disclosure. The image processing apparatus provided in this embodiment is suitable for performing intra-window information exchange and inter-window information exchange on an image to be processed through a window-based Transformer model.
As shown in FIG. 6, the image processing apparatus provided by the embodiments of the present disclosure may include:
a Transformer model, configured to perform the functions of the following modules:
a combination module 610, configured to combine the multiple window images of an image to be processed with the to-be-learned window feature data respectively corresponding to each window image, to obtain multiple pieces of combined data;
a first self-attention module 620, configured to, for each piece of combined data, perform pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data;
a weight determination module 630, configured to determine influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data, where the multiple pieces of learned window feature data are respectively obtained from the multiple pieces of combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple pieces of combined data; and
a second self-attention module 640, configured to perform window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights, to obtain an inter-window information fusion image.
In some optional implementations, the first self-attention module 620 may be configured to perform, for each piece of combined data, pixel-dimension self-attention transformation on the combined data in the following manner:
for each piece of combined data, performing matrix transformation on the combined data to obtain a first query vector, a first key vector, and a first value vector corresponding to the combined data; and
determining a first attention map according to the first query vector and the first key vector, and multiplying the first attention map by the first value vector.
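The first self-attention step — matrix-transform to query/key/value, form the attention map, multiply by the value — can be sketched as follows. The softmax normalization and scale factor are standard-attention assumptions not spelled out in the embodiment:

```python
import numpy as np

def pixel_attention(combined, Wq, Wk, Wv):
    """First self-attention step on one piece of combined data.

    combined: (T, C) rows for the window's pixels plus its window token.
    Returns the transformed rows and the first attention map."""
    q, k, v = combined @ Wq, combined @ Wk, combined @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # first attention map
    return attn @ v, attn

rng = np.random.default_rng(1)
combined = rng.normal(size=(50, 32))   # e.g. 49 pixels + 1 window token
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
out, attn = pixel_attention(combined, Wq, Wk, Wv)
```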
In some optional implementations, the weight determination module 630 may be configured to determine the influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data in the following manner:
concatenating the multiple pieces of learned window feature data to obtain concatenated data;
performing layer normalization and activation on the concatenated data and then performing matrix transformation to obtain a second query vector and a second key vector; and
determining a second attention map according to the second query vector and the second key vector, and determining the influence weights among the multiple intra-window information exchange images according to the second attention map.
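A sketch of this weight-determination path: stack the learned tokens, layer-normalize, activate (GELU is an assumption — the embodiment only says "activation"), project to a second query/key, and read the influence weights off the second attention map:

```python
import numpy as np

def window_weights(tokens, Wq2, Wk2, eps=1e-6):
    """Influence weights among windows from their learned feature tokens."""
    x = np.stack(tokens)                                # concatenate: (N, C)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x = (x - mu) / np.sqrt(var + eps)                   # layer normalization
    x = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))  # GELU
    q2, k2 = x @ Wq2, x @ Wk2
    scores = q2 @ k2.T / np.sqrt(k2.shape[-1])          # second attention map
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)            # (N, N) weights

rng = np.random.default_rng(2)
tokens = [rng.normal(size=16) for _ in range(4)]
Wq2, Wk2 = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
weights = window_weights(tokens, Wq2, Wk2)
```

Because only one token per window enters this attention, the cost scales with the number of windows, not the number of pixels.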
In some optional implementations, the second self-attention module 640 may be configured to perform window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights in the following manner:
for each intra-window information exchange image, determining, according to the influence weights, the target influence weights of the other intra-window information exchange images on the current intra-window information exchange image; and
multiplying the other intra-window information exchange images by the corresponding target influence weights, and fusing the multiplication results into the current intra-window information exchange image.
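The fusion step can be sketched literally: weight the other windows' exchange images and fold the products into the current one. Keeping the current window at unit weight (rather than using its own diagonal weight) is an assumption of this sketch:

```python
import numpy as np

def fuse_windows(intra_images, weights):
    """For each window, add the other windows' intra-window exchange images
    scaled by their target influence weights."""
    N = len(intra_images)
    fused = []
    for i in range(N):
        out = intra_images[i].copy()
        for j in range(N):
            if j != i:                       # other windows only
                out = out + weights[i, j] * intra_images[j]
        fused.append(out)
    return np.stack(fused)

intra = np.stack([np.ones((7, 7, 8)), 2 * np.ones((7, 7, 8))])
weights = np.array([[0.5, 0.5], [0.5, 0.5]])
fused = fuse_windows(intra, weights)
```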
In some optional implementations, the Transformer model may further include:
a merging module, configured to, after the multiple pieces of combined data are obtained, group and merge the multiple pieces of combined data to obtain multiple pieces of merged-window data;
correspondingly, the first self-attention module 620 may be configured to: for each piece of merged-window data, perform pixel-dimension self-attention transformation on the merged-window data.
In some optional implementations, the image processing apparatus may be applied to at least one of the following image processing networks: an image classification network, an image segmentation network, and an image detection network.
In some optional implementations, the image processing apparatus may further include:
a downsampling layer, configured to downsample an input original image to obtain the image to be processed before the Transformer model combines the multiple window images of the image to be processed with the to-be-learned window feature data respectively corresponding to the multiple window images;
the Transformer model slides a sliding window over the image to be processed to obtain the multiple non-overlapping window images; and
determines, according to the learned window feature data corresponding to processed images, the to-be-learned window feature data respectively corresponding to the multiple window images.
In some optional implementations, the Transformer model is further configured to, after the inter-window information fusion image is obtained by the Transformer model, determine an attention-transformed image according to the inter-window information fusion image; and
the downsampling layer downsamples the attention-transformed image to obtain a new image to be processed, until the attention-transformed image is the target feature image.
The image processing apparatus provided by the embodiments of the present disclosure can execute the image processing method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.
It is worth noting that the units and modules included in the above apparatus are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of mutual distinction and are not intended to limit the protection scope of the embodiments of the present disclosure.
Embodiment 4
Referring now to FIG. 7, a schematic structural diagram of an electronic device (e.g., the terminal device or server in FIG. 7) 700 suitable for implementing the embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include mobile terminals such as a mobile phone, a laptop, a digital broadcast receiver, a personal digital assistant (PDA), a tablet (Portable Android Device, PAD), a portable media player (PMP), and a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), as well as fixed terminals such as a digital television (TV) and a desktop computer. The electronic device shown in FIG. 7 is merely an example.
As shown in FIG. 7, the electronic device 700 may include a processing apparatus (e.g., a central processing unit, a graphics processing unit) 701, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage apparatus 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing apparatus 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following apparatuses may be connected to the I/O interface 705: an input apparatus 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output apparatus 707 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 708 including, for example, a magnetic tape and a hard disk; and a communication apparatus 709. The communication apparatus 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 7 shows the electronic device 700 with various apparatuses, it should be understood that it is not required to implement or have all the apparatuses shown. More or fewer apparatuses may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 709, or installed from the storage apparatus 708, or installed from the ROM 702. When the computer program is executed by the processing apparatus 701, the above functions of the image processing method of the embodiments of the present disclosure are executed.
The electronic device provided by the embodiments of the present disclosure belongs to the same disclosed concept as the image processing method provided by the above embodiments. For technical details not exhaustively described in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
Embodiment 5
The embodiments of the present disclosure provide a computer storage medium on which a computer program is stored; when the program is executed by a processor, the image processing method provided by the above embodiments is implemented.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. The computer-readable storage medium may include, but is not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory (EPROM) or flash memory (FLASH), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate, or transmit a program used by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including: a wire, an optical cable, radio frequency (RF), etc., or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as the HyperText Transfer Protocol (HTTP), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or may exist alone without being assembled into the electronic device.
The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to:
perform the following steps through a Transformer model: combining multiple window images of an image to be processed with to-be-learned window feature data respectively corresponding to the multiple window images to obtain multiple pieces of combined data; for each piece of combined data, performing pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data; determining influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data; and performing window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights, to obtain an inter-window information fusion image.
The computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on a user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented in software or hardware. The names of the units and modules do not, in some cases, constitute a limitation on the units or modules themselves.
The functions described above herein may be executed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include: a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. The machine-readable storage medium may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, [Example 1] provides an image processing method, the method including:
performing the following steps through a Transformer model:
combining multiple window images of an image to be processed with to-be-learned window feature data respectively corresponding to the multiple window images, to obtain multiple pieces of combined data;
for each piece of combined data, performing pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data;
determining influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data, where the multiple pieces of learned window feature data are respectively obtained from the multiple pieces of combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple pieces of combined data; and
performing window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights, to obtain an inter-window information fusion image.
According to one or more embodiments of the present disclosure, [Example 2] provides an image processing method, further including:
In some optional implementations, for each piece of combined data, performing pixel-dimension self-attention transformation on the combined data includes:
for each piece of combined data, performing matrix transformation on the combined data to obtain a first query vector, a first key vector, and a first value vector corresponding to the combined data; and
determining a first attention map according to the first query vector and the first key vector, and multiplying the first attention map by the first value vector.
According to one or more embodiments of the present disclosure, [Example 3] provides an image processing method, further including:
In some optional implementations, determining the influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data includes:
concatenating the multiple pieces of learned window feature data to obtain concatenated data;
performing layer normalization and activation on the concatenated data and then performing matrix transformation to obtain a second query vector and a second key vector; and
determining a second attention map according to the second query vector and the second key vector, and determining the influence weights among the multiple intra-window information exchange images according to the second attention map.
According to one or more embodiments of the present disclosure, [Example 4] provides an image processing method, further including:
In some optional implementations, performing window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights includes:
for each intra-window information exchange image, determining, according to the influence weights, target influence weights of the other intra-window information exchange images on the current intra-window information exchange image; and
multiplying the other intra-window information exchange images by the corresponding target influence weights, and fusing the multiplication results into the current intra-window information exchange image.
According to one or more embodiments of the present disclosure, [Example 5] provides an image processing method, further including:
In some optional implementations, after the multiple pieces of combined data are obtained, the method further includes: grouping and merging the multiple pieces of combined data to obtain multiple pieces of merged-window data;
where performing pixel-dimension self-attention transformation on the combined data for each piece of combined data includes: for each piece of merged-window data, performing pixel-dimension self-attention transformation on the merged-window data.
According to one or more embodiments of the present disclosure, [Example 6] provides an image processing method, further including:
In some optional implementations, the method is applied to at least one of the following image processing networks: an image classification network, an image segmentation network, and an image detection network.
According to one or more embodiments of the present disclosure, [Example 7] provides an image processing method, further including:
In some optional implementations, before the Transformer model combines the multiple window images of the image to be processed with the to-be-learned window feature data corresponding to each window image, the method further includes:
downsampling an input original image to obtain the image to be processed;
sliding, by the Transformer model, a sliding window over the image to be processed to obtain the multiple non-overlapping window images; and
determining, according to learned window feature data corresponding to processed images, the to-be-learned window feature data respectively corresponding to the multiple window images.
According to one or more embodiments of the present disclosure, [Example 8] provides an image processing method, further including:
In some optional implementations, after the inter-window information fusion image is obtained by the Transformer model, the method further includes:
determining, by the Transformer model, an attention-transformed image according to the inter-window information fusion image; and
downsampling the attention-transformed image to obtain a new image to be processed, until the attention-transformed image is the target feature image.

Claims (11)

  1. An image processing method, comprising:
    performing the following steps through a Transformer model:
    combining multiple window images of an image to be processed with to-be-learned window feature data respectively corresponding to the multiple window images, to obtain multiple pieces of combined data;
    for each piece of combined data, performing pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data;
    determining influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data, wherein the multiple pieces of learned window feature data are respectively obtained from the multiple pieces of combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple pieces of combined data; and
    performing window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights, to obtain an inter-window information fusion image.
  2. The method according to claim 1, wherein, for each piece of combined data, performing pixel-dimension self-attention transformation on the combined data comprises:
    for each piece of combined data, performing matrix transformation on the combined data to obtain a first query vector, a first key vector, and a first value vector corresponding to the combined data; and
    determining a first attention map according to the first query vector and the first key vector, and multiplying the first attention map by the first value vector.
  3. The method according to claim 1, wherein determining the influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data comprises:
    concatenating the multiple pieces of learned window feature data to obtain concatenated data;
    performing layer normalization and activation on the concatenated data and then performing matrix transformation to obtain a second query vector and a second key vector; and
    determining a second attention map according to the second query vector and the second key vector, and determining the influence weights among the multiple intra-window information exchange images according to the second attention map.
  4. The method according to claim 1, wherein performing window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights comprises:
    for each intra-window information exchange image, determining, according to the influence weights, target influence weights of the other intra-window information exchange images on the current intra-window information exchange image; and
    multiplying the other intra-window information exchange images by the corresponding target influence weights, and fusing the multiplication results into the current intra-window information exchange image.
  5. The method according to claim 1, further comprising, after the multiple pieces of combined data are obtained: grouping and merging the multiple pieces of combined data to obtain multiple pieces of merged-window data;
    wherein, for each piece of combined data, performing pixel-dimension self-attention transformation on the combined data comprises:
    for each piece of merged-window data, performing pixel-dimension self-attention transformation on the merged-window data.
  6. The method according to claim 1, wherein the method is applied to at least one of the following image processing networks: an image classification network, an image segmentation network, and an image detection network.
  7. The method according to claim 6, further comprising, before the Transformer model combines the multiple window images of the image to be processed with the to-be-learned window feature data respectively corresponding to the multiple window images:
    downsampling an input original image to obtain the image to be processed;
    sliding, by the Transformer model, a sliding window over the image to be processed to obtain the multiple non-overlapping window images; and
    determining, according to learned window feature data corresponding to processed images, the to-be-learned window feature data respectively corresponding to the multiple window images.
  8. The method according to claim 6, further comprising, after the inter-window information fusion image is obtained by the Transformer model:
    determining, by the Transformer model, an attention-transformed image according to the inter-window information fusion image; and
    downsampling the attention-transformed image to obtain a new image to be processed, until the attention-transformed image is the target feature image.
  9. An image processing apparatus, comprising:
    a Transformer model, configured to perform the functions of the following modules:
    a combination module, configured to combine multiple window images of an image to be processed with to-be-learned window feature data respectively corresponding to the multiple window images, to obtain multiple pieces of combined data;
    a first self-attention module, configured to, for each piece of combined data, perform pixel-dimension self-attention transformation on the combined data to obtain an intra-window information exchange image and learned window feature data;
    a weight determination module, configured to determine influence weights among the multiple intra-window information exchange images according to the multiple pieces of learned window feature data, wherein the multiple pieces of learned window feature data are respectively obtained from the multiple pieces of combined data, and the multiple intra-window information exchange images are respectively obtained from the multiple pieces of combined data; and
    a second self-attention module, configured to perform window-dimension self-attention transformation on the multiple intra-window information exchange images separately according to the influence weights, to obtain an inter-window information fusion image.
  10. An electronic device, comprising:
    one or more processors; and
    a storage apparatus, configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image processing method according to any one of claims 1-8.
  11. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to execute the image processing method according to any one of claims 1-8.
PCT/CN2023/081573 2022-03-24 2023-03-15 Image processing method and apparatus, electronic device, and storage medium WO2023179420A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210301372.2A CN115953654A (zh) 2022-03-24 2022-03-24 Image processing method and apparatus, electronic device, and storage medium
CN202210301372.2 2022-03-24

Publications (1)

Publication Number Publication Date
WO2023179420A1 true WO2023179420A1 (zh) 2023-09-28

Family

ID=87288153

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081573 WO2023179420A1 (zh) 2022-03-24 2023-03-15 一种图像处理方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN115953654A (zh)
WO (1) WO2023179420A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065576A (zh) * 2021-02-26 2021-07-02 Huawei Technologies Co., Ltd. Feature extraction method and apparatus
CN113706642A (zh) * 2021-08-31 2021-11-26 Beijing Sankuai Online Technology Co., Ltd. Image processing method and apparatus
CN113870258A (zh) * 2021-12-01 2021-12-31 Zhejiang University Label-free pancreas image automatic segmentation system based on adversarial learning
CN113901904A (zh) * 2021-09-29 2022-01-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Image processing method, face recognition model training method, apparatus, and device


Also Published As

Publication number Publication date
CN115953654A (zh) 2023-04-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23773668

Country of ref document: EP

Kind code of ref document: A1