CN114332574A - Image processing method, device, equipment and storage medium - Google Patents


Publication number
CN114332574A
Authority
CN
China
Prior art keywords
features
feature
processing
sampling
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111601437.7A
Other languages
Chinese (zh)
Inventor
卢东焕
何楠君
魏东
宁慕楠
马锴
郑冶枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Publication of CN114332574A

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to an image processing method, an image processing device, image processing equipment, and a storage medium, and relates to the technical field of artificial intelligence. The method comprises the following steps: carrying out n-level down-sampling processing on the initial features of the target image to obtain n down-sampling features; acquiring an attention map of the target image based on a target down-sampling feature of the n down-sampling features, the target down-sampling feature being obtained by the last-stage down-sampling processing; processing the initial feature and the n down-sampling features respectively based on the attention map to obtain attention processing features corresponding to the initial feature and the n down-sampling features respectively; and acquiring, based on the attention processing features, a processing result of performing a specified image processing task on the target image. The method and the device can ensure the accuracy of image feature extraction and further improve the accuracy of the processing result of the image processing task.

Description

Image processing method, device, equipment and storage medium
The present application claims priority to Chinese patent application No. 202110877152.X, entitled "image processing method, apparatus, device and storage medium", filed on July 31, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
Image processing tasks such as image segmentation and image classification are widely applied in the field of Computer Vision (CV).
In the related art, when an image processing task is executed, an input target image may be divided into a plurality of image blocks, and the plurality of image blocks are sequentially input into a Transformer encoder for encoding processing, so as to extract image features of the target image; an image processing result is then obtained according to the image features obtained by encoding.
However, the above scheme requires partitioning the image into blocks, which may break the continuity of features at the edges of the image blocks, thereby affecting the accuracy of feature extraction and, in turn, the accuracy of the processing result for the target image.
Disclosure of Invention
The embodiment of the application provides an image processing method, device, equipment and storage medium, which can improve the accuracy of the image processing result. The technical scheme is as follows.
In one aspect, an image processing method is provided, and the method includes:
carrying out n-level down-sampling processing on the initial features of the target image to obtain n down-sampling features; n is a positive integer;
acquiring an attention map of the target image based on a target down-sampling feature of the n down-sampling features; the target downsampling feature is obtained by the last downsampling processing of the n-level downsampling processing;
processing the initial feature and the n down-sampling features respectively based on the attention map to obtain attention processing features corresponding to the initial feature and the n down-sampling features respectively;
and acquiring a processing result of executing a specified image processing task on the target image based on the attention processing features respectively corresponding to the initial features and the n down-sampling features.
In still another aspect, there is provided an image processing apparatus, the apparatus including:
the down-sampling module is used for carrying out n-level down-sampling processing on the initial features of the target image to obtain n down-sampling features;
an attention map acquisition module for acquiring an attention map of the target image based on a target down-sampling feature of the n down-sampling features; the target downsampling feature is obtained by the last downsampling processing of the n-level downsampling processing;
an attention processing module, configured to process the initial feature and the n downsampled features based on the attention map, to obtain attention processing features corresponding to the initial feature and the n downsampled features, respectively;
and the result acquisition module is used for acquiring a processing result of executing a specified image processing task on the target image based on the attention processing characteristics respectively corresponding to the initial characteristics and the n down-sampling characteristics.
In one possible implementation manner, the attention processing module 703 includes:
the up-sampling unit is used for up-sampling the attention diagram to obtain a first attention diagram with a first scale; the first scale is a scale of a first feature, the first feature being any one of the initial feature and the n down-sampled features;
and the processing unit is used for carrying out matrix multiplication processing on the first attention diagram and the first characteristic to obtain an attention processing characteristic corresponding to the first characteristic.
In one possible implementation, the attention map acquisition module is configured to,
acquiring the attention diagram based on the query dimensional features and the key dimensional features of the target downsampling features;
and the processing unit is used for carrying out matrix multiplication processing on the first attention diagram and the value dimension characteristic of the first characteristic to obtain an attention processing characteristic corresponding to the first characteristic.
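By way of illustration only (this sketch is not part of the disclosed embodiments; the nearest-neighbor upsampling, the random toy data, and all function names here are assumptions), the processing unit's operation — upsampling the attention map to the scale of the first feature and matrix-multiplying it with the value-dimension features — could look like:

```python
import numpy as np

def upsample_attention(att, factor):
    # Nearest-neighbor upsampling of a (P, P) attention map to (P*factor, P*factor).
    # The interpolation method is an assumption; the description only states that
    # the attention map is upsampled to the first scale.
    return att.repeat(factor, axis=0).repeat(factor, axis=1)

def apply_attention(att_small, value_feat, factor):
    # value_feat: (P * factor, C) value-dimension features of the first feature
    att = upsample_attention(att_small, factor)    # (P*f, P*f)
    att = att / att.sum(axis=1, keepdims=True)     # re-normalize each row
    return att @ value_feat                        # (P*f, C) attention processing feature

rng = np.random.default_rng(0)
att_small = rng.random((4, 4))    # attention map from a 4-pixel last-level feature
value_feat = rng.random((8, 3))   # value features of an 8-pixel finer feature
out = apply_attention(att_small, value_feat, factor=2)
```

Because each re-normalized row sums to one, every output row is a convex combination of value-feature rows, so the output stays within the range of the input features.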
In one possible implementation manner, the result obtaining module includes:
a fusion unit, configured to fuse attention processing features corresponding to the initial feature and the n downsampling features, respectively, to obtain an image feature of the target image;
a result acquisition unit configured to acquire the processing result based on the image feature.
In a possible implementation manner, the fusion unit is configured to,
the attention processing features corresponding to the n down-sampling features are up-sampled respectively to obtain n up-sampling features; the scale of the upsampled feature is the same as the scale of the initial feature;
and cascading the initial features and the n up-sampling features to obtain the image features.
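The fusion step above can be pictured with the following sketch (illustrative only; nearest-neighbor upsampling and the toy shapes are assumptions, as the description does not fix the upsampling operator):

```python
import numpy as np

def upsample2d(feat, factor):
    # Nearest-neighbor upsampling of an (H, W, C) feature map (assumed operator).
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse(initial, attention_feats):
    # Upsample each attention processing feature back to the initial scale,
    # then cascade (concatenate) everything along the channel axis.
    h = initial.shape[0]
    ups = [upsample2d(f, h // f.shape[0]) for f in attention_feats]
    return np.concatenate([initial] + ups, axis=-1)

initial = np.zeros((16, 16, 4))
down = [np.zeros((8, 8, 4)), np.zeros((4, 4, 4)), np.zeros((2, 2, 4))]
image_feature = fuse(initial, down)   # channels: 4 + 3 * 4 = 16
```

The concatenation preserves every scale's information, which matches the stated goal of reducing feature loss.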
In one possible implementation, a downsampling module is configured to,
performing down-sampling processing on the second characteristic to obtain an intermediate sampling characteristic; the second feature is any one of the initial feature and the n down-sampled features except for the target down-sampled feature;
and performing convolution processing on the intermediate sampling feature to obtain a next-stage down-sampling feature of the second feature.
In one possible implementation, a downsampling module is configured to,
performing convolution processing on the third characteristic to obtain an intermediate convolution characteristic; the third feature is any one of the initial feature and the n down-sampled features except for the target down-sampled feature;
and performing downsampling processing on the intermediate convolution characteristic to obtain a next-stage downsampling characteristic of the third characteristic.
In yet another aspect, a computer device is provided, which includes a processor and a memory, where at least one computer instruction is stored, and the at least one computer instruction is loaded and executed by the processor to implement the image processing method.
In yet another aspect, a computer-readable storage medium is provided, in which at least one computer instruction is stored, the at least one computer instruction being loaded and executed by a processor to implement the image processing method described above.
In yet another aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the image processing method.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the method comprises the steps of carrying out multi-stage down-sampling on initial features of a target image, calculating an attention map through features obtained by the last stage of down-sampling, and processing the initial features and all stages of down-sampling features through the attention map to extract the features of the target image.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a system configuration diagram of an image processing system according to various embodiments of the present application;
FIG. 2 is a flow diagram illustrating an image processing method according to an exemplary embodiment;
FIG. 3 is a diagram illustrating an image processing framework in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating an image processing method according to an exemplary embodiment;
FIG. 5 is a frame diagram of the DAB architecture to which the embodiment of FIG. 4 relates;
FIG. 6 is a block diagram of an image processing model according to the embodiment shown in FIG. 4;
fig. 7 is a block diagram showing a configuration of an image processing apparatus according to an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Before describing the various embodiments shown herein, several concepts related to the present application will be described.
Referring to fig. 1, a system configuration diagram of an image processing system according to various embodiments of the present application is shown. As shown in fig. 1, the system includes an image capture device 120, a terminal 140, and a server 160; optionally, the system may further include a database 180.
The image capture device 120 may be a camera or other image-capturing device for capturing images. For example, in the medical field, the image acquired by the image acquisition apparatus 120 may be a medical image containing blood vessels or biological tissues, such as a fundus image (containing blood vessels under the retina), a gastroscopic image, an enteroscopic image, an intra-oral image, and the like. Besides the medical field, the scheme of the embodiments of the application can also be applied to other fields, such as automatic driving, information searching, and the like.
The image capturing device 120 may include an image output interface, such as a Universal Serial Bus (USB) interface, a High-Definition Multimedia Interface (HDMI), or an Ethernet interface; alternatively, the image output interface may be a wireless interface, such as a Wireless Local Area Network (WLAN) interface, a Bluetooth interface, or the like.
Accordingly, according to the type of the image output interface, the operator may export the image captured by the image capturing device 120 in various ways, for example, importing the image to the terminal 140 through a wired or short-distance wireless manner, or importing the image to the terminal 140 or the server 160 through a local area network or the internet.
The terminal 140 may be a terminal device with certain processing capability and interface display function, for example, the terminal 140 may be a mobile phone, a tablet computer, an e-book reader, smart glasses, a laptop computer, a desktop computer, and the like.
The terminal 140 may include a terminal used by a developer or a user, for example, in the medical field, the terminal 140 may be a terminal used by a medical staff.
When the terminal 140 is implemented as a terminal used by a developer, the developer can develop a machine learning model for performing a specified image processing task on an image through the terminal 140 and deploy the machine learning model to the server 160 or the terminal used by the user.
When the terminal 140 is implemented as a terminal used by a user (such as a medical staff), an application program for acquiring and presenting a processing result of an image may be installed in the terminal 140, and after the terminal 140 acquires the image acquired by the image acquisition device 120, the terminal may acquire the processing result obtained by performing a specified image processing task on the image through the application program and present the processing result.
In the system shown in fig. 1, the terminal 140 and the image capture device 120 are physically separate physical devices. Optionally, in another possible implementation manner, when the terminal 140 is implemented as a terminal used by a user, the terminal 140 and the image capturing device 120 may also be integrated into a single entity device; for example, the terminal 140 may be a terminal device for image capturing function.
The server 160 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
For example, when the solution shown in the present application is applied to the medical field, it can be implemented as a part of the medical cloud. The medical cloud is a medical health service cloud platform created by combining cloud computing with medical technology on the basis of new technologies such as mobile technology, multimedia, wireless communication, big data, and the Internet of Things, achieving medical resource sharing and the expansion of medical coverage. By incorporating cloud computing technology, the medical cloud improves the efficiency of medical institutions and makes it more convenient for residents to seek medical care. For instance, the appointment registration, electronic medical records, and medical insurance of existing hospitals are all products combining cloud computing and the medical field; the medical cloud also has the advantages of data security, information sharing, dynamic expansion, and overall layout.
The server 160 may be a server that provides background services for an application installed in the terminal 140. The background server may perform version management of the application, perform background processing on images acquired by the application and return the processing results, perform background training of machine learning models developed by developers, and the like.
The database 180 may be a Redis database, or may be another type of database. The database 180 is used for storing various types of data.
Optionally, the terminal 140 and the server 160 are connected via a communication network. Optionally, the image capturing device 120 is connected to the server 160 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the system may further include a management device (not shown in fig. 1), which is connected to the server 160 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but can be any network, including but not limited to any combination of a LAN (Local Area Network), a MAN (Metropolitan Area Network), a WAN (Wide Area Network), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HTML (HyperText Markup Language), XML (Extensible Markup Language), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as SSL (Secure Sockets Layer), TLS (Transport Layer Security), VPN (Virtual Private Network), and IPsec (Internet Protocol Security). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
FIG. 2 is a flow diagram illustrating an image processing method according to an exemplary embodiment. The method may be performed by a computer device, for example, the computer device may be a server, or the computer device may also be a terminal, or the computer device may include a server and a terminal, where the server may be the server 160 in the embodiment shown in fig. 1 and the terminal may be the terminal 140 in the embodiment shown in fig. 1. As shown in fig. 2, the image processing method may include the following steps.
Step 201, performing n-level down-sampling processing on the initial features of the target image to obtain n down-sampling features; n is a positive integer.
Each level of the n-level down-sampling is performed on the features output by the previous level.
For example, the first-level downsampling refers to downsampling the initial feature to obtain a downsampling feature corresponding to the first-level downsampling; the second-level down-sampling refers to performing down-sampling processing on the down-sampling feature corresponding to the first-level down-sampling to obtain the down-sampling feature corresponding to the second-level down-sampling, and so on.
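The cascade above can be sketched as follows (illustrative only: 2×2 average pooling is one plausible down-sampling operator, not the one mandated by the embodiments, and all names and shapes are assumptions):

```python
import numpy as np

def avg_pool2x2(feat):
    # 2x2 average pooling on an (H, W, C) feature map (assumed operator).
    h, w, c = feat.shape
    return feat.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def cascade_downsample(initial, n):
    # Level i is computed from the output of level i-1, as described above.
    feats, cur = [], initial
    for _ in range(n):
        cur = avg_pool2x2(cur)
        feats.append(cur)
    return feats

initial = np.arange(16 * 16 * 3, dtype=float).reshape(16, 16, 3)
feats = cascade_downsample(initial, n=3)
# scales: (8, 8, 3), (4, 4, 3), (2, 2, 3) -- each level halves length and width
```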
Step 202, acquiring an attention diagram of a target image based on a target downsampling feature in n downsampling features; the target down-sampling feature is obtained by the last down-sampling processing of the n-level down-sampling processing.
In the embodiment of the application, the computer device can calculate the attention map of the target image through the down-sampling feature obtained by the down-sampling of the last stage (namely, the target down-sampling feature), so that the calculation amount of the attention map for calculating the whole target image is reduced, and the calculation efficiency is improved.
Step 203, based on the attention map, processes the initial feature and the n down-sampling features respectively, and obtains attention processing features corresponding to the initial feature and the n down-sampling features respectively.
In the embodiment of the application, since the attention map is calculated by the down-sampling feature obtained by the down-sampling of the last stage, in order to extract the feature of the whole target image as completely as possible, the initial feature and the down-sampling features of each stage are respectively processed by the attention map, so that the feature loss in the feature extraction process is reduced.
And 204, acquiring a processing result of executing a specified image processing task on the target image based on the attention processing features respectively corresponding to the initial features and the n down-sampling features.
In this embodiment of the application, after the computer device obtains the attention processing features corresponding to the initial feature and the n downsampling features, a specific image processing task may be executed on the target image based on the attention processing features corresponding to the initial feature and the n downsampling features, so as to obtain a processing result.
In summary, according to the scheme shown in the embodiment of the present application, the initial features of the target image are subjected to multi-stage down-sampling, the attention map is calculated from the features obtained by the last stage of down-sampling, and then the initial features and the down-sampling features of each stage are processed through the attention map to extract the features of the target image.
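An end-to-end sketch of steps 201 to 204 follows. It is illustrative only and heavily simplified: average pooling stands in for the learned down-sampling layers, the query/key/value convolutional projections are omitted (identity maps), and all function names are assumptions.

```python
import numpy as np

def pool(f):
    # 2x2 average pooling (assumed stand-in for the down-sampling layers)
    h, w, c = f.shape
    return f.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dab_forward(initial, n=2):
    # step 201: n-level down-sampling, each level fed by the previous one
    feats = [initial]
    for _ in range(n):
        feats.append(pool(feats[-1]))
    # step 202: attention map from the last-level feature only
    y = feats[-1].reshape(-1, feats[-1].shape[-1])    # (P, C)
    att = softmax(y @ y.T / np.sqrt(y.shape[-1]))     # (P, P)
    # step 203: apply the (upsampled) attention map at every scale
    outs = []
    for f in feats:
        v = f.reshape(-1, f.shape[-1])                # (Pi, C)
        factor = v.shape[0] // att.shape[0]
        a = att.repeat(factor, axis=0).repeat(factor, axis=1)
        a = a / a.sum(axis=1, keepdims=True)
        outs.append((a @ v).reshape(f.shape))
    return outs                                       # inputs to step 204

feats_out = dab_forward(np.ones((8, 8, 3)), n=2)
```

Step 204 would then fuse these per-scale attention processing features and feed them to the task head (e.g., a classifier).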
In a possible implementation manner, the scheme shown in the embodiments of the present application can be implemented based on an AI (Artificial Intelligence) technology, and can implement any type of image processing task in the field of computer vision. That is, the steps in the embodiment shown in fig. 2 can implement the specified image processing task by the pre-trained image processing model. For example, the designated image processing tasks may include, but are not limited to, image classification, image segmentation, object detection, and the like.
The AI is a theory, method, technique and application system that simulates, extends and expands human intelligence, senses the environment, acquires knowledge and uses the knowledge to obtain the best results using a digital computer or a machine controlled by a digital computer. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer vision is a science that studies how to make a machine "see"; more specifically, it uses a camera and a computer in place of human eyes to perform machine vision tasks such as identifying and measuring a target, and performs further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
For example, take an image classification task in the medical field (e.g., identifying whether a tissue or organ in a medical image is normal). Referring to fig. 3, which illustrates an image processing framework provided by an exemplary embodiment of the present application, the computer device may extract the initial features of the medical image and input them into the image processing model 30. The down-sampling branch 31 in the image processing model 30 carries out n-level down-sampling to obtain the down-sampling feature corresponding to each level. The initial features and the down-sampled features of each level are then input into the attention branch 32 of the image processing model 30; the attention branch extracts an attention map based on the last-level down-sampling feature and processes the initial features and each level of down-sampling features through the attention map to obtain the attention processing features corresponding to the initial features and each level of down-sampling features, respectively. The attention processing features are input to the classification branch 33 in the image processing model 30, and the classification branch 33 outputs the classification result of the medical image, for example, the probability that the tissue or organ in the medical image is normal.
In the embodiment shown in fig. 2, the steps 201 to 203 may be implemented by a neural network; in this embodiment, this neural network may be referred to as a dense convolutional network (DenseNet) structure based on an attention mechanism, hereinafter the DAB structure.
FIG. 4 is a flowchart illustrating an image processing method according to an exemplary embodiment. The method may be performed by a computer device, for example, the computer device may be a server, or the computer device may also be a terminal, or the computer device may include a server and a terminal, where the server may be the server 160 in the embodiment shown in fig. 1 and the terminal may be the terminal 140 in the embodiment shown in fig. 1. As shown in fig. 4, the image processing method may include the following steps.
Step 401, acquiring initial features of a target image.
In an exemplary aspect of the embodiment of the present application, the computer device may perform convolution processing on the target image through a convolution layer in the image processing model to obtain an initial feature of the target image.
For example, please refer to fig. 5, which shows a frame diagram of a DAB structure according to an embodiment of the present application. The input of the DAB structure may be any three-dimensional or four-dimensional matrix, representing two-dimensional or three-dimensional images or features with any number of channels (N × M × C1 or N × M × L × C1, where N, M, and L represent spatial dimensions and C1 represents the number of input channels). The output of the DAB structure has the same spatial dimensions as the input (N × M × C2 or N × M × L × C2, where C2 represents the number of output channels). The following embodiments of the present application are described by taking the processing of a two-dimensional image as an example.
As shown in fig. 5, for the target image, the computer device may process through the convolutional layer in the DAB structure in the image processing model to obtain the initial features 51 of the target image.
For example, for an input multi-channel two-dimensional image or feature, the DAB structure shown in fig. 5 first extracts, through a Convolutional Neural Network (CNN) layer, features at the same scale as the input (the number of channels may differ). Here, the feature at the same scale as the input may be the above-described initial feature.
Step 402, performing n-level down-sampling processing on the initial features of the target image to obtain n down-sampling features; n is a positive integer.
In a possible implementation manner, the performing n-level downsampling processing on the initial feature to obtain n downsampled features may include:
performing down-sampling processing on the second characteristic to obtain an intermediate sampling characteristic; the second feature is any one of the initial feature and the n down-sampling features except for the target down-sampling feature;
and performing convolution processing on the intermediate sampling feature to obtain a next-stage down-sampling feature of the second feature.
That is, in one possible implementation manner of the embodiment of the present application, the downsampling process may include a downsampling process and a convolution process, and in the downsampling process, the computer device may perform downsampling on the feature to be downsampled (i.e., the second feature) and then perform convolution operation on a downsampling result to obtain a downsampling feature of a next stage.
In another possible implementation manner, the performing n-level downsampling processing on the initial feature to obtain n downsampled features may include:
performing convolution processing on the third feature to obtain an intermediate convolution feature; the third feature is any one of the initial feature and the n down-sampling features except for the target down-sampling feature;
and performing down-sampling processing on the intermediate convolution feature to obtain a next-stage down-sampling feature of the third feature.
That is, in the downsampling process, the computer device may perform convolution operation on the feature to be downsampled (i.e., the third feature), and then downsample the convolution result to obtain the next-stage downsampling feature.
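The two orderings above can be sketched as follows, using 2 × 2 average pooling for the down-sampling operation and a padded 3 × 3 convolution with an illustrative smoothing kernel for the feature-extraction step (both choices are assumptions for this sketch, not the patent's exact layers):

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling: halves the height and width of an (H, W) feature
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def conv3x3(x, kernel):
    # "same"-padded 3x3 convolution: the spatial scale is preserved
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + 3, j:j + 3] * kernel).sum()
    return out

kernel = np.full((3, 3), 1.0 / 9.0)              # illustrative smoothing kernel
feat = np.arange(64, dtype=float).reshape(8, 8)  # a "second/third feature"

# Variant 1: down-sample first, then convolve the down-sampling result
next_v1 = conv3x3(downsample(feat), kernel)
# Variant 2: convolve first, then down-sample the convolution result
next_v2 = downsample(conv3x3(feat, kernel))

# Either way, the next-stage feature is half the scale of the input
assert next_v1.shape == next_v2.shape == (4, 4)
```

Both variants produce a next-stage feature at half the input scale; they differ only in whether the convolution sees the full-resolution or the pooled feature.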
For example, in the DAB structure shown in fig. 5, after downsampling the initial feature 51, the computer device may continue to extract the next-level downsampled feature 52 through another convolutional layer, and this operation may be repeated any number of times before the image scale is reduced to 1, where the repetition operation is set to 3 times in fig. 5 (i.e., n is 3). Here, the convolution kernel scale of the CNN layer may be 3 × 3, and the downsampled kernel scale may be 2 × 2, that is, the length and width of each layer of downsampled features 52 are half of the previous-level features.
Alternatively, in the DAB structure shown in fig. 5, the computer device may perform convolution operation on the initial features and then perform downsampling to obtain the downsampled features of the next stage, and this operation may also be repeated 3 times, where the length and width of each layer of downsampled features are half of those of the previous stage.
Step 403, acquiring an attention diagram of the target image based on the target downsampling feature in the n downsampling features; the target down-sampling feature is obtained by the last down-sampling processing of the n-level down-sampling processing.
In an embodiment of the application, the computer device may obtain an attention map based on query dimensional features and key dimensional features of the target downsampling features.
As shown in fig. 5, after the final-stage downsampling feature 52 is obtained, an Attention Map 53 may be calculated based on the final-stage downsampling feature 52. The embodiment of the present application can borrow the attention calculation in the Transformer: for a given last-stage down-sampling feature y ∈ R^(N×M×C), the query dimension feature q, the key dimension feature k and the value dimension feature v are obtained through three parallel convolutional neural networks. For convenience of calculation, q, k and v may be converted into two-dimensional matrices, where each row in a two-dimensional matrix corresponds to one pixel in the down-sampling feature y, that is:

q ∈ R^(NM×c_q), k ∈ R^(NM×c_k), v ∈ R^(NM×c_v)

Then, in the attention map 53, the attention of pixel point j to pixel point i is:

A = Softmax(q · k^T / √c_k)

wherein Softmax stands for the normalization function, T stands for matrix transposition, and · stands for matrix multiplication; c_q, c_k and c_v respectively represent the number of channels of q, k and v.
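A minimal NumPy sketch of the attention-map computation described above; the three parallel convolutional projections are replaced by fixed random matrices, and the √c_k scaling follows the usual Transformer convention (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C = 4, 5, 8            # last-stage down-sampled feature y has shape N x M x C
c_q = c_k = c_v = 6

y = rng.standard_normal((N, M, C)).reshape(N * M, C)  # one row per pixel

# Stand-ins for the three parallel convolutional projections
q = y @ rng.standard_normal((C, c_q))
k = y @ rng.standard_normal((C, c_k))
v = y @ rng.standard_normal((C, c_v))

# A = Softmax(q . k^T / sqrt(c_k)); row i holds the attention paid
# by pixel i to every pixel j
scores = q @ k.T / np.sqrt(c_k)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

assert A.shape == (N * M, N * M)
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution over pixels
```

The resulting NM × NM matrix plays the role of the attention map 53; v is carried along for the later step in which the map is applied to the value dimension feature.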
Step 404, based on the attention map, processes the initial feature and the n down-sampling features respectively to obtain attention processing features corresponding to the initial feature and the n down-sampling features respectively.
In a possible implementation manner, the processing the initial feature and the n down-sampled features based on the attention map to obtain attention processing features corresponding to the initial feature and the n down-sampled features respectively may include:
up-sampling the attention map to obtain a first attention map with a first scale; the first scale is the scale of a first feature, and the first feature is any one of the initial feature and the n down-sampled features;
and performing matrix multiplication processing on the first attention map and the first feature to obtain an attention processing feature corresponding to the first feature.
In the embodiment of the application, the attention map is calculated from the last-stage down-sampling feature, so its scale is the same as that of the last-stage down-sampling feature, while the scales of the initial feature through the (n-1)-th-stage down-sampling feature are larger than that of the attention map. To ensure the correctness of the attention mechanism, the attention map needs to be up-sampled first so that its scale matches the scale of the feature to be matrix-multiplied (i.e., the first feature); the attention mechanism operation can then be performed correctly.
For the last-stage down-sampling feature, the kernel size of the up-sampling of the corresponding attention map may be 1 × 1; that is, the attention map is used directly in the attention calculation with the last-stage down-sampling feature.
In a possible implementation manner, the above-mentioned process of performing matrix multiplication processing on the first attention map and the first feature to obtain the attention processing feature corresponding to the first feature may include:
and performing matrix multiplication processing on the first attention diagram and the value dimension characteristic of the first characteristic to obtain an attention processing characteristic corresponding to the first characteristic.
In the embodiment of the present application, after the attention map A is obtained, the attention map A is applied to the value dimension feature v, and the attention-processed feature SA can be obtained:

SA = A · v
That is, the target down-sampling feature is the feature of the last down-sampling layer, after multiple stages of down-sampling have already been performed. In order to provide features with sufficient spatial precision for subsequent tasks, the embodiment of the present application up-samples the attention map A to the same scale as the features of the previous layers, and matrix-multiplies the up-sampled attention maps with the features of those layers respectively, to obtain the attention processing features respectively corresponding to the initial feature and each stage of down-sampling features.
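The step just described, up-sampling the attention map before applying it to the value dimension feature of a larger-scale feature, can be sketched as follows; the nearest-neighbour up-sampling and the row renormalisation are illustrative assumptions, not steps stated in the patent:

```python
import numpy as np

def upsample_attention(A, factor):
    # Nearest-neighbour up-sampling of the attention map along both axes;
    # rows are renormalised so each remains a distribution (an assumption)
    A_up = np.repeat(np.repeat(A, factor, axis=0), factor, axis=1)
    return A_up / A_up.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
nm = 6                                 # pixels in the last-stage feature
A = rng.random((nm, nm))
A /= A.sum(axis=1, keepdims=True)      # a toy attention map

# Last-stage feature: 1x1 "up-sampling", i.e. use A directly
v_last = rng.standard_normal((nm, 3))
sa_last = A @ v_last

# A coarser-stage feature with 4x as many pixels needs a 4x up-sampled map
v_prev = rng.standard_normal((4 * nm, 3))
sa_prev = upsample_attention(A, 4) @ v_prev

assert sa_last.shape == (nm, 3)
assert sa_prev.shape == (4 * nm, 3)
```

The attention-processed feature at every scale keeps the pixel count of the feature it was applied to, which is what preserves spatial precision for the subsequent tasks.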
For example, in fig. 5, after the computer device obtains the attention map 53 by calculation from the last-stage down-sampling feature 52: for the last-stage down-sampling feature 52, the attention map 53 is directly matrix-multiplied with the value dimension feature v of the last-stage down-sampling feature 52 to obtain the attention processing feature 54 corresponding to the last-stage down-sampling feature 52; for the initial feature 51 and the two preceding stages of down-sampling features 52, the attention map 53 is first up-sampled to the same scale as the respective feature and then matrix-multiplied with the value dimension feature v of that feature, to obtain the attention processing features 54 corresponding to the initial feature 51 and the two preceding stages of down-sampling features 52.
For example, in fig. 5, since the attention map 53 is calculated from the last-stage down-sampling feature 52, its scale is the same as that of the last-stage down-sampling feature 52 (for example, both are 100 × 100). When attention is calculated for the last-stage down-sampling feature 52, the attention map 53 therefore does not need to be up-sampled, and the value dimension feature of the last-stage down-sampling feature 52 is directly matrix-multiplied with the attention map 53. Since the last-stage down-sampling feature 52 is obtained by three stages of down-sampling of the initial feature 51, the scale of the attention map 53 is smaller than the scales of the initial feature 51 and the first two stages of down-sampling features 52; the attention map 53 therefore needs to be up-sampled to the same scale as those features before the matrix multiplication. For example, taking the kernel scale of each of the three down-sampling stages as 2 × 2: when attention is calculated for the second-stage down-sampling feature 52, the attention map 53 is up-sampled to 200 × 200 and then used in the attention calculation (for example, with the value dimension feature of the second-stage down-sampling feature 52); when attention is calculated for the first-stage down-sampling feature 52, the attention map 53 is up-sampled to 400 × 400; and when attention is calculated for the initial feature 51, the attention map 53 is up-sampled to 800 × 800.
For the downsampling feature 52 of the second stage, the computer device performs upsampling with a kernel scale of 2 × 2 on the basis of the attention map 53 to obtain an upsampled attention map corresponding to the downsampling feature 52 of the second stage; for the downsampling feature 52 of the first stage, the computer device performs upsampling with a kernel scale of 4 × 4 on the basis of the attention map 53 to obtain an upsampled attention map corresponding to the downsampling feature 52 of the first stage; for the initial features, the computer device performs upsampling with a kernel scale of 8 × 8 on the basis of the attention map 53, resulting in an upsampled attention map corresponding to the initial features 51.
Alternatively, the above up-sampling of the attention map 53 may be performed stepwise. For example, taking the kernel scale of the down-sampling as 2 × 2: for the down-sampling feature 52 of the second stage, the computer device performs up-sampling with a kernel scale of 2 × 2 on the basis of the attention map 53 to obtain the up-sampled attention map corresponding to the down-sampling feature 52 of the second stage; for the down-sampling feature 52 of the first stage, the computer device performs up-sampling with a kernel scale of 2 × 2 on the basis of the up-sampled attention map corresponding to the down-sampling feature 52 of the second stage, to obtain the up-sampled attention map corresponding to the down-sampling feature 52 of the first stage; for the initial feature, the computer device performs up-sampling with a kernel scale of 2 × 2 on the basis of the up-sampled attention map corresponding to the down-sampling feature 52 of the first stage, to obtain the up-sampled attention map corresponding to the initial feature 51.
In the embodiment of the present application, the scale of the attention map A obtained by the above calculation is NM × NM, and the video memory is easily insufficient after up-sampling. In order to reduce the video memory requirement, the embodiment of the present application may calculate the attention map by introducing axial attention. The basic principle is that the horizontal part and the vertical part of the attention map A are calculated separately.
Taking the horizontal attention map as an example: after obtaining the query dimension feature q and the key dimension feature k of the target down-sampling feature, the computer device may average the q and k of the different pixel points in each column of the target down-sampling feature, and calculate the attention value of each column in the attention map from the obtained averages of q and k. When the attention mechanism operation is performed, the obtained attention value of each column is applied to the value dimension features of all pixels in the corresponding column of the target down-sampling feature; the same is done for the initial feature and the other down-sampling features. The calculation of the vertical attention map is analogous. In this way the scale of the attention map can be reduced from NM × NM to N × N + M × M, thereby greatly reducing the video memory requirement.
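The column/row averaging described above can be sketched as follows; the softmax normalisation and the √c scaling are assumptions carried over from the standard attention formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, c = 4, 5, 6
q = rng.standard_normal((N, M, c))
k = rng.standard_normal((N, M, c))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Horizontal axial attention: average q and k over the pixels of each
# column, then attend among the M columns (an M x M map)
q_col, k_col = q.mean(axis=0), k.mean(axis=0)
A_h = softmax(q_col @ k_col.T / np.sqrt(c))

# Vertical axial attention: average over each row, attend among the N rows
q_row, k_row = q.mean(axis=1), k.mean(axis=1)
A_v = softmax(q_row @ k_row.T / np.sqrt(c))

# N*N + M*M entries instead of (N*M)^2 for the full attention map
assert A_h.size + A_v.size == N * N + M * M
assert (N * M) ** 2 > N * N + M * M
```

For N = M = 100 this drops the map from 10000 × 10000 entries to 2 × (100 × 100), which is the memory saving the paragraph above refers to.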
In addition, the attention mechanism part in the embodiment of the application can be realized by adopting a multi-head attention mechanism. The multi-head attention mechanism is one of the cores of the Transformer, and the multi-head attention mechanism can also be applied to the DAB framework in the embodiment of the present application, for example, the same feature can be input into a plurality of blocks (blocks) connected in parallel, and features output by the blocks are cascaded together to serve as the input of the next layer.
That is to say, the DAB framework in the embodiment of the present application is a model framework constructed based on a multi-head attention mechanism. The multi-head attention mechanism includes multiple sets of (Q, K, V) matrices, where one set of (Q, K, V) matrices represents one attention operation; after the outputs of the multiple heads are spliced together, the result is multiplied by a projection matrix to obtain the final output of the multi-head attention mechanism, that is, the attention processing feature in the embodiment of the present application.
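A sketch of the multi-head combination described above, with random stand-ins for the per-head attention maps, per-head value slices, and the projection matrix (all illustrative, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
num_heads, nm, c_v = 2, 6, 3

v = rng.standard_normal((nm, num_heads * c_v))

# One attention map per head (stand-ins for the per-head (Q, K, V) results);
# each head attends over its own slice of the value channels
heads = []
for h in range(num_heads):
    A = rng.random((nm, nm))
    A /= A.sum(axis=1, keepdims=True)
    heads.append(A @ v[:, h * c_v:(h + 1) * c_v])

# Splice the head outputs together, then multiply by a projection matrix
concat = np.concatenate(heads, axis=1)
W_o = rng.standard_normal((num_heads * c_v, num_heads * c_v))
out = concat @ W_o

assert out.shape == (nm, num_heads * c_v)
```

The concatenate-then-project step is what turns the parallel blocks into a single feature that can serve as the input of the next layer.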
After obtaining the attention processing features, the computer device may obtain a processing result of performing a specified image processing task on the target image based on the attention processing features respectively corresponding to the initial features and the n down-sampling features. The process may refer to the subsequent steps.
And step 405, fusing the attention processing features respectively corresponding to the initial features and the n down-sampling features to obtain the image features of the target image.
In the embodiment of the present application, the DAB framework may concatenate the initial features and the attention processing features corresponding to the n downsampling features respectively as the final output (i.e. the image features of the target image) of the DAB framework.
In a possible implementation manner, the process of obtaining the image feature of the target image by fusing the attention processing features respectively corresponding to the initial feature and the n down-sampling features may include:
the attention processing features corresponding to the n down-sampling features are up-sampled respectively to obtain n up-sampling features; the scale of the up-sampling feature is the same as the scale of the initial feature;
and cascading the initial features and the n up-sampling features to obtain the image features.
In this embodiment of the application, since the scales of the attention processing features respectively corresponding to the initial feature and the n down-sampling features are different, in order to fuse the multiple attention processing features, the computer device may up-sample the attention processing features respectively corresponding to the n down-sampling features to obtain n up-sampling features with the same scale as the initial feature, and then concatenate the n up-sampling features with the initial feature's attention processing feature to obtain the image features.
For example, in fig. 5, the scale of the attention processing feature 54 corresponding to each of the n downsampled features in the DAB frame is smaller than the scale of the attention processing feature 54 corresponding to the initial feature, so that for the attention processing feature 54 corresponding to each of the n downsampled features, the DAB frame can perform upsampling processing by an upsampling layer to obtain n upsampled features, and after the upsampled features are concatenated with the attention processing feature 54 corresponding to the initial feature, an image feature 55 can be obtained, and the image feature 55 can be output as the DAB frame.
For example, in fig. 5, the purpose of up-sampling the attention processing features 54 is to bring the up-sampled features to the same scale, so that the subsequent concatenation is possible. Here, the scale of the attention processing feature 54 corresponding to the initial feature is the largest; at this time, the attention processing feature 54 corresponding to the initial feature may be left as-is, and only the attention processing features 54 corresponding to the three stages of down-sampling features 52 need to be up-sampled, so that their scales all match the scale of the attention processing feature 54 corresponding to the initial feature 51. For example, taking the scale of the initial feature 51 as 800 × 800 and the down-sampling kernel scale as 2 × 2, the scale of the attention processing feature 54 corresponding to the initial feature is 800 × 800, and the scales of the attention processing features 54 corresponding to the three stages of down-sampling features 52 are 400 × 400, 200 × 200 and 100 × 100 respectively; the computer device may then up-sample the attention processing features 54 corresponding to the three stages of down-sampling features 52 to 800 × 800 and splice them with the attention processing feature 54 corresponding to the initial feature 51 to obtain the image features 55.
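The fusion steps above (up-sample each down-sampled stage's attention processing feature to the initial scale, then concatenate along the channel axis) can be sketched as follows; nearest-neighbour up-sampling is an illustrative assumption:

```python
import numpy as np

def upsample2d(x, factor):
    # Nearest-neighbour up-sampling of an (H, W, C) feature map
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(4)
C = 2
# Attention-processed initial feature at full scale, plus three stages
# of attention-processed down-sampled features at halved scales
initial = rng.standard_normal((8, 8, C))
down = [rng.standard_normal((8 // 2 ** (i + 1), 8 // 2 ** (i + 1), C))
        for i in range(3)]

# Up-sample each stage back to the scale of the initial feature, then
# concatenate everything along the channel axis
ups = [upsample2d(f, 2 ** (i + 1)) for i, f in enumerate(down)]
image_feature = np.concatenate([initial] + ups, axis=2)

assert image_feature.shape == (8, 8, 4 * C)  # n + 1 = 4 feature maps fused
```

Concatenating along the channel axis keeps each stage's information separate, leaving it to the next layer to weigh the fine-scale against the coarse-scale features.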
In step 406, a processing result of performing a specified image processing task on the target image is obtained based on the image features.
After obtaining the image features, the computer device may execute a specified image processing task according to the image features to obtain a processing result.
In the embodiment of the present application, a single DAB frame may be used to extract image features, or a plurality of DAB frames may be used in a cascaded manner to extract final image features, wherein the output of each DAB frame is used as the input of the next layer.
For example, please refer to fig. 6, which shows a framework diagram of an image processing model according to an embodiment of the present application. As shown in fig. 6, taking an image processing model for image classification as an example, the image processing model includes 3 DAB frames and a Multi-Layer Perceptron (MLP) network, where the 3 DAB frames and the MLP network are cascaded. The MLP may be composed of two fully connected layers; the features extracted by DAB can be flattened and input into the MLP to obtain the final classification result. After a medical image is input into the first DAB frame 61, the DAB frame 61 performs the processing shown in steps 401 to 405 above on the medical image and outputs image features; the image features output by the DAB frame 61 serve as the input of the DAB frame 62 (for example, as the initial features of the DAB frame 62), and so on. The image features output by the DAB frame 63 are input into the MLP network 64, which outputs the image classification result (for example, the probability that the tissue or organ in the image is normal).
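A toy end-to-end sketch of the cascaded pipeline in fig. 6; the DAB frames are replaced here by simple random channel projections and the MLP by two random fully connected layers (all weights are illustrative stand-ins, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(5)

def dab_block(x, out_c):
    # Stand-in for one DAB frame: a random per-pixel channel projection
    # replaces the real attention computation for this sketch
    h, w, c = x.shape
    w_proj = rng.standard_normal((c, out_c))
    return (x.reshape(h * w, c) @ w_proj).reshape(h, w, out_c)

def mlp(x, num_classes):
    # Two fully connected layers on the flattened features, with a
    # softmax to yield class probabilities
    flat = x.ravel()
    hidden = np.maximum(flat @ rng.standard_normal((flat.size, 16)), 0.0)
    logits = hidden @ rng.standard_normal((16, num_classes))
    e = np.exp(logits - logits.max())
    return e / e.sum()

image = rng.standard_normal((8, 8, 3))   # stand-in for the medical image
feat = image
for out_c in (4, 8, 16):                  # three cascaded DAB frames
    feat = dab_block(feat, out_c)         # output of one frame feeds the next
probs = mlp(feat, num_classes=2)          # e.g. normal vs. abnormal

assert probs.shape == (2,)
assert np.isclose(probs.sum(), 1.0)
```

The chaining pattern (each block's output is the next block's input, with a task head at the end) is the structural point; the real model would use trained DAB and MLP weights.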
The image processing model can be trained in advance with image samples labeled according to the task type. For example, taking image classification as an example, in the model training stage the model training device may obtain sample images labeled as normal or abnormal, input a sample image into the image processing model, and have the image processing model output a predicted classification result for the sample image; a loss function is then calculated by combining the predicted classification result with the labeling information of the sample image, and the parameters of the image processing model are adjusted through the loss function, for example, the parameters in the 3 DAB frames and the MLP network in fig. 6. After multiple rounds of iterative training, a converged image processing model is obtained, and the converged image processing model can execute the scheme shown in the above embodiment of the present application.
The loss function may adopt a cross-entropy loss function. For the manually labeled results, the label of a normal sample is 0 and the label of an abnormal sample is 1, and the loss function may be defined as:

L_ce = -(y·log(p) + (1 - y)·log(1 - p))
wherein y is the sample label of the sample image, and p is the probability, output by the image processing model, that the sample image is a positive sample. According to the loss function, the parameters of the image processing model are updated using a gradient descent method based on Adam; for example, in a certain image classification task, β in Adam is (0.95, 0.9995). The initial learning rate may be 0.001 and may be reduced to one fifth every 20 epochs (one pass of all samples through the model is called 1 epoch); a total of 100 epochs may be trained, and the batch size may be 50.
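The loss and learning-rate schedule above can be checked directly; the interpretation of "one fifth every 20 epochs" as a step decay at each 20-epoch boundary is an assumption:

```python
import math

def bce_loss(y, p):
    # L_ce = -(y*log(p) + (1 - y)*log(1 - p))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction gives a small loss; a confident wrong
# prediction gives a large loss
assert bce_loss(1, 0.99) < bce_loss(1, 0.01)
assert abs(bce_loss(0, 0.5) - math.log(2)) < 1e-12

# Step-decay schedule: start at 0.001 and divide by five every 20 epochs
# over a 100-epoch run (decay-boundary placement is an assumption)
def lr_at(epoch, base=0.001):
    return base / 5 ** (epoch // 20)

assert lr_at(0) == 0.001
assert lr_at(20) == 0.001 / 5
assert lr_at(99) == 0.001 / 5 ** 4
```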
When the scheme shown in the application is realized as an image classification network, the method can be used for classifying various medical images and even other images.
The DAB involved in the embodiment of the present application serves as a basic module of a neural network; its application is similar to that of the Transformer and CNN modules, and it can be applied to feature extraction in most image processing tasks. Multiple modules can be stacked to produce a deeper network and extract higher-dimensional features for complex tasks. For any task such as image recognition, segmentation or detection, the CNN or Transformer module is replaced by this module to extract features, which are then input into a classification head, a segmentation head or a detection head. The training process and the inference mode can both follow the scheme corresponding to the task.
The number of layers, the kernel size, the number of feature channels and the down-sampling kernel size of the CNN network in the DAB involved in the embodiment of the present application can all be freely adjusted according to task requirements.
Besides the basic CNN, the feature extraction layer in the embodiment of the present application may also be replaced with other networks, such as a Residual Network (ResNet), a dilated CNN, and the like.
In summary, according to the scheme shown in the embodiment of the present application, the initial features of the target image are subjected to multi-stage down-sampling, the attention map is calculated from the features obtained by the last stage of down-sampling, and then the initial features and the down-sampling features of each stage are processed through the attention map to extract the features of the target image.
The scheme shown in the above embodiments of the present application may be implemented or executed in combination with a blockchain. For example, some or all of the steps in the above embodiments may be performed in a blockchain system; or data required for executing the steps in the above embodiments, or data generated by them, may be stored in the blockchain system. For example, the training samples used for model training, and model input data such as the target image in the model application process, can be acquired by the computer device from the blockchain system; for another example, the parameters of the model obtained after training (including the parameters of the DAB framework and the parameters of other parts of the image processing model) may be stored in the blockchain system.
It is understood that in the embodiments of the present application, related data of user information such as medical data and medical images are referred to, and when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
Fig. 7 is a block diagram illustrating a configuration of an image processing apparatus according to an exemplary embodiment. The device can realize all or part of the steps in the method provided by the embodiment shown in fig. 2 or fig. 4, and the image processing device comprises:
a down-sampling module 701, configured to perform n-level down-sampling on the initial feature of the target image to obtain n down-sampling features;
an attention map obtaining module 702, configured to obtain an attention map of the target image based on a target downsampled feature of the n downsampled features; the target downsampling feature is obtained by the last downsampling processing of the n-level downsampling processing;
an attention processing module 703, configured to process the initial feature and the n downsampled features based on the attention map, to obtain attention processing features corresponding to the initial feature and the n downsampled features, respectively;
a result obtaining module 704, configured to obtain a processing result of performing a specified image processing task on the target image based on the attention processing features respectively corresponding to the initial feature and the n downsampling features.
In one possible implementation manner, the attention processing module 703 includes:
the up-sampling unit is used for up-sampling the attention diagram to obtain a first attention diagram with a first scale; the first scale is a scale of a first feature, the first feature being any one of the initial feature and the n down-sampled features;
and the processing unit is used for carrying out matrix multiplication processing on the first attention diagram and the first characteristic to obtain an attention processing characteristic corresponding to the first characteristic.
In one possible implementation, the attention map acquisition module 702 is configured to,
acquiring the attention diagram based on the query dimensional features and the key dimensional features of the target downsampling features;
and the processing unit is used for carrying out matrix multiplication processing on the first attention diagram and the value dimension characteristic of the first characteristic to obtain an attention processing characteristic corresponding to the first characteristic.
In a possible implementation manner, the result obtaining module 704 includes:
a fusion unit, configured to fuse attention processing features corresponding to the initial feature and the n downsampling features, respectively, to obtain an image feature of the target image;
a result acquisition unit configured to acquire the processing result based on the image feature.
In a possible implementation manner, the fusion unit is configured to,
the attention processing features corresponding to the n down-sampling features are up-sampled respectively to obtain n up-sampling features; the scale of the upsampled feature is the same as the scale of the initial feature;
and cascading the initial features and the n up-sampling features to obtain the image features.
In one possible implementation, the down-sampling module 701 is configured to,
performing down-sampling processing on the second characteristic to obtain an intermediate sampling characteristic; the second feature is any one of the initial feature and the n down-sampled features except for the target down-sampled feature;
and performing convolution processing on the intermediate sampling feature to obtain a next-stage down-sampling feature of the second feature.
In one possible implementation, the down-sampling module 701 is configured to,
performing convolution processing on the third characteristic to obtain an intermediate convolution characteristic; the third feature is any one of the initial feature and the n down-sampled features except for the target down-sampled feature;
and performing downsampling processing on the intermediate convolution characteristic to obtain a next-stage downsampling characteristic of the third characteristic.
In summary, according to the scheme shown in the embodiment of the present application, the initial features of the target image are subjected to multi-stage down-sampling, the attention map is calculated from the features obtained by the last stage of down-sampling, and then the initial features and the down-sampling features of each stage are processed through the attention map to extract the features of the target image.
FIG. 8 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment. The computer apparatus 800 includes a Central Processing Unit (CPU) 801, a system Memory 804 including a Random Access Memory (RAM) 802 and a Read-Only Memory (ROM) 803, and a system bus 805 connecting the system Memory 804 and the Central Processing Unit 801. The computer device 800 also includes a basic input/output system 806 for facilitating information transfer between various components within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, flash memory or other solid state storage technology, CD-ROM, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 804 and mass storage 807 described above may be collectively referred to as memory.
The computer device 800 may be connected to the internet or other network devices through a network interface unit 811 coupled to the system bus 805.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processing unit 801 executes the one or more programs to implement all or part of the steps of the method shown in any one of fig. 2 or fig. 4.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory comprising computer programs (instructions), executable by a processor of a computer device to perform the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods shown in the various embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. An image processing method, characterized in that the method comprises:
performing n-level down-sampling processing on the initial feature of the target image to obtain n down-sampling features; n is a positive integer;
acquiring an attention map of the target image based on a target down-sampling feature of the n down-sampling features; the target downsampling feature is obtained by the last downsampling processing of the n-level downsampling processing;
processing the initial feature and the n down-sampling features respectively based on the attention map to obtain attention processing features corresponding to the initial feature and the n down-sampling features respectively;
and acquiring a processing result of executing a specified image processing task on the target image based on the attention processing features respectively corresponding to the initial features and the n down-sampling features.
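The pipeline of claim 1 (n-level down-sampling, with the target feature taken from the last level) can be sketched in NumPy as follows. The 2x2 average pooling, the function names, and all shapes are illustrative assumptions for the sketch, not choices taken from the patent:

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling over an (H, W, C) feature map (illustrative choice;
    # assumes H and W are divisible by 2)
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def n_level_downsampling(initial, n):
    # n-level down-sampling: each level consumes the previous level's output
    feats = [initial]
    for _ in range(n):
        feats.append(downsample(feats[-1]))
    # the target down-sampling feature is the output of the last level
    return feats, feats[-1]

feats, target = n_level_downsampling(np.ones((8, 8, 4)), n=2)
print(len(feats) - 1, target.shape)  # n down-sampling features; deepest scale
```

With an 8x8 input and n = 2, this yields features at the 8x8, 4x4, and 2x2 scales, the last of which is the target down-sampling feature used to build the attention map.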
2. The method according to claim 1, wherein the processing the initial feature and the n down-sampling features respectively based on the attention map to obtain attention processing features respectively corresponding to the initial feature and the n down-sampling features comprises:
up-sampling the attention map to obtain a first attention map with a first scale; the first scale is the scale of a first feature, and the first feature is any one of the initial feature and the n down-sampling features;
and performing matrix multiplication on the first attention map and the first feature to obtain the attention processing feature corresponding to the first feature.
3. The method of claim 2, wherein the acquiring the attention map of the target image based on a target down-sampling feature of the n down-sampling features comprises:
acquiring the attention map based on the query-dimension feature and the key-dimension feature of the target down-sampling feature;
and the performing matrix multiplication on the first attention map and the first feature to obtain the attention processing feature corresponding to the first feature comprises:
performing matrix multiplication on the first attention map and the value-dimension feature of the first feature to obtain the attention processing feature corresponding to the first feature.
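A minimal sketch of the query/key/value computation described in claims 2 and 3, shown at the target feature's own scale for brevity (the claims additionally up-sample the attention map before applying it to larger-scale value features). The linear projections and the 1/sqrt(c) scaling are conventional self-attention assumptions, not specified by the claims:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
c, n_pos = 4, 16                       # channels and spatial positions (assumed)
target = rng.normal(size=(n_pos, c))   # target down-sampling feature, flattened to (N, C)
Wq, Wk, Wv = (rng.normal(size=(c, c)) for _ in range(3))

q, k = target @ Wq, target @ Wk        # query-dimension and key-dimension features
attn = softmax(q @ k.T / np.sqrt(c))   # attention map; each row sums to 1
v = target @ Wv                        # value-dimension feature
out = attn @ v                         # matrix multiplication of attention map and value
print(attn.shape, out.shape)
```

Each row of `attn` is a probability distribution over the N positions, so the matrix multiplication re-weights the value features by spatial relevance.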
4. The method according to claim 1, wherein the acquiring a processing result of performing a specified image processing task on the target image based on the attention processing features respectively corresponding to the initial feature and the n down-sampling features comprises:
fusing the attention processing features respectively corresponding to the initial feature and the n down-sampling features to obtain an image feature of the target image;
and acquiring the processing result based on the image feature.
5. The method according to claim 4, wherein the fusing the attention processing features respectively corresponding to the initial feature and the n down-sampling features to obtain the image feature of the target image comprises:
respectively up-sampling the attention processing features corresponding to the n down-sampling features to obtain n up-sampling features; the scale of each up-sampling feature is the same as the scale of the initial feature;
and cascading the initial feature and the n up-sampling features to obtain the image feature.
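The fusion step of claim 5 can be sketched as follows; nearest-neighbor up-sampling and channel-wise concatenation are illustrative interpretations of "up-sampling" and "cascading", and the shapes are invented for the sketch:

```python
import numpy as np

def upsample(x, factor):
    # nearest-neighbor up-sampling of an (H, W, C) map (illustrative choice)
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

initial = np.zeros((8, 8, 4))                      # attention processing feature, initial scale
down = [np.zeros((4, 4, 4)), np.zeros((2, 2, 4))]  # attention processing features, n = 2 levels
ups = [upsample(f, initial.shape[0] // f.shape[0]) for f in down]  # back to the initial scale
fused = np.concatenate([initial] + ups, axis=-1)   # channel-wise cascade
print(fused.shape)
```

After up-sampling, every branch shares the initial 8x8 spatial scale, so cascading stacks them along the channel axis into a single image feature.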
6. The method of claim 1, wherein the performing n-level down-sampling processing on the initial feature to obtain n down-sampling features comprises:
performing down-sampling processing on a second feature to obtain an intermediate sampling feature; the second feature is any one of the initial feature and the n down-sampling features except the target down-sampling feature;
and performing convolution processing on the intermediate sampling feature to obtain a next-stage down-sampling feature of the second feature.
7. The method of claim 1, wherein the performing n-level down-sampling processing on the initial feature to obtain n down-sampling features comprises:
performing convolution processing on a third feature to obtain an intermediate convolution feature; the third feature is any one of the initial feature and the n down-sampling features except the target down-sampling feature;
and performing down-sampling processing on the intermediate convolution feature to obtain a next-stage down-sampling feature of the third feature.
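Claims 6 and 7 differ only in the order of the two per-level operations: down-sample then convolve versus convolve then down-sample. The sketch below contrasts the two orders; the 1x1 convolution and average pooling are illustrative stand-ins, not the patent's operators:

```python
import numpy as np

def avg_pool(x):
    # 2x2 average pooling over an (H, W, C) map
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def conv1x1(x, w):
    # a 1x1 convolution is a per-pixel linear map over channels
    return x @ w

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
w = rng.normal(size=(4, 4))

a = conv1x1(avg_pool(x), w)  # claim 6 order: down-sample, then convolve
b = avg_pool(conv1x1(x, w))  # claim 7 order: convolve, then down-sample
print(a.shape, np.allclose(a, b))
```

For this purely linear 1x1 case the two orders coincide, since average pooling commutes with a per-pixel linear map; with larger kernels or non-linear activations, as in typical networks, the two orders generally produce different features.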
8. An image processing apparatus, characterized in that the apparatus comprises:
the down-sampling module is used for performing n-level down-sampling processing on the initial feature of the target image to obtain n down-sampling features;
an attention map acquisition module for acquiring an attention map of the target image based on a target down-sampling feature of the n down-sampling features; the target downsampling feature is obtained by the last downsampling processing of the n-level downsampling processing;
an attention processing module, configured to process the initial feature and the n downsampled features based on the attention map, to obtain attention processing features corresponding to the initial feature and the n downsampled features, respectively;
and the result acquisition module is used for acquiring a processing result of executing a specified image processing task on the target image based on the attention processing characteristics respectively corresponding to the initial characteristics and the n down-sampling characteristics.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one computer instruction, the at least one computer instruction being loaded and executed by the processor to implement the image processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein at least one computer instruction, which is loaded and executed by a processor to implement the image processing method according to any one of claims 1 to 7.
CN202111601437.7A 2021-07-31 2021-12-24 Image processing method, device, equipment and storage medium Pending CN114332574A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110877152X 2021-07-31
CN202110877152 2021-07-31

Publications (1)

Publication Number Publication Date
CN114332574A (en) 2022-04-12

Family

ID=81012808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111601437.7A Pending CN114332574A (en) 2021-07-31 2021-12-24 Image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114332574A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035545A (en) * 2022-05-24 2022-09-09 北京深睿博联科技有限责任公司 Target detection method and device based on improved self-attention mechanism
CN117009792A (en) * 2023-10-07 2023-11-07 腾讯科技(深圳)有限公司 Model data processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110929780B (en) Video classification model construction method, video classification device, video classification equipment and medium
CN110782395B (en) Image processing method and device, electronic equipment and computer readable storage medium
DE102019000171A1 (en) Digital environment for the location of semantic classes
EP3923233A1 (en) Image denoising method and apparatus
US11983903B2 (en) Processing images using self-attention based neural networks
WO2023015935A1 (en) Method and apparatus for recommending physical examination item, device and medium
CN114418030B (en) Image classification method, training method and device for image classification model
CN114332574A (en) Image processing method, device, equipment and storage medium
CN113642585B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN113705733A (en) Medical bill image processing method and device, electronic device and storage medium
CN113191495A (en) Training method and device for hyper-resolution model and face recognition method and device, medium and electronic equipment
US20240152770A1 (en) Neural network search method and related device
CN113821668A (en) Data classification identification method, device, equipment and readable storage medium
CN115471216A (en) Data management method of intelligent laboratory management platform
CN114037699B (en) Pathological image classification method, equipment, system and storage medium
WO2021139351A1 (en) Image segmentation method, apparatus, medium, and electronic device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN115439470A (en) Polyp image segmentation method, computer-readable storage medium, and computer device
CN114299304B (en) Image processing method and related equipment
CN115601820A (en) Face fake image detection method, device, terminal and storage medium
Yuan et al. Low-res MobileNet: An efficient lightweight network for low-resolution image classification in resource-constrained scenarios
CN114529750A (en) Image classification method, device, equipment and storage medium
CN113822282A (en) Image semantic segmentation method and device, computer equipment and storage medium
CN116740078A (en) Image segmentation processing method, device, equipment and medium
WO2023160157A1 (en) Three-dimensional medical image recognition method and apparatus, and device, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination