CN114283152A - Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium


Info

Publication number
CN114283152A
Authority
CN
China
Prior art keywords
model
image
coding
attention
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110951040.4A
Other languages
Chinese (zh)
Inventor
林一
曲志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110951040.4A
Publication of CN114283152A

Abstract

The application discloses an image processing method, an image processing model training method, an image processing device, image processing equipment and an image processing medium, belongs to the technical field of artificial intelligence, and relates to computer vision technology within that field. The image processing method comprises the following steps: acquiring an image to be processed and an image processing model, wherein the image processing model comprises a coding model and a decoding model, the coding model comprises an attention model and a convolution model, the attention model performs coding based on global information, and the convolution model performs coding based on local information; calling the attention model and the convolution model to encode the image to be processed based on the global information and the local information to obtain target coding features; calling the decoding model to decode the target coding features to obtain target image features; and acquiring a segmentation result based on the target image features. In this manner, the target image features are obtained by attending to both global information and local information, so the information taken into account is rich, the features are highly reliable, and the accuracy of the segmentation result is improved.

Description

Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium
Technical Field
The embodiments of the application relate to the technical field of artificial intelligence, and in particular to an image processing method, an image processing model training method, an image processing device, image processing equipment and a medium.
Background
With the development of artificial intelligence technology, images are processed in more and more application scenarios. One such scenario is as follows: an image contains a sub-image of a reference object, and an image processing model is called to process the image so as to segment the sub-image from the image and obtain a segmentation result of the image.
In the related art, the image features from which the segmentation result is obtained are extracted by focusing only on local information. The information taken into account is therefore limited, the reliability of the image features is poor, and the accuracy of the resulting segmentation is correspondingly poor.
Disclosure of Invention
The embodiments of the application provide an image processing method, an image processing model training method, an image processing device, image processing equipment and a medium, which can be used to improve the reliability of image features and thereby improve the accuracy of the obtained segmentation result. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides an image processing method, where the method includes:
acquiring an image to be processed and an image processing model, wherein the image processing model comprises a coding model and a decoding model, the coding model comprises an attention model and a convolution model, the attention model performs coding based on global information, and the convolution model performs coding based on local information;
calling the attention model and the convolution model to encode the image to be processed based on global information and local information to obtain target encoding characteristics;
calling the decoding model to decode the target coding features to obtain target image features;
and acquiring a segmentation result of the image to be processed based on the target image characteristic.
There is also provided a method of training an image processing model, the method comprising:
acquiring a sample image, label information of the sample image and an initial image processing model, wherein the initial image processing model comprises an initial coding model and an initial decoding model, and the initial coding model comprises an initial attention model and an initial convolution model;
calling the initial attention model and the initial convolution model to encode the sample image based on global information and local information to obtain sample encoding characteristics;
calling the initial decoding model to decode the sample coding features to obtain sample image features;
obtaining a segmentation result of the sample image based on the sample image characteristics;
and training the initial image processing model based on the segmentation result of the sample image and the label information of the sample image to obtain an image processing model.
In another aspect, there is provided an image processing apparatus, the apparatus including:
a first obtaining unit, configured to obtain an image to be processed and an image processing model, wherein the image processing model comprises a coding model and a decoding model, the coding model comprises an attention model and a convolution model, the attention model performs coding based on global information, and the convolution model performs coding based on local information;
the second obtaining unit is used for calling the attention model and the convolution model to encode the image to be processed based on global information and local information to obtain target encoding characteristics;
the third obtaining unit is used for calling the decoding model to decode the target coding features to obtain target image features;
and the fourth acquisition unit is used for acquiring the segmentation result of the image to be processed based on the target image characteristic.
In a possible implementation manner, the number of the coding models is at least one, and the second obtaining unit is configured to invoke an attention model and a convolution model in a first coding model to code the image to be processed based on global information and local information, so as to obtain a basic feature and a connection feature output by the first coding model; starting from a second coding model, calling an attention model and a convolution model in a next coding model to code the basic features output by a previous coding model based on global information and local information to obtain the basic features and connection features output by the next coding model until the basic features and connection features output by a penultimate coding model are obtained, wherein the connection features output by each coding model from the first coding model to the penultimate coding model are used for providing data support for executing the step of calling the decoding model to decode the target coding features to obtain target image features; and calling an attention model and a convolution model in the last coding model to code the basic features output by the penultimate coding model based on global information and local information to obtain the connection features output by the last coding model, and taking the connection features output by the last coding model as the target coding features.
In a possible implementation manner, the attention model in the first coding model includes a first attention model, and the second obtaining unit is further configured to invoke the convolution model in the first coding model and the first attention model to encode the image to be processed based on local information and global information, so as to obtain the basic feature output by the first coding model; and to acquire the connection feature output by the first coding model based on the basic feature output by the first coding model.
In a possible implementation manner, the second obtaining unit is further configured to invoke a convolution model in the first coding model to code the image to be processed based on local information, so as to obtain a first coding feature; calling the first attention model to encode the image to be processed based on global information to obtain a second encoding characteristic; fusing the first coding feature and the second coding feature to obtain a fused feature; and acquiring the basic characteristics output by the first coding model based on the fusion characteristics.
In a possible implementation manner, the second obtaining unit is further configured to obtain a block feature of each image block of the to-be-processed image, and map the block feature of each image block to obtain a mapping feature of each image block; acquiring the position characteristics of an image block; acquiring reference characteristics of the image to be processed based on the mapping characteristics of each image block and the position characteristics of the image block; and calling the first attention model to encode the reference features of the image to be processed based on global information to obtain the second encoding features.
In a possible implementation manner, the first attention model includes an attention module and a non-linear processing module, and the second obtaining unit is further configured to invoke the attention module to process the reference feature, so as to obtain a first intermediate feature; splicing the first intermediate feature and the reference feature to obtain a feature to be processed; calling the nonlinear processing module to process the to-be-processed features to obtain second intermediate features; and splicing the second intermediate characteristic and the characteristic to be processed to obtain the second coding characteristic.
In a possible implementation manner, the second obtaining unit is further configured to invoke a convolution model in the first coding model to code the image to be processed based on local information, so as to obtain a first coding feature; calling the first attention model to encode the first encoding characteristic based on global information to obtain a third encoding characteristic; and acquiring the basic characteristics output by the first coding model based on the third coding characteristics.
In a possible implementation manner, the second obtaining unit is further configured to invoke the first attention model to encode the image to be processed based on global information to obtain a second encoding feature; calling a convolution model in the first coding model to code the second coding feature based on local information to obtain a fourth coding feature; and acquiring the basic characteristics output by the first coding model based on the fourth coding characteristics.
In a possible implementation manner, the attention model in the first coding model includes a second attention model, and the second obtaining unit is further configured to call a convolution model in the first coding model to encode the image to be processed based on local information, so as to obtain a basic feature output by the first coding model; and calling the second attention model to encode the basic features output by the first coding model based on global information to obtain the connection features output by the first coding model.
There is also provided an apparatus for training an image processing model, the apparatus comprising:
a first obtaining unit, configured to obtain a sample image, label information of the sample image, and an initial image processing model, where the initial image processing model includes an initial coding model and an initial decoding model, and the initial coding model includes an initial attention model and an initial convolution model;
the second obtaining unit is used for calling the initial attention model and the initial convolution model to code the sample image based on global information and local information to obtain sample coding characteristics;
the third obtaining unit is used for calling the initial decoding model to decode the sample coding features to obtain sample image features;
a fourth obtaining unit, configured to obtain a segmentation result of the sample image based on the sample image feature;
and the training unit is used for training the initial image processing model based on the segmentation result of the sample image and the label information of the sample image to obtain the image processing model.
In one possible implementation, the sample image includes a sub-image of a reference object, the label information of the sample image includes at least one of a point label, a first auxiliary label, or a second auxiliary label, the first auxiliary label and the second auxiliary label are both derived based on the point label, and the point label is determined based on a reference point within a region where the sub-image is located in the sample image.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one computer program is stored in the memory, and the at least one computer program is loaded by the processor and executed to enable the computer device to implement any one of the image processing methods or the training method of the image processing model described above.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor, so as to enable a computer to implement any one of the above-mentioned image processing methods or training methods of image processing models.
In another aspect, a computer program product or a computer program is also provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute any one of the image processing methods or the training method of the image processing model.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
according to the technical scheme provided by the embodiment of the application, the attention model and the convolution model are called firstly to obtain the target coding features based on the global information and the local information, and then the target image features are obtained according to the target coding features. The target coding features are obtained based on the global information and the local information, so that the target image features obtained according to the target coding features are obtained by comprehensively focusing on the global information and the local information, the focused information is rich, the reliability of the target image features is high, and the accuracy of the obtained segmentation results is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of an image processing method provided in an embodiment of the present application;
fig. 3 is a flowchart of a process of calling an attention model and a convolution model in a first coding model to code an image to be processed based on global information and local information to obtain a basic feature and a connection feature output by the first coding model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a process of calling a convolution model in a first coding model to code an image to be processed based on local information to obtain a first coding feature according to an embodiment of the present application;
fig. 5 is a schematic diagram of a process of calling a first attention model to encode a reference feature of an image to be processed based on global information to obtain a second encoding feature according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process for obtaining a target image feature according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of a training method of an image processing model according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a reference object tag and a point tag provided in an embodiment of the present application;
FIG. 9 is a schematic view of a first auxiliary label provided by an embodiment of the present application;
FIG. 10 is a schematic illustration of a second auxiliary label provided by an embodiment of the present application;
FIG. 11 is a flow chart of a method of processing a tissue pathology image provided by an embodiment of the present application;
fig. 12 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an apparatus for training an image processing model according to an embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of a server provided in an embodiment of the present application;
fig. 15 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In an exemplary embodiment, the image processing method and the training method of the image processing model provided by the embodiment of the application can be applied to the technical field of artificial intelligence. The artificial intelligence technique is described next.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like. The embodiments of the application provide an image processing method and an image processing model training method that relate to computer vision technology and machine learning technology.
Computer Vision (CV) technology is the science of studying how to make a machine "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets, and performs further image processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (Three-Dimensional) technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, smart transportation and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service, the Internet of Vehicles, smart traffic and the like.
In an exemplary embodiment, the image processing method and the training method of the image processing model provided in the embodiment of the present application are implemented in a blockchain system, and the image to be processed, the image processing model, the segmentation result of the image to be processed, and the like involved in the image processing method provided in the embodiment of the present application, and the sample image, the label information of the sample image, and the initial image processing model involved in the training method of the image processing model are all stored on a blockchain in the blockchain system, and are applied to each node device in the blockchain system, so as to ensure the security and reliability of data.
Fig. 1 is a schematic diagram illustrating an implementation environment provided by an embodiment of the present application. The implementation environment includes: a terminal 11 and a server 12.
The image processing method provided by the embodiment of the present application may be executed by the terminal 11, may also be executed by the server 12, and may also be executed by both the terminal 11 and the server 12, which is not limited in the embodiment of the present application. For the image processing method provided by the embodiment of the application, when the terminal 11 and the server 12 execute together, the server 12 undertakes the primary calculation work, and the terminal 11 undertakes the secondary calculation work; or, the server 12 undertakes the secondary computing work, and the terminal 11 undertakes the primary computing work; alternatively, the server 12 and the terminal 11 perform cooperative computing by using a distributed computing architecture.
The training method of the image processing model provided in the embodiment of the present application may be executed by the terminal 11, or may be executed by the server 12, or may be executed by both the terminal 11 and the server 12, which is not limited in the embodiment of the present application. For the case that the terminal 11 and the server 12 jointly execute the training method of the image processing model provided by the embodiment of the application, the server 12 undertakes the primary calculation work, and the terminal 11 undertakes the secondary calculation work; or, the server 12 undertakes the secondary computing work, and the terminal 11 undertakes the primary computing work; alternatively, the server 12 and the terminal 11 perform cooperative computing by using a distributed computing architecture.
The image processing method and the training method of the image processing model provided in the embodiment of the present application may be executed by the same device, or may be executed by different devices, which is not limited in the embodiment of the present application.
In one possible implementation manner, the terminal 11 may be any electronic product capable of performing human-Computer interaction with a user through one or more manners of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example, a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC, palmtop), a tablet Computer, a smart car, a smart television, a smart sound box, and the like. The server 12 may be a server, a server cluster composed of a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
It should be understood by those skilled in the art that the above-mentioned terminal 11 and server 12 are only examples, and other existing or future terminals or servers may be suitable for the present application and are included within the scope of the present application and are herein incorporated by reference.
Based on the implementation environment shown in fig. 1, the embodiment of the present application provides an image processing method, where the image processing method is executed by a computer device, and the computer device may be the server 12 or the terminal 11, which is not limited in the embodiment of the present application. As shown in fig. 2, the image processing method provided in the embodiment of the present application includes the following steps 201 to 204.
In step 201, an image to be processed and an image processing model are obtained, where the image processing model includes a coding model and a decoding model, the coding model includes an attention model and a convolution model, the attention model is coded based on global information, and the convolution model is coded based on local information.
The image to be processed is an image that needs to be processed to obtain a segmentation result. Exemplarily, the image to be processed includes a sub-image of a reference object; the sub-image of the reference object refers to the image area of interest in the image to be processed, and processing the image to be processed refers to segmenting the sub-image of the reference object from the image to be processed. The type of the image to be processed is not limited, and different types of images to be processed have different types of reference objects.
Illustratively, the image to be processed is a histopathology image obtained by image acquisition of pathological tissue in a pathology slide, and the reference object is tissue structures that are small in size, large in number and closely arranged in the histopathology image, such as cell nuclei, cells or blood vessels. The pathology slide may have a specific characteristic; for example, it may be a pathology slide of biological tissue in which a certain lesion occurs, such as animal or plant tissue with a certain lesion, or tumor tissue from a certain part of a human body. Illustratively, the pathology slide is a stained slide, for example a slide stained with an HE (Hematoxylin-Eosin) stain, and a histopathology image to be used as the image to be processed can be obtained by performing image acquisition on a certain field of view in the HE-stained pathology slide.
Illustratively, the image to be processed is a street view image obtained by image capturing a street scene, and the reference object is some specific element in the street, such as a vehicle, a pedestrian, and the like. Illustratively, the image to be processed is an indoor image obtained by image acquisition of an indoor scene, and the reference object is some specific facility in the indoor, such as a table, a chair, a bed, and the like.
The acquisition mode of the image to be processed is not limited in the embodiment of the application. In an exemplary embodiment, the manner in which the computer device acquires the image to be processed includes, but is not limited to: the computer equipment extracts an image to be processed from the image library; the image acquisition equipment which is in communication connection with the computer equipment sends the acquired image to be processed to the computer equipment; and the computer equipment acquires the manually uploaded images to be processed and the like. In an exemplary embodiment, the image to be processed may refer to an original image acquired by an image acquisition device, or may be an image obtained by preprocessing the original image acquired by the image acquisition device, which is not limited in the embodiment of the present application. Illustratively, the manner of preprocessing the original image includes, but is not limited to, at least one of cropping, rotating, flipping, data enhancement.
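Purely as an illustration (the embodiments do not prescribe any particular toolkit, and the use of PyTorch/torchvision here is an assumption), such preprocessing could be expressed as a transform pipeline; all parameter values below are arbitrary:

```python
import torchvision.transforms as T

# Hypothetical preprocessing pipeline for the original image: cropping, rotation,
# flipping and a simple form of data enhancement; all parameter values are arbitrary.
preprocess = T.Compose([
    T.RandomCrop(512),
    T.RandomRotation(degrees=15),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.1, contrast=0.1),
    T.ToTensor(),
])
```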
It should be noted that the number of images to be processed is one or more. When there are multiple images to be processed, a segmentation result is obtained for each image to be processed in the manner of steps 202 to 204. The embodiments of the present application are described taking one image to be processed as an example.
The image processing model is used for processing the image to be processed, the image processing model is obtained by training the initial image processing model, and the mode of obtaining the image processing model by training refers to the embodiment shown in fig. 7, which is not repeated herein. The image processing model may be obtained by training in a real-time training manner, or may be extracted from an image processing model trained and stored in advance, which is not limited in the embodiment of the present application.
The image processing model comprises a coding model and a decoding model, wherein the coding model is used for acquiring coding features, and the decoding model is used for decoding on the basis of the coding features to obtain image features. That is, the coding model and the decoding model are used jointly to extract the image features of the image. In the embodiments of the application, the coding model includes an attention model and a convolution model, wherein the attention model performs coding based on global information and the convolution model performs coding based on local information. Illustratively, the global information on which the attention model relies refers to the global information corresponding to the information (e.g., features, images, etc.) input into the attention model, and the local information on which the convolution model relies refers to the local information corresponding to the information (e.g., features, images, etc.) input into the convolution model. That is to say, in the encoding process, the coding model provided in the embodiments of the application can focus on both global information and local information, so that coding features with high reliability can be obtained, and thus image features with high reliability can be obtained. Illustratively, global information is obtained by attending to long-range dependencies.
The attention model is a model based on an attention mechanism and can acquire global context information (referred to as global information for short); the convolution model is a CNN (Convolutional Neural Network) model, a kind of feed-forward neural network that includes convolution operations and has a deep structure. The CNN model is able to acquire fine local features through powerful convolution operations. However, the convolution operation adopted by a CNN uses weight sharing, and the scope of the receptive field is limited by the size of the convolution kernel and the depth of the network, so a CNN-based network structure usually suffers from an insufficient receptive field, especially when the size of the reference object is large. An attention model based on the attention mechanism can overcome this limitation well. That is, by designing a coding model that includes both an attention model and a convolution model, the advantages of the two can be combined, preserving global information while acquiring local features.
In an exemplary embodiment, the image processing model comprises one or more coding models, each coding model comprising at least one attention model and one convolution model. In the case where the image processing model includes a plurality of coding models, different coding models have different arrangement positions: a coding model arranged at a front position performs its coding operation earlier, and a coding model arranged at a rear position performs its coding operation later. The attention models and convolution models included in different coding models may be the same or different, which is not limited in the embodiments of the application.
Illustratively, the number of decoding models in the image processing model is the same as the number of coding models, and for the case where the number of coding models is N (N is an integer not less than 1), the number of decoding models is also N. For example, in the case where the number of decoding models is plural, different decoding models have different arrangement positions, and the decoding model arranged at the front performs the decoding operation first, and the decoding model arranged at the rear performs the decoding operation later. For example, in the case where the number of the coding models and the number of the decoding models are both plural, the plural coding models and the plural decoding models constitute a U-shaped structure.
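To make this coding/decoding organization concrete, the following is a minimal PyTorch-style sketch; the framework, the class and variable names, the channel counts, and the use of max pooling and transposed convolution are assumptions made for illustration rather than the embodiments' definitive implementation. For brevity, the attention model inside each coding model is omitted here (it is sketched separately later), and the decoding path is simplified to one stage fewer than the coding path.

```python
import torch
import torch.nn as nn

class CodingModel(nn.Module):
    """Hypothetical coding model: a convolution model whose output is kept as the
    connection feature, plus 2x2 pooling to produce the basic feature that feeds
    the next coding model. The attention model is omitted in this sketch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        connection = self.conv(x)       # connection feature (kept for the decoding stage)
        basic = self.pool(connection)   # basic feature (input of the next coding model)
        return basic, connection

class DecodingModel(nn.Module):
    """Hypothetical decoding model: upsample, then fuse one connection feature."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, x, connection):
        return self.conv(torch.cat([self.up(x), connection], dim=1))

class ImageProcessingModel(nn.Module):
    """U-shaped layout: the connection feature of the last coding model serves as
    the target coding feature; the connection features of the earlier coding
    models support the decoding models."""
    def __init__(self, channels=(3, 32, 64, 128), num_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            CodingModel(channels[i], channels[i + 1]) for i in range(len(channels) - 1)
        )
        self.decoders = nn.ModuleList(
            DecodingModel(channels[i + 1], channels[i], channels[i])
            for i in range(len(channels) - 2, 0, -1)
        )
        self.seg_head = nn.Conv2d(channels[1], num_classes, kernel_size=1)

    def forward(self, x):
        connections = []
        basic = x
        for encoder in self.encoders[:-1]:
            basic, connection = encoder(basic)
            connections.append(connection)
        # The last coding model only contributes its connection feature,
        # which serves as the target coding feature.
        _, target_coding = self.encoders[-1](basic)
        feature = target_coding
        for decoder, connection in zip(self.decoders, reversed(connections)):
            feature = decoder(feature, connection)   # target image features
        return self.seg_head(feature)                # segmentation result

x = torch.randn(1, 3, 64, 64)
print(ImageProcessingModel()(x).shape)  # torch.Size([1, 2, 64, 64])
```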
After acquiring the image to be processed and the image processing model, the subsequent steps 202 to 204 are performed to acquire the segmentation result of the image to be processed.
In step 202, an attention model and a convolution model are called to encode the image to be processed based on the global information and the local information, and a target encoding characteristic is obtained.
And after the image to be processed and the image processing model are obtained, an attention model and a convolution model in the coding model are called to code the image to be processed based on the global information and the local information, so that the target coding feature is obtained. The target coding features are features to be decoded corresponding to the images to be processed.
In a possible implementation manner, the number of coding models is at least one. For the case where a first coding model, a penultimate coding model and a last coding model exist among the at least one coding model, the process of calling the attention model and the convolution model to encode the image to be processed based on the global information and the local information to obtain the target coding features includes the following steps 2021 to 2023:
step 2021: and calling an attention model and a convolution model in the first coding model to code the image to be processed based on the global information and the local information to obtain the basic characteristics and the connection characteristics output by the first coding model.
The connection features output by the first coding model are obtained based on the basic features output by the first coding model, and since the coding process of the second coding model needs to utilize the basic features output by the first coding model and the decoding process of the decoding model needs to utilize the connection features output by the first coding model, after an attention model and a convolution model in the first coding model are called to code an image to be processed based on global information and local information, the basic features and the connection features output by the first coding model need to be obtained.
According to step 201, the number of coding models is one or more, and each coding model includes one or more attention models and a convolution model. That is, the first coding model includes a convolution model and one or more attention models. The implementation of this step 2021 differs depending on the attention model in the first coding model.
In an exemplary embodiment, the attention model in the first coding model comprises at least one of a first attention model and a second attention model. That is, the first coding model may include the first attention model, may include the second attention model, and may include both the first attention model and the second attention model. The first attention model and the second attention model have different functions, and illustratively, the model structures of the first attention model and the second attention model are the same, but the model parameters of the first attention model and the second attention model are different due to the different functions. In an exemplary embodiment, the first attention model functions to derive the underlying features by co-acting with a convolution model in the first coding model, and the second attention model functions to derive the connection features by coding the underlying features.
In one possible implementation, referring to fig. 3, in the case that the attention model in the first coding model comprises the first attention model, the implementation process of this step 2021 comprises the following steps 2021A and 2021B.
Step 2021A: and calling a convolution model and a first attention model in the first coding model to code the image to be processed based on the local information and the global information to obtain the basic characteristics output by the first coding model.
Implementations of this step 2021A include, but are not limited to, the following three manners, which are described in turn below.
The first implementation mode is realized based on the steps (1) to (4):
step (1): and calling a convolution model in the first coding model to code the image to be processed based on the local information to obtain a first coding characteristic.
Convolution models are used to extract features through convolution operations that focus on local information. The embodiment of the application does not limit the model structure of the convolution model in the first coding model, the process of coding the image to be processed based on the local information is the internal processing process of the convolution model, and the specific process of calling the convolution models with different model structures to code the image to be processed based on the local information may be different. In an exemplary embodiment, each coding model includes a convolution model, and the convolution models in different coding models may have the same or different model structures. Illustratively, the convolution models in different coding models have the same model structure, so as to improve the convergence speed in the model training process.
In one possible implementation manner, the convolution model in the first coding model includes a convolution module and a pooling module, and the convolution model in the first coding model is called to code the image to be processed based on the local information, so as to obtain the first coding feature in the following manner: calling a convolution module to process an image to be processed to obtain convolution characteristics; and calling a pooling module to process the convolution characteristics to obtain first coding characteristics. Illustratively, the convolution model in the first coding model includes one or more convolution modules and a pooling module. For example, in the case that the convolution model includes a plurality of convolution modules, the structures of the different convolution modules may be the same or different, and this is not limited in this embodiment of the present application.
Illustratively, for the case that the number of the convolution modules is multiple, the plurality of convolution modules are sequentially connected in series, the convolution module is called to process the image to be processed, and the process of obtaining the convolution characteristics is as follows: inputting an image to be processed into a first convolution module to obtain characteristics output by the first convolution module; and inputting the features output by the last convolution module into the next convolution module from the second convolution module to obtain the features output by the next convolution module until the features output by the last convolution module are obtained, and taking the features output by the last convolution module as convolution features.
Each convolution module is composed of a convolution layer and an activation layer. The size of the convolution kernel of the convolution layer is set empirically or flexibly adjusted according to the actual application scenario, which is not limited in the embodiments of the application; for example, the convolution kernel size of the convolution layer is 3 × 3. The activation function used by the activation layer is likewise set empirically or flexibly adjusted according to the actual application scenario; for example, the activation function is a ReLU (Rectified Linear Unit) function or a Sigmoid function. In the process of calling a convolution module to process features (or an image), the convolution layer is called first, and the activation layer is then called to process the features output by the convolution layer.
The pooling module is exemplarily composed of one pooling layer, and the embodiments of the application do not limit the type of the pooling layer; for example, the pooling layer in the pooling module is a max pooling layer or an average pooling layer. The size of the sampling kernel of the pooling layer is also not limited; for example, the size of the sampling kernel is 2 × 2. When the sampling kernel of the pooling layer is 2 × 2, the pooling layer downsamples the convolution features to 1/4 of their original size. Illustratively, if the size of the convolution features obtained by calling the convolution modules to process the image to be processed is the same as the size of the image to be processed, then the size of the first coding feature obtained by calling the pooling module to process the convolution features is 1/4 of the size of the image to be processed.
Illustratively, the convolution model in the first coding model includes two convolution modules connected in series, each convolution module including a convolution layer and an activation layer, and a pooling module including a pooling layer. The convolution layer and the activation layer in the first convolution module are denoted convolution layer 1 and activation layer 1, and the convolution layer and the activation layer in the second convolution module are denoted convolution layer 2 and activation layer 2. The process of calling the convolution model in the first coding model to encode the image to be processed based on the local information to obtain the first coding feature is then shown in fig. 4: the image to be processed is input into the convolution model in the first coding model, processed sequentially by convolution layer 1, activation layer 1, convolution layer 2, activation layer 2 and the pooling layer, and the first coding feature is output.
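As a concrete reading of fig. 4, the following is a minimal PyTorch sketch of such a convolution model (two convolution modules, each consisting of a 3 × 3 convolution layer and a ReLU activation layer, followed by a 2 × 2 max-pooling module); the framework and the channel counts are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConvolutionModel(nn.Module):
    """Sketch of the convolution model in the first coding model (fig. 4):
    convolution layer 1 -> activation layer 1 -> convolution layer 2 ->
    activation layer 2 -> pooling layer."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.conv_module_1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),   # convolution layer 1 (3x3)
            nn.ReLU(inplace=True),                                            # activation layer 1
        )
        self.conv_module_2 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),  # convolution layer 2 (3x3)
            nn.ReLU(inplace=True),                                            # activation layer 2
        )
        self.pooling_module = nn.MaxPool2d(kernel_size=2)  # 2x2 sampling kernel: output is 1/4 the size

    def forward(self, image):
        conv_feature = self.conv_module_2(self.conv_module_1(image))
        first_coding_feature = self.pooling_module(conv_feature)
        return first_coding_feature

# A 3-channel 64x64 image yields a first coding feature of spatial size 32x32 (1/4 of the area).
x = torch.randn(1, 3, 64, 64)
print(ConvolutionModel()(x).shape)  # torch.Size([1, 64, 32, 32])
```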
Step (2): and calling the first attention model to encode the image to be processed based on the global information to obtain a second encoding characteristic.
In a first implementation, the processing branch of the first attention model is in a parallel relationship with the processing branch of the convolution model. And calling the first attention model to encode the image to be processed based on the global information, wherein the process of encoding the image to be processed is an internal processing process of the first attention model, and the specific processing mode is related to the model structure of the first attention model. The embodiment of the present application does not limit the model structure of the first attention model as long as the processing is performed by the attention mechanism. Illustratively, the first attention model is a Transformer model.
In an exemplary embodiment, the computational complexity of the first attention model is proportional to the square of the number of tokens (Token), i.e., the sequence length, and calling the first attention model to encode the image to be processed based on the global information means calling the first attention model to encode the reference features of the image to be processed based on the global information. The reference features of the image to be processed are obtained based on the image blocks of the image to be processed; compared with directly using each pixel point in the image to be processed as a token, using each image block as a token helps to reduce the number of tokens and thus the computational complexity. For example, the process of acquiring the reference features of the image to be processed can be regarded as a process of serializing the image to be processed.
Before calling the first attention model to encode the reference features of the image to be processed based on the global information, the reference features of the image to be processed need to be acquired. The reference feature of the image to be processed is a feature which is suitable for the attention model and corresponds to the image to be processed. In one possible implementation, the process of acquiring the reference feature of the image to be processed includes the following steps 1-1 to 1-3:
step 1-1: the method comprises the steps of obtaining block features of each image block of an image to be processed, and mapping the block features of each image block to obtain the mapping features of each image block.
The image blocks of the image to be processed are obtained by segmenting the image to be processed; for example, a ViT (Vision Transformer) model is called to segment the image to be processed into image blocks. Illustratively, the image blocks of the image to be processed all have the same size, denoted P × P (pixels). The size of an image block is set empirically or flexibly adjusted according to the size of the image to be processed and the application scenario, which is not limited in the embodiments of the application; for example, the size of an image block is 2 × 2 (i.e., P = 2). The total number of image blocks depends on the size of the image to be processed and the size of an image block; for example, assuming that the size of the image to be processed is H × W and the size of an image block is P × P, the total number M of image blocks is M = HW/P².
After the image blocks of the image to be processed are obtained, the reference features of the image to be processed are acquired based on these image blocks, so as to reduce the computational complexity. For example, if the size of an image block is 2 × 2, the computational complexity can be reduced to 1/4 of the computational complexity incurred when each pixel point in the image to be processed is directly used as a token. The process of obtaining the reference features of the image to be processed based on the image blocks can be regarded as a process of re-integrating the image blocks into tokens, or as a process of image serialization.
In the process of acquiring the reference features of the image to be processed based on each image block in the image to be processed, the mapping features of each image block need to be acquired first, and the mapping features of the image blocks are obtained by mapping the block features of the image blocks and are used for visually embodying the features of pixel points in the image blocks.
In one possible implementation manner, the block feature of any image block in the image to be processed is obtained as follows: the features of each pixel point in the image block are concatenated in sequence to obtain the block feature of the image block. For example, assuming that the number of channels of the image block is C (C is an integer not less than 1), the dimension of the block feature is 1 × (P²C).
After the block features of any image block are obtained, the block features of any image block are mapped to obtain the mapping features of any image block. The purpose of mapping the block features of any image block is to map the block features of any image block into a designated integration space for ease of computation. The designated integration space is set according to an application scenario, and illustratively, the designated integration space refers to a D (D is an integer not less than 1) dimensional integration space.
In an exemplary embodiment, the block feature of any image block is mapped in the following manner: the block feature of the image block is multiplied by the conversion feature, and the resulting product is used as the mapping feature of the image block. The conversion feature is a feature set for multiplication with block features in order to map them into the designated integration space. Illustratively, taking the designated integration space as a D-dimensional integration space and the dimension of a block feature as 1 × (P²C), the conversion feature has a dimension of (P²C) × D, so that multiplying the block feature of the image block by the conversion feature yields a mapping feature with a dimension of 1 × D. For example, the features mentioned in the embodiments of the application may take the form of vectors or matrices, which is not limited in the embodiments of the application.
According to the mode of obtaining the block characteristics of any image block, the block characteristics of each image block can be obtained; according to the method for acquiring the mapping characteristics of any image block, the mapping characteristics of each image block can be acquired.
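A minimal sketch of this serialization and mapping step follows, assuming a PyTorch tensor of layout (C, H, W), a learnable conversion feature of dimension (P²C) × D, and illustrative variable names; the pixel ordering within a block is one possible convention, and none of these choices are prescribed by the embodiments.

```python
import torch

def block_and_mapping_features(image: torch.Tensor, P: int, conversion: torch.Tensor) -> torch.Tensor:
    """image: (C, H, W); P: image block size; conversion: (P*P*C, D) conversion feature.
    Returns the mapping features of all image blocks, shape (M, D) with M = H*W / P^2."""
    C, H, W = image.shape
    # Split the image into non-overlapping P x P image blocks and concatenate the
    # features of the pixel points in each block into a block feature of dimension P*P*C.
    blocks = image.unfold(1, P, P).unfold(2, P, P)                         # (C, H/P, W/P, P, P)
    block_features = blocks.permute(1, 2, 0, 3, 4).reshape(-1, C * P * P)  # (M, P*P*C)
    # Map each block feature into the D-dimensional designated integration space.
    return block_features @ conversion                                     # (M, D)

# Example: H = W = 4, P = 2, C = 3, D = 8, so M = HW/P^2 = 4 image blocks.
image = torch.randn(3, 4, 4)
conversion = torch.randn(2 * 2 * 3, 8)
print(block_and_mapping_features(image, 2, conversion).shape)  # torch.Size([4, 8])
```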
Step 1-2: and acquiring the position characteristics of the image block.
In the process of acquiring the reference features of the image to be processed based on each image block in the image to be processed, the mapping features of each image block and the position features of the image blocks need to be acquired. The image block position characteristics are determined based on the spatial position of each image block in the image to be processed and are used for reflecting spatial information. Illustratively, the spatial position of any image block in the image to be processed is represented by the number of the image block, for example, if any image block is the h-th (h is an integer not less than 1), the spatial position of any image block in the image to be processed is denoted as h.
In an exemplary embodiment, the image block position feature E_pos is calculated according to formula 1 (reproduced as an image in the original publication), where pos denotes the spatial position of an image block in the image to be processed, pos = 1, 2, 3, …, M (M is the total number of image blocks), and i denotes a dimension of the position feature, i = 1, 2, 3, …, D (D is the dimension of the designated integration space).
Step 1-3: and acquiring the reference characteristics of the image to be processed based on the mapping characteristics and the image block position characteristics of each image block.
The reference features of the image to be processed are integrated with the position information of the image block, which is beneficial to keeping the spatial information of the image block. In an exemplary embodiment, based on the mapping feature and the image block location feature of each image block, the manner of obtaining the reference feature of the image to be processed is as follows: and acquiring the splicing characteristics of the mapping characteristics of each image block, and performing element-to-element addition operation on the splicing characteristics of the mapping characteristics of each image block and the position characteristics of the image block to obtain the reference characteristics of the image to be processed. Illustratively, the dimension of the stitching feature of the mapping feature of each image block is the same as the dimension of the image block location feature in order to perform the inter-element addition operation. Illustratively, the reference feature of the image to be processed is calculated based on equation 2:
z = [x_p^1·E; x_p^2·E; …; x_p^M·E] + E_pos        (equation 2)
wherein z represents the reference feature of the image to be processed; E denotes the conversion feature; x_p^j (j = 1, 2, 3, …, M) denotes the block feature of the j-th image block; x_p^j·E denotes the mapping feature of the j-th image block; [x_p^1·E; x_p^2·E; …; x_p^M·E] represents the stitching feature of the mapping features of the image blocks; and E_pos represents the image block position feature.
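Continuing the sketch above, the reference feature can be assembled by stitching the mapping features of the image blocks and adding the image block position feature element-wise. Since formula 1 is only available as an image in the original publication, the sinusoidal form of E_pos used below is an assumption rather than the patent's definition:

```python
import torch

def reference_feature(mapping: torch.Tensor) -> torch.Tensor:
    """mapping: (M, D) stitched mapping features of the image blocks.
    Returns z = mapping + E_pos (inter-element addition). The sinusoidal
    definition of E_pos below is an assumption, not taken from the patent."""
    M, D = mapping.shape
    pos = torch.arange(M, dtype=torch.float32).unsqueeze(1)   # (M, 1) block positions
    i = torch.arange(D, dtype=torch.float32).unsqueeze(0)     # (1, D) feature dimensions
    angle = pos / torch.pow(torch.tensor(10000.0), 2 * torch.div(i, 2, rounding_mode="floor") / D)
    e_pos = torch.where(i.long() % 2 == 0, torch.sin(angle), torch.cos(angle))
    return mapping + e_pos                                    # reference feature z

z = reference_feature(torch.randn(4, 8))   # 4 image blocks, D = 8
```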
And after the reference features of the image to be processed are obtained, calling the first attention model to encode the reference features of the image to be processed based on the global information to obtain second encoding features. The implementation manner of invoking the first attention model to encode the reference feature of the image to be processed based on the global information is related to the model structure of the first attention model, which is not limited in the embodiment of the present application.
In an exemplary embodiment, the first attention model includes an attention module and a non-linear processing module, the first attention model is called to encode the reference feature of the image to be processed based on the global information, and the process of obtaining the second encoding feature is as follows: calling an attention module to process the reference feature to obtain a first intermediate feature; splicing the first intermediate characteristic and the reference characteristic to obtain a characteristic to be processed; calling a nonlinear processing module to process the feature to be processed to obtain a second intermediate feature; and splicing the second intermediate characteristic and the characteristic to be processed to obtain a second coding characteristic.
The attention module is configured to process the reference feature based on an attention mechanism. Illustratively, the attention module includes a normalization layer and a self-attention layer, e.g., the self-attention layer is a Multi-head Self Attention (MSA) layer. The nonlinear processing module is configured to perform nonlinear processing on the feature to be processed to increase the nonlinearity of the processing result. Illustratively, the nonlinear processing module includes a normalization layer and a nonlinear processing layer, where the nonlinear processing layer is, for example, a Multi-Layer Perceptron (MLP) layer. The normalization layer is used to normalize features, and the embodiment of the present application does not limit the type of the normalization layer; for example, the normalization layer is a Layer Normalization (LN) layer, a Batch Normalization (BN) layer, or the like. The normalization layer included in the attention module and the normalization layer included in the nonlinear processing module may be the same or different, and this is not limited in this embodiment of the application.
The embodiment of the present application does not limit the manner of splicing two features; for example, the two features are spliced transversely, or spliced longitudinally, or subjected to an element-wise addition operation.
Illustratively, assume that the first attention model includes an attention module and a nonlinear processing module, the attention module includes a normalization layer and an MSA layer, the nonlinear processing module includes a normalization layer and an MLP layer, the normalization layer in the attention module is denoted as normalization layer 1, and the normalization layer in the nonlinear processing module is denoted as normalization layer 2. The process of calling the first attention model to encode the reference feature of the image to be processed based on the global information to obtain the second encoding feature is shown in fig. 5. Inputting the reference characteristics into a first attention model, and sequentially processing the reference characteristics through a normalization layer 1 and an MSA layer to obtain first intermediate characteristics; splicing the first intermediate characteristic and the reference characteristic to obtain a characteristic to be processed; the features to be processed are sequentially processed by the normalization layer 2 and the MLP layer to obtain second intermediate features; and splicing the second intermediate characteristic and the characteristic to be processed to obtain a second coding characteristic.
In an exemplary embodiment, in the case that the first attention model includes an attention module and a nonlinear processing module, the attention module includes an LN layer and an MSA layer, and the nonlinear processing module includes an LN layer and an MLP layer, the first attention model is called to encode the reference feature of the image to be processed based on the global information, and the process of obtaining the second encoding feature is implemented based on formula 3:
z'_l = MSA(LN(z_{l-1})) + z_{l-1},  z_l = MLP(LN(z'_l)) + z'_l   (formula 3)

where z_{l-1} represents the feature input into the first attention model, i.e., the reference feature of the image to be processed; LN(z_{l-1}) represents the feature obtained after the reference feature is processed by the LN layer in the attention module; MSA(LN(z_{l-1})) represents the first intermediate feature; z'_l represents the feature to be processed; LN(z'_l) represents the feature obtained after the feature to be processed is processed by the LN layer in the nonlinear processing module; MLP(LN(z'_l)) represents the second intermediate feature; and z_l represents the second coding feature.
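As an illustrative aid, the following is a minimal sketch of one attention block of the kind formula 3 describes: LN and MSA with a residual connection, followed by LN and MLP with a residual connection. The embedding size, the number of heads, the MLP ratio, and the use of torch.nn.MultiheadAttention are assumptions for the sketch.

```python
# Minimal sketch (assumed sizes): one first-attention-model block implementing
# z'_l = MSA(LN(z_{l-1})) + z_{l-1} and z_l = MLP(LN(z'_l)) + z'_l.
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                       # normalization layer 1
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                       # normalization layer 2
        self.mlp = nn.Sequential(                            # nonlinear processing layer
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):                                    # z: [B, M, D] reference feature
        h = self.norm1(z)
        first_intermediate, _ = self.msa(h, h, h)            # MSA(LN(z_{l-1}))
        z_prime = first_intermediate + z                     # feature to be processed
        second_intermediate = self.mlp(self.norm2(z_prime))  # MLP(LN(z'_l))
        return second_intermediate + z_prime                 # second coding feature z_l


z = torch.randn(1, 196, 768)
print(AttentionBlock()(z).shape)                             # torch.Size([1, 196, 768])
```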
It should be noted that the first attention model including an attention module and a nonlinear processing module is only one exemplary structure of the first attention model, and the embodiment of the present application is not limited thereto. Illustratively, the first attention model may also include only the attention module. Likewise, the attention module including a normalization layer and a self-attention layer and the nonlinear processing module including a normalization layer and a nonlinear processing layer are only exemplary structures of the attention module and the nonlinear processing module; illustratively, the attention module may include only the self-attention layer, and the nonlinear processing module may include only the nonlinear processing layer.
It should be noted that interpreting the invoking of the first attention model to encode the image to be processed based on the global information as invoking the first attention model to encode the reference feature of the image to be processed based on the global information is only an exemplary description, and the embodiment of the present application is not limited thereto. In an exemplary embodiment, invoking the first attention model to encode the image to be processed based on the global information may also refer to invoking the first attention model to encode pixel-level features of the image to be processed based on the global information. The pixel-level features of the image to be processed are obtained based on the features of each pixel point in the image to be processed; in this case, a pixel in the image to be processed is used as a Token. For the process of calling the first attention model to encode the pixel-level features of the image to be processed based on the global information to obtain the second coding feature, reference can be made to the process of calling the first attention model to encode the reference feature of the image to be processed based on the global information to obtain the second coding feature, which is not described herein again.
Step (3): fuse the first coding feature and the second coding feature to obtain a fusion feature.
The first coding feature is obtained by calling the convolution model to encode the image to be processed based on the local information and can focus on the local information. The second coding feature is obtained by calling the first attention model to encode the image to be processed based on the global information and can focus on the global information. The first coding feature and the second coding feature are fused to obtain a fusion feature that comprehensively focuses on the global information and the local information.
In one possible implementation manner, the process of fusing the first coding feature and the second coding feature to obtain the fusion feature is as follows: convert the second coding feature into a feature to be connected in parallel that has the same size as the first coding feature, and connect the feature to be connected in parallel and the first coding feature in parallel in the channel dimension to obtain the fusion feature.
Exemplarily, the feature to be connected in parallel may refer to a feature obtained by converting only the size of the second encoding feature; the feature obtained by converting the size of the second coding feature and the number of channels of the second coding feature may also be used, which is not limited in the embodiment of the present application. The embodiment of the present application takes the feature to be connected in parallel as an example, which is obtained by converting the size of the second coding feature and converting the number of channels of the second coding feature.
Illustratively, the shape of the first coding feature is [H_l, W_l, C] and the shape of the second coding feature is [N, D]. The shape of the second coding feature is first converted to [H_l, W_l, D], and then the number of channels is transformed by using a 1 × 1 convolution kernel to finally obtain a feature to be connected in parallel with shape [H_l, W_l, C]. Then, the first coding feature and the feature to be connected in parallel are connected in parallel in the channel dimension to obtain a fusion feature with shape [H_l, W_l, 2C].
It should be noted that, the manner of fusing the first coding feature and the second coding feature to obtain the fused feature is only an exemplary description, and the embodiments of the present application are not limited thereto. In an exemplary embodiment, the first coding feature and the second coding feature are fused to obtain a fused feature in the following manner: converting the second coding features into features to be connected in parallel, wherein the features are the same as the first coding feature in size and channel number; and taking the average characteristic or the addition characteristic of the characteristics to be connected in parallel and the first coding characteristic as the obtained fusion characteristic.
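As a concrete aid, the sketch below performs the channel-dimension parallel-connection fusion described above: the token-form second coding feature [N, D] is reshaped back to a spatial layout, its channel number is adjusted with a 1 × 1 convolution, and it is concatenated with the first coding feature; the element-wise addition variant is also shown. All sizes are assumed for illustration.

```python
# Minimal sketch (assumed sizes): fuse a convolutional feature map [B, C, H_l, W_l]
# with a token-form attention feature [B, N, D] by reshaping, adjusting channels
# with a 1x1 convolution, and concatenating along the channel dimension.
import torch
import torch.nn as nn

B, C, Hl, Wl, D = 1, 64, 28, 28, 768
first_coding = torch.randn(B, C, Hl, Wl)          # output of the convolution branch
second_coding = torch.randn(B, Hl * Wl, D)        # output of the attention branch, N = Hl*Wl

# Convert [B, N, D] -> [B, D, Hl, Wl], then a 1x1 conv turns D channels into C channels.
to_parallel = second_coding.transpose(1, 2).reshape(B, D, Hl, Wl)
channel_proj = nn.Conv2d(D, C, kernel_size=1)
to_parallel = channel_proj(to_parallel)           # feature to be connected in parallel, [B, C, Hl, Wl]

fusion = torch.cat([first_coding, to_parallel], dim=1)   # [B, 2C, Hl, Wl]
# Alternative fusions mentioned above: element-wise average or addition instead of concatenation.
fusion_add = first_coding + to_parallel
print(fusion.shape, fusion_add.shape)
```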
Step (4): acquire the basic feature output by the first coding model based on the fusion feature.
The fusion feature is a feature obtained by comprehensively focusing on global information and local information, and the reliability of the basic feature output by the first coding model obtained based on the fusion feature is high. The base features of the first coding model output are used as input for the next coding model. Based on the fusion features, the implementation manner of obtaining the basic features output by the first coding model is flexibly set as required, which is not limited in the embodiment of the present application.
In one possible implementation, the fused features are directly used as the basis features for the first coding model output.
In another possible implementation manner, the first coding model further includes a third attention model, and the third attention model is called to encode the fusion feature based on the global information to obtain the basic feature output by the first coding model. The third attention model refers to an attention model for encoding the feature from which the basic feature is directly acquired. The principle of calling the third attention model to encode the fusion feature based on the global information is the same as the principle of calling the first attention model to encode the image to be processed based on the global information, and details are not repeated here.
The second implementation mode is realized based on the steps (A) to (C):
Step (A): call the convolution model in the first coding model to encode the image to be processed based on the local information to obtain the first coding feature.
For the implementation of step (A), refer to step (1) in the first implementation manner, which is not described herein again.
Step (B): call the first attention model to encode the first coding feature based on the global information to obtain a third coding feature.
In a second implementation, the processing branch of the first attention model is in a series relationship with the processing branch of the convolution model, and the processing branch of the first attention model is in series after the processing branch of the convolution model. That is, the processing branch of the first attention model is used to encode the first coding feature output by the processing branch of the convolution model.
Illustratively, the first coding feature is in the same form as the representation of the image to be processed, for example, the first coding feature and the image to be processed are both represented by a matrix, or the first coding feature and the image to be processed are both represented by a vector; or, the first coding feature and the image to be processed are both represented by an image. The principle of calling the first attention model to encode the first coding feature based on the global information to obtain the third coding feature is the same as the principle of calling the first attention model to encode the image to be processed based on the global information to obtain the second coding feature in step (2) in the first implementation manner, and details are not repeated here. It should be noted that, for the case that invoking the first attention model to encode the image to be processed based on the global information in step (2) refers to invoking the first attention model to encode the reference feature of the image to be processed based on the global information, invoking the first attention model to encode the first encoding feature based on the global information in step (B) refers to invoking the first attention model to encode the reference feature of the first encoding feature based on the global information. The obtaining principle of the reference feature of the first coding feature is the same as that of the reference feature of the image to be processed, and the first coding feature only needs to be represented in the form of the image.
Step (C): acquire the basic feature output by the first coding model based on the third coding feature.
The implementation principle of step (C) is the same as that of step (4) in the first implementation manner, and is not described here again.
The third implementation mode is realized based on the steps (I) to (III):
Step (I): call the first attention model to encode the image to be processed based on the global information to obtain the second coding feature.
For the implementation of step (I), refer to step (2) in the first implementation manner, which is not described herein again.
Step (II): call the convolution model in the first coding model to encode the second coding feature based on the local information to obtain a fourth coding feature.
In a third implementation, the processing branch of the first attention model is in a series relationship with the processing branch of the convolution model, and the processing branch of the convolution model is in series after the processing branch of the first attention model. That is, the processing branch of the convolution model is used to encode the second encoding feature output by the processing branch of the first attention model.
Illustratively, the second coding feature is in the same form as the representation of the image to be processed, for example, the second coding feature and the image to be processed are both represented by a matrix, or the second coding feature and the image to be processed are both represented by a vector; or, the second coding feature and the image to be processed are both represented by an image. The principle of calling the convolution model in the first coding model to code the second coding feature based on the local information to obtain the fourth coding feature is the same as the principle of calling the convolution model in the first coding model to code the image to be processed based on the local information in step (1) in the first implementation manner to obtain the first coding feature, and details are not repeated here.
Step (III): acquire the basic feature output by the first coding model based on the fourth coding feature.
The implementation principle of step (III) is the same as that of step (4) in the first implementation manner, and is not described here again.
Step 2021B: acquire the connection feature output by the first coding model based on the basic feature output by the first coding model.
The connection characteristics of the first coding model output are used to provide data support for performing step 203, that is, the connection characteristics of the first coding model output are utilized in performing step 203. In a case that the attention model in the first coding model includes the first attention model, the manner of obtaining the connection feature output by the first coding model based on the basic feature output by the first coding model is set empirically or flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application.
In an exemplary embodiment, based on the basic features of the first coding model output, the manner of obtaining the connection features of the first coding model output is as follows: and directly taking the basic features output by the first coding model as the connection features output by the first coding model.
In an exemplary embodiment, the attention model in the first coding model further comprises a second attention model, and the second attention model is used for coding the basic features to obtain the connection features. In this case, based on the basic feature output by the first coding model, the manner of obtaining the connection feature output by the first coding model is as follows: and calling a second attention model to encode the basic features output by the first coding model based on the global information to obtain the connection features output by the first coding model.
Illustratively, the model structure of the second attention model is the same as that of the first attention model, and illustratively, the basic features output by the first coding model are in the same representation form as that of the image to be processed, for example, the basic features output by the first coding model and the image to be processed are both represented by a matrix, or the basic features output by the first coding model and the image to be processed are both represented by a vector; or, the basic feature output by the first coding model and the image to be processed are both represented by the image. The principle of calling the second attention model to encode the basic features output by the first coding model based on the global information to obtain the connection features output by the first coding model is the same as the principle of calling the first attention model to encode the image to be processed based on the global information in step (2) in the first implementation manner to obtain the second coding features, and details are not repeated here.
In one possible implementation, referring to fig. 3, in the case that the attention model in the first coding model comprises the second attention model, the implementation process of this step 2021 comprises the following steps 2021a and 2021 b.
Step 2021a: call the convolution model in the first coding model to encode the image to be processed based on the local information to obtain the basic feature output by the first coding model.
In the case that the attention model in the first coding model includes the second attention model, the convolution model in the first coding model is called to code the image to be processed based on the local information, and the mode of obtaining the basic feature output by the first coding model is set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application.
In an exemplary embodiment, a convolution model in the first coding model is called to code the image to be processed based on the local information, and the manner of obtaining the basic feature output by the first coding model is as follows: and directly using the feature obtained by calling the convolution model in the first coding model and coding the image to be processed based on the local information as the basic feature output by the first coding model.
In an exemplary embodiment, the attention model in the first coding model further includes a first attention model, in this case, the convolution model in the first coding model is called to code the image to be processed based on the local information, and the manner of obtaining the basic feature output by the first coding model is as follows: and calling a convolution model and a first attention model in the first coding model to code the image to be processed based on the local information and the global information to obtain the basic characteristics output by the first coding model. See step 2021A for implementation of this process, which is not described herein again.
Step 2021b: call the second attention model to encode the basic feature output by the first coding model based on the global information to obtain the connection feature output by the first coding model.
Since the function of the second attention model is to derive the connection feature by encoding the basic feature, in the case that the attention model in the first coding model includes the second attention model, the connection feature output by the first coding model is obtained by calling the second attention model to encode the basic feature output by the first coding model based on the global information.
Illustratively, the model structure of the second attention model is the same as that of the first attention model, and illustratively, the basic features output by the first coding model are in the same representation form as that of the image to be processed, for example, the basic features output by the first coding model and the image to be processed are both represented by a matrix, or the basic features output by the first coding model and the image to be processed are both represented by a vector; or, the basic feature output by the first coding model and the image to be processed are both represented by the image. The principle of calling the second attention model to encode the basic features output by the first coding model based on the global information to obtain the connection features output by the first coding model is the same as the principle of calling the first attention model to encode the image to be processed based on the global information in step (2) in the first implementation manner to obtain the second coding features, and details are not repeated here.
In one possible implementation, referring to fig. 3, in the case that the attention model in the first coding model includes a first attention model and a second attention model, the implementation process of step 2021 includes the following steps 20211 and 20212.
Step 20211: call the convolution model and the first attention model in the first coding model to encode the image to be processed based on the local information and the global information to obtain the basic feature output by the first coding model.
See step 2021A for an implementation of this step 20211, which is not described herein again.
Step 20212: call the second attention model to encode the basic feature output by the first coding model based on the global information to obtain the connection feature output by the first coding model.
See step 2021b for an implementation of this step 20212, which is not described herein again.
Step 2022: starting from the second coding model, call the attention model and the convolution model in the next coding model to encode the basic feature output by the previous coding model based on the global information and the local information to obtain the basic feature and the connection feature output by the next coding model, until the basic feature and the connection feature output by the second-to-last coding model are obtained.
The connection features output by each coding model from the first coding model to the second-to-last coding model are used for providing data support for the step (namely step 203) of calling the decoding model to decode the target coding features and obtain the target image features. That is, in performing step 203, the connection characteristics of the respective coding model outputs from the first coding model to the second-to-last coding model are utilized.
The basic feature output by the previous coding model has the same representation form as the image to be processed. The principle of calling the attention model and the convolution model in the next coding model to encode the basic feature output by the previous coding model based on the global information and the local information to obtain the basic feature and the connection feature output by the next coding model is the same as the principle of calling the attention model and the convolution model in the first coding model to encode the image to be processed based on the global information and the local information in step 2021 to obtain the basic feature and the connection feature output by the first coding model, and the description is omitted here. It should be noted that each coding model includes a convolution model and at least one attention model, and the attention models in different coding models may be the same or different, which is not limited in this application.
By repeatedly executing the operation of calling the attention model and the convolution model in the next coding model to encode the basic feature output by the previous coding model based on the global information and the local information, the basic feature and the connection feature output by the second-to-last coding model can be obtained; the basic feature output by the second-to-last coding model provides data support for executing step 2023, and the connection feature output by the second-to-last coding model provides data support for executing step 203.
Illustratively, the size of the base feature of the next coding model output is smaller than the size of the base feature of the last coding model output. That is, in the encoding process, the basic feature whose resolution is reduced is gradually acquired. Illustratively, the size of the base feature of any coding model output is the same as the size of the connection feature of that any coding model output.
Step 2023: call the attention model and the convolution model in the last coding model to encode the basic feature output by the penultimate coding model based on the global information and the local information to obtain the connection feature output by the last coding model, and take the connection feature output by the last coding model as the target coding feature.
After obtaining the basic feature output by the second-to-last coding model based on step 2022, calling an attention model and a convolution model in the last coding model to code the basic feature output by the second-to-last coding model based on the global information and the local information to obtain the connection feature output by the last coding model, and taking the connection feature output by the last coding model as the target coding feature. The implementation principle of this step 2023 is the same as that of step 2021, and it should be noted that, since the basic feature of the last coding model output does not need to be the input of the next coding model, only the connection feature of the last coding model output needs to be obtained in the process of executing step 2023.
It should be noted that, in the above steps 2021 to 2023, what is described is the operation that needs to be performed to obtain the target coding feature when the first coding model, the second-to-last coding model and the last coding model exist in at least one coding model (that is, the number of the at least one coding model is three or more). The embodiment of the present application is not limited thereto, and the number of the at least one coding model may also be one or two.
In an exemplary embodiment, for the case that the number of the at least one coding model is one, the attention model and the convolution model are called to encode the image to be processed based on the global information and the local information, and the target coding feature is obtained as follows: call the attention model and the convolution model in the coding model to encode the image to be processed based on the global information and the local information to obtain the connection feature output by the coding model, and take the connection feature output by the coding model as the target coding feature.
In an exemplary embodiment, for the case that the number of the at least one coding model is two, the attention model and the convolution model are called to encode the image to be processed based on the global information and the local information, and the target coding feature is obtained as follows: call the attention model and the convolution model in the first coding model to encode the image to be processed based on the global information and the local information to obtain the basic feature and the connection feature output by the first coding model; then call the attention model and the convolution model in the second coding model to encode the basic feature output by the first coding model based on the global information and the local information to obtain the connection feature output by the second coding model, and take the connection feature output by the second coding model as the target coding feature. The connection feature output by the first coding model is used to provide data support for performing step 203.
In step 203, a decoding model is called to decode the target coding features, so as to obtain target image features.
The target image feature is the feature directly according to which the segmentation result of the image to be processed is obtained, and is the feature, obtained in the process of calling the image processing model to process the image to be processed, that best represents the essence of the image to be processed. The target image feature is acquired by decoding the target coding feature; since the target coding feature is obtained by comprehensively focusing on the global information and the local information, the target image feature can represent the image to be processed more comprehensively.
In an exemplary embodiment, the number of decoding models is the same as the number of coding models; there is at least one decoding model, and the at least one decoding model is cascaded, that is, the decoding of the target coding feature is realized by utilizing the cascaded at least one decoding model. The at least one decoding model and the at least one coding model form a U-shaped structure, and feature aggregation can be realized at different resolution levels through skip connections between corresponding layers (that is, the decoding process of the decoding models utilizes the connection features output by the coding models).
For example, for the case where a first coding model, a second-to-last coding model, and a last coding model are present in the at least one coding model, a first decoding model, a second-to-last decoding model, and a last decoding model are also present in the at least one decoding model. In this case, the process of calling the decoding model to decode the target coding feature to obtain the target image feature includes the following steps 2031 to 2033:
Step 2031: call the first decoding model to decode the target coding feature to obtain the decoding feature output by the first decoding model.
The process of calling the first decoding model to decode the target coding feature is an internal processing process of the first decoding model, is related to the model structure of the first decoding model, and has different modes of calling the first decoding model to decode the target coding feature under different structures.
In an exemplary embodiment, the first decoding model includes an upsampling layer, a convolutional layer, and an activation layer. The embodiment of the present application does not limit the type of the upsampling layer, the size of the convolution kernel of the convolutional layer, or the type of the activation function used by the activation layer in the first decoding model. Illustratively, the upsampling layer in the first decoding model is a 2 × 2 bilinear interpolation layer, the size of the convolution kernel of the convolutional layer in the first decoding model is 3 × 3, and the activation function used by the activation layer in the first decoding model is a ReLU function.
The target coding feature is input into the first decoding model and processed sequentially by the upsampling layer, the convolutional layer, and the activation layer to obtain the decoding feature output by the first decoding model. In an exemplary embodiment, the size of the decoding feature output by the first decoding model is larger than the size of the target coding feature input into the first decoding model, so that a feature of greater resolution is obtained by decoding.
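A minimal sketch of such a decoding model is given below, assuming the exemplary choices just listed (2 × 2 bilinear upsampling, a 3 × 3 convolution, and a ReLU activation); the channel counts and input shape are placeholders.

```python
# Minimal sketch (assumed channel counts): a decoding model built from an upsampling
# layer, a convolutional layer, and an activation layer.
import torch
import torch.nn as nn


class DecodingModel(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.up(x)))


target_coding = torch.randn(1, 256, 14, 14)       # target coding feature (toy shape)
decoded = DecodingModel(256, 128)(target_coding)  # decoding feature with doubled resolution
print(decoded.shape)                              # torch.Size([1, 128, 28, 28])
```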
Step 2032: starting from the second decoding model, calling the next decoding model to decode the splicing characteristics corresponding to the next decoding model to obtain the decoding characteristics output by the next decoding model until the decoding characteristics output by the penultimate decoding model are obtained; and the splicing characteristic corresponding to the next decoding model is the splicing characteristic of the decoding characteristic output by the last decoding model and the connection characteristic output by the coding model corresponding to the next decoding model.
After the decoding characteristics output by the previous decoding model are obtained, the decoding characteristics output by the previous decoding model and the connection characteristics output by the coding model corresponding to the next decoding model are spliced to obtain the splicing characteristics corresponding to the next decoding model. The coding model corresponding to the next decoding model is a coding model of which the position distance between the arrangement position of the at least one coding model and the arrangement position of the last coding model is a reference position distance, and the reference position distance is the position distance between the arrangement position of the next decoding model in the at least one decoding model and the arrangement position of the first decoding model. That is, it is assumed that the next decoding model is the kth decoding model, where K is an integer not less than 2 and not greater than (N-1), N is the total number of decoding models (i.e., the total number of coding models), and the coding model whose position distance from the arrangement position where the last coding model is located is the reference position distance is the (N-K +1) th coding model, that is, the coding model corresponding to the kth decoding model is the (N-K +1) th coding model. For example, if K is 2 and N is 5, the coding model corresponding to the 2 nd decoding model is the 4 th coding model.
Since K is an integer not less than 2 and not greater than (N-1), the coding model corresponding to the next decoding model is any one of the (N-1)-th to 2nd coding models. As can be seen from the foregoing steps 2021 to 2023, the connection features output by each of the 2nd coding model to the second-to-last coding model (i.e., the (N-1)-th coding model) are obtained in the encoding process. Therefore, the connection feature output by the coding model corresponding to the next decoding model can be directly acquired, and the decoding feature output by the previous decoding model and the connection feature output by the coding model corresponding to the next decoding model are then spliced to obtain the splicing feature corresponding to the next decoding model.
In an exemplary embodiment, the decoding features output by the previous decoding model and the connection features output by the coding model corresponding to the next decoding model have the same size, and the manner of splicing the decoding features output by the previous decoding model and the connection features output by the coding model corresponding to the next decoding model is as follows: and connecting the decoding characteristics output by the previous decoding model and the connection characteristics output by the coding model corresponding to the next decoding model in parallel in the channel dimension.
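To make the correspondence and the splicing concrete, the short sketch below computes the index of the coding model corresponding to the K-th decoding model and splices the two features along the channel dimension; the shapes are assumed for illustration.

```python
# Minimal sketch: the K-th decoding model (2 <= K <= N-1) uses the connection feature
# of the (N - K + 1)-th coding model; splicing is channel-dimension concatenation.
import torch


def corresponding_coding_index(k: int, n: int) -> int:
    return n - k + 1


print(corresponding_coding_index(2, 5))            # 4, matching the example above

decoding_feature = torch.randn(1, 128, 28, 28)     # output of the previous decoding model
connection_feature = torch.randn(1, 64, 28, 28)    # output of the corresponding coding model
stitched = torch.cat([decoding_feature, connection_feature], dim=1)   # [1, 192, 28, 28]
print(stitched.shape)
```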
The principle of calling the next decoding model to decode the splicing feature corresponding to the next decoding model to obtain the decoding feature output by the next decoding model is the same as the principle of calling the first decoding model to decode the target coding feature to obtain the decoding feature output by the first decoding model in step 2031, and details are not repeated here. By repeatedly executing the operation of calling the next decoding model to decode the splicing feature corresponding to the next decoding model, the decoding feature output by the penultimate decoding model can be obtained, and then step 2033 is executed.
Step 2033: call the last decoding model to decode the decoding feature output by the penultimate decoding model to obtain the target image feature.
After the decoding feature output by the penultimate decoding model is obtained, the last decoding model is called to decode the decoding feature output by the penultimate decoding model to obtain the target image feature. The principle of this process is the same as that of step 2031, and is not described here again.
Illustratively, the number of the at least one coding model and the number of the at least one decoding model are three, the attention models in the first coding model and the second coding model each include a first attention model, the attention model in the third coding model includes a first attention model and a second attention model, and the processing branch of the first attention model in the coding models and the processing branch of the convolution model are in a parallel relationship. In this case, the process of acquiring the target image feature is shown in fig. 6.
And fusing the coding features obtained by calling the convolution model 601 in the first coding model to code the image to be processed 600 based on the local information and the coding features obtained by calling the first attention model 602 in the first coding model to code the image to be processed 600 based on the global information to obtain the basic features 603 output by the first coding model, and taking the basic features 603 output by the first coding model as the connection features 604 output by the first coding model.
And fusing the coding features obtained by calling a convolution model 605 in the second coding model to code the basic features 603 output by the first coding model based on local information and the coding features obtained by calling a first attention model 606 in the second coding model to code the basic features 603 output by the first coding model based on global information to obtain the basic features 607 output by the second coding model, and taking the basic features 607 output by the second coding model as the connection features 608 output by the second coding model.
Fusing the coding characteristics obtained by calling a convolution model 609 in the third coding model to code the basic characteristics 607 output by the second coding model based on local information and the coding characteristics obtained by calling a first attention model 610 in the third coding model to code the basic characteristics 607 output by the second coding model based on global information to obtain the basic characteristics 611 output by the third coding model; and calling a second attention model 612 in the third coding model to encode the basic features 611 output by the third coding model based on the global information to obtain the connection features 613 output by the third coding model, wherein the connection features 613 output by the third coding model are used as target coding features.
Calling the first decoding model (not shown in the figure) to decode the connection feature 613 (i.e. the target coding feature) output by the third coding model, so as to obtain the decoding feature 614 output by the first decoding model. A second decoding model (not shown) is called to decode the concatenated features of the decoding features 614 output by the first decoding model and the concatenated features 608 output by the second coding model, resulting in the decoding features 615 output by the second decoding model. And calling a third decoding model (not shown in the figure) to decode the splicing characteristic of the decoding characteristic 615 output by the second decoding model and the connecting characteristic 604 output by the first coding model to obtain a decoding characteristic 616 output by the third decoding model, and taking the decoding characteristic 616 output by the third decoding model as a target image characteristic.
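Purely as a structural sketch of the data flow walked through above (and not the disclosed model), the code below wires three coding models with parallel branches and three decoding models with skip connections. Every module is a simplified stand-in with assumed channel counts; in particular, the attention branch and the second attention model are replaced by ordinary convolutions so that the example stays short and runnable.

```python
# Minimal structural sketch (all modules simplified/assumed): three hybrid coding
# models with parallel convolution/attention branches, followed by three decoding
# models with skip connections, mirroring the walkthrough above.
import torch
import torch.nn as nn


class HybridCodingModel(nn.Module):
    """Parallel conv branch + attention branch; their fusion is the basic feature."""
    def __init__(self, in_ch, out_ch, with_second_attention=False):
        super().__init__()
        self.conv_branch = nn.Sequential(                      # local-information branch
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.attn_branch = nn.Sequential(                      # stand-in for the first attention model
            nn.Conv2d(in_ch, out_ch, 1), nn.AvgPool2d(2))
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.second_attention = (                              # stand-in for the second attention model
            nn.Conv2d(out_ch, out_ch, 1) if with_second_attention else nn.Identity())

    def forward(self, x):
        basic = self.fuse(torch.cat([self.conv_branch(x), self.attn_branch(x)], dim=1))
        connection = self.second_attention(basic)
        return basic, connection


class DecodingModel(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.conv(self.up(x))


enc1, enc2, enc3 = HybridCodingModel(3, 32), HybridCodingModel(32, 64), HybridCodingModel(64, 128, True)
dec1, dec2, dec3 = DecodingModel(128, 64), DecodingModel(64 + 64, 32), DecodingModel(32 + 32, 16)

x = torch.randn(1, 3, 224, 224)                    # image to be processed
b1, c1 = enc1(x)                                   # basic / connection features of coding model 1
b2, c2 = enc2(b1)
_, target_coding = enc3(b2)                        # connection feature of the last coding model
d1 = dec1(target_coding)                           # decode the target coding feature
d2 = dec2(torch.cat([d1, c2], dim=1))              # splice with connection feature 2
target_image_feature = dec3(torch.cat([d2, c1], dim=1))
print(target_image_feature.shape)                  # torch.Size([1, 16, 224, 224])
```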
It should be noted that, what is introduced in steps 2031 to 2033 is that, when there are a first coding model, a second-to-last coding model and a last coding model in at least one coding model (that is, the number of at least one decoding model is three or more), the decoding model is invoked to decode the target coding feature, so as to obtain the operation that needs to be executed by the target image feature. The embodiment of the present application is not limited thereto, and the number of the at least one coding model may also be one or two.
In an exemplary embodiment, for a case that the number of the at least one coding model is one, and the number of the at least one decoding model is also one, the decoding model is invoked to decode the target coding feature, and the process of obtaining the target image feature is as follows: and calling a decoding model to decode the target coding features to obtain target image features.
In an exemplary embodiment, for a case that the number of the at least one coding model is two, and the number of the at least one decoding model is also two, the process of calling the decoding model to decode the target coding feature to obtain the target image feature is as follows: calling a first decoding model to decode the target coding features to obtain decoding features output by the first decoding model; and calling a second decoding model to decode the decoding characteristics output by the first decoding model to obtain the decoding characteristics output by the second decoding model, and taking the decoding characteristics output by the second decoding model as target image characteristics.
In an exemplary embodiment, the size of the target image feature is the same as the size of the image to be processed, so that the segmentation result having the same size as the image to be processed can be obtained based on the target image feature.
In step 204, based on the target image feature, a segmentation result of the image to be processed is obtained.
After the target image feature is obtained, the segmentation result of the image to be processed is acquired based on the target image feature. The image to be processed includes a sub-image of the reference object, and the segmentation result of the image to be processed is used for indicating the region in which the sub-image of the reference object is located in the image to be processed. That is, according to the segmentation result of the image to be processed, it can be known which regions in the image to be processed are the regions where the sub-image of the reference object is located.
The form of the segmentation result of the image to be processed is not limited in the embodiment of the application, and exemplarily, the form of the segmentation result of the image to be processed is a two-channel probability map, the probability map of one channel is used for displaying the probability that each pixel belongs to the sub-image of the reference object, and the probability map of the other channel is used for displaying the probability that each pixel does not belong to the sub-image of the reference object. Illustratively, the segmentation result of the image to be processed is in the form of a value pair, one pixel point corresponds to one value pair, and the value pair corresponding to one pixel point includes the position coordinate of the pixel point, the probability that the pixel point belongs to the sub-image of the reference object, and the probability that the pixel point does not belong to the sub-image of the reference object.
In one possible implementation manner, based on the target image feature, the process of obtaining the segmentation result of the image to be processed includes: calling the target convolutional layer to convolve the target image feature to obtain the segmentation result of the image to be processed. The target convolutional layer is used for converting the target image feature into a segmentation result in a specified form. In an exemplary embodiment, the target convolutional layer is a convolutional layer with a convolution kernel size of 1 × 1. The target convolutional layer may be a convolutional layer in the image processing model, or may be a separate convolutional layer outside the image processing model, which is not limited in this embodiment.
In one possible implementation manner, after obtaining the segmentation result of the image to be processed, the method further includes: and transforming the segmentation result of the image to be processed to obtain an image processing result. The segmentation result of the image to be processed is used for indicating the probability that each pixel point belongs to the sub-image of the reference object and the probability that each pixel point does not belong to the sub-image of the reference object, and the process of converting the segmentation result of the image to be processed is a process of determining whether each pixel point belongs to the sub-image of the reference object according to the probability that each pixel point belongs to the sub-image of the reference object and the probability that each pixel point does not belong to the sub-image of the reference object, which are indicated by the segmentation result of the image to be processed. And taking the result of indicating whether each pixel point belongs to the sub-image of the reference object as an image processing result. Illustratively, the image processing result may be in the form of an image or a pair of values. For example, in the case where the image processing result is in the form of a pair of values, the pair of values may be visualized as an image for easy visual observation.
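As an illustration of the target convolutional layer and the subsequent transformation, the sketch below maps a target image feature to a two-channel probability map with a 1 × 1 convolution and then derives a per-pixel decision as the image processing result; the channel count and the channel ordering are assumptions.

```python
# Minimal sketch (assumed channel count): a 1x1 target convolutional layer producing a
# two-channel probability map, then a per-pixel decision as the image processing result.
import torch
import torch.nn as nn

target_image_feature = torch.randn(1, 16, 224, 224)
target_conv = nn.Conv2d(16, 2, kernel_size=1)          # target convolutional layer

logits = target_conv(target_image_feature)             # [1, 2, 224, 224]
probs = torch.softmax(logits, dim=1)                   # channel 0 / 1: not-belonging / belonging (ordering assumed)
mask = probs.argmax(dim=1)                             # 1 where the pixel belongs to the sub-image
print(probs.shape, mask.shape)
```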
According to the technical scheme provided by the embodiment of the application, the attention model and the convolution model are called firstly to obtain the target coding features based on the global information and the local information, and then the target image features are obtained according to the target coding features. The target coding features are obtained based on the global information and the local information, so that the target image features obtained according to the target coding features are obtained by comprehensively focusing on the global information and the local information, the focused information is rich, the reliability of the target image features is high, and the accuracy of the obtained segmentation results is improved.
Based on the implementation environment shown in fig. 1, the embodiment of the present application provides a method for training an image processing model, where the method for training an image processing model is executed by a computer device, and the computer device may be the server 12 or the terminal 11, which is not limited in this embodiment of the present application. As shown in fig. 7, the training method of the image processing model provided in the embodiment of the present application includes the following steps 701 to 705.
In step 701, a sample image, label information of the sample image, and an initial image processing model are obtained, the initial image processing model includes an initial coding model and an initial decoding model, and the initial coding model includes an initial attention model and an initial convolution model.
The sample image refers to an image required for training an initial image processing model, and illustratively includes a sub-image of a reference object. Illustratively, the sample image is the same type of image as the image to be processed in the embodiment shown in fig. 2, so as to ensure the processing effect of the trained image processing model on the image to be processed. It should be noted that the sample image mentioned in the embodiment of the present application refers to a sample image based on which an initial image processing model is trained once, and the number of the sample images may be one or multiple, which is not limited in the embodiment of the present application. Illustratively, the number of sample images is multiple to ensure the training effect of the model.
In an exemplary embodiment, the manner in which the computer device obtains the sample image is: the computer device extracts a sample image from the image library.
In an exemplary embodiment, the manner in which the computer device obtains the sample image is: the computer device takes a training image in a public dataset as a sample image. For example, the computer device takes a training image in the MoNuSeg (Multi-Organ Nuclei Segmentation) dataset as a sample image. The MoNuSeg dataset was obtained by accurately labeling histopathological images of different tumor organs of multiple patients in multiple hospitals, and consists of H&E stained images at 40× magnification downloaded from the TCGA (The Cancer Genome Atlas) archive. The MoNuSeg dataset contains 30 training images and 14 test images, each image being 1000 × 1000 (pixels) in size. The training data covers seven different organs (breast, liver, kidney, prostate, bladder, colon, and stomach) and includes approximately 22000 complete nuclear boundary annotations, and the test data covers seven different organs (kidney, lung, colon, breast, bladder, prostate, and brain) and includes approximately 7000 complete nuclear boundary annotations.
In an exemplary embodiment, the manner in which the computer device obtains the sample image is: the computer device processes an original image collected by an image collection device (such as a microscope, an imported scanner, a domestically produced scanner, etc.) to obtain the sample image. In this case, the original image is extracted from an image library, or uploaded manually, and the like, which is not limited in the embodiment of the present application. The manner of processing the original image includes, but is not limited to, cropping, data enhancement, and the like, which is not limited in this application. Illustratively, the manner in which the original image is processed is related to the computational power of the computer device and the input size required by the segmentation model.
The label information of the sample image is used for providing supervision information for the training process of the model, and the label information of the sample image is used for providing information whether each pixel point in the sample image belongs to a sub-image of a reference object. In an exemplary embodiment, the information of whether each pixel point in the sample image belongs to the sub-image of the reference object can be obtained through manual labeling, and the label information of the sample image refers to a pixel-level label of the sample image.
In an exemplary embodiment, the reference objects in the sample image are small in size, large in number and dense in arrangement, and it is difficult to directly label the reference objects manually to obtain a pixel-level label. In this case, the label information of the sample image includes at least one of a point label, a first auxiliary label, and a second auxiliary label. The training process in this case is a weakly supervised training process. The first auxiliary label and the second auxiliary label are both obtained based on the point label. The point label is determined based on a reference point within the region in which the sub-image of the reference object is located in the sample image. The first auxiliary label and the second auxiliary label can provide more supervisory information than the point label. Next, the acquisition process of the point tag, the acquisition process of the first auxiliary tag, and the acquisition process of the second auxiliary tag are described, respectively.
1. Point label acquisition process
The point label is determined based on a reference point within the region in which the sub-image of the reference object is located in the sample image. Since the reference point is located within the region where the sub-image is located in the sample image, the reference point can provide partial position information of the sub-image. Illustratively, one sample image includes one or more sub-images of the reference object, and each sub-image of the reference object has one reference point within the region where it is located in the sample image, so as to roughly represent the position of that sub-image by using the reference point. Which point within the region where the sub-image of the reference object is located in the sample image is set as the reference point may be determined empirically, specified manually, or flexibly adjusted according to the actual application scenario, which is not limited in the embodiment of the present application. The size and shape of the reference point are likewise not limited in the embodiments of the present application; for example, the reference point is a 1 × 1 (pixel) square point, or the reference point is a 2 × 1 (pixels) rectangular point.
The point label includes a sub-label indicating that a pixel point located on the reference point belongs to a sub-image of the reference object and a sub-label indicating that a pixel point not located on the reference point does not belong to a sub-image of the reference object. The embodiment of the present application does not limit the form of the point label corresponding to the sample image. Illustratively, the form of the point label corresponding to the sample image is a value pair, and the value pair is composed of the position coordinate of the pixel point and the label value corresponding to the pixel point. Illustratively, the dot labels corresponding to the sample image are in the form of an image, the size of the image is the same as that of the sample image, and in the image, the pixel points located on the reference point (i.e., the pixel points of the sub-image belonging to the reference object) and the pixel points not located on the reference point (i.e., the pixel points of the sub-image not belonging to the reference object) are presented in different presentation manners.
In one possible implementation, the point labels corresponding to the sample images are generated by the computer device based on reference points manually marked in the sample images. In another possible implementation manner, the point label corresponding to the sample image is stored in correspondence with the sample image, and the point label corresponding to the sample image can be extracted while the sample image is extracted.
In one possible implementation, the sample image has a reference object label. That is, the sample image is stored in correspondence with the reference object label, and the reference object label can be extracted at the same time as the sample image is extracted. The reference object label is used to indicate the region in the sample image where the sub-image of the reference object is located. Illustratively, the reference object label is obtained according to the boundaries of the sub-images of the reference object manually marked in the sample image, and the reference object label is in the form of value pairs or an image, which is not limited in the embodiments of the present application. Exemplarily, in the case that the sample image has the reference object label, the process of acquiring the point label corresponding to the sample image includes the following steps 7011 to 7014:
step 7011: based on the reference object label, the area in which the sub-image is located in the sample image is determined.
Since the reference object label is used to indicate the region in which the sub-image of the reference object is located in the sample image, the region in which the sub-image of the reference object is located in the sample image can be determined based on the reference object label.
Step 7012: the region center of the region in which the sub-image is located in the sample image is determined.
After determining the region in which the sub-image of the reference object is located in the sample image, the center of the region in which the sub-image of the reference object is located in the sample image is determined. Illustratively, the center of the area in which the sub-image of the reference object is located in the sample image refers to the center of mass of the area in which the sub-image of the reference object is located in the sample image. In an exemplary embodiment, in the case where the sample image includes sub-images of a plurality of reference objects, the regions in which the sub-images of different reference objects are located in the sample image are different, and the center of the region is also different for the different regions.
Step 7013: based on the region center, a reference point is determined.
The position of the center of the area can represent the position of the center where the sub-image of the reference object is located, and after the center of the area is determined, the reference point is determined based on the center of the area. In an exemplary embodiment, the center of the area is directly used as a reference point in the area where the sub-image of the reference object is located in the sample image, and this way, the efficiency of determining the reference point is high.
In an exemplary embodiment, the region center is expanded to obtain the reference point, for example by expanding the region center outward by 3 pixel points, or by 5 pixel points. A reference point obtained by expanding the region center is clearly visible.
Step 7014: and determining a point label corresponding to the sample image based on the reference point.
For example, in the case that the form of the point label corresponding to the sample image is an image, based on the reference point, the way of determining the point label corresponding to the sample image is: and displaying pixel points and other pixel points on the reference point by using different presentation modes to obtain the point label in the image form corresponding to the sample image. The presentation mode is set according to experience, or flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application. Illustratively, the rendering manner is to render color, for example, white is used to render the pixel points located on the reference point, and black is used to render other pixel points. Illustratively, the presenting manner is to present stripes, for example, horizontal stripes are used to present pixel points located on the reference point, and vertical stripes are used to present other pixel points.
For example, for the case that the point labels corresponding to the sample images are value pairs, based on the reference points, the manner of determining the point labels corresponding to the sample images is as follows: and assigning a first label value to the pixel point of which the position coordinate is positioned on the reference point, assigning a second label value to the pixel point of which the position coordinate is not positioned on the reference point, and taking the position coordinate-label value pair of all the pixel points as the point label in the form of the corresponding numerical value pair of the sample image. The first tag value and the second tag value are set empirically or flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application. For example, the first tag value is 1 and the second tag value is 0; alternatively, the first tag value is 0 and the second tag value is 1.
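Illustratively, the following is a minimal sketch, not the exact implementation of the embodiment, of how a point label can be generated from a reference object label in image form following steps 7011 to 7014: the regions of the mask are located, each region center (centroid) is taken as the reference point and expanded by a few pixels, and the exemplary label values 1 and 0 are written out. The function name, the SciPy-based routines, and the 3-pixel expansion are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage as ndi

def point_label_from_object_mask(object_mask, expand_px=3):
    """Sketch of steps 7011-7014: one reference point per region of the
    reference object label, taken as the region centroid (step 7012), expanded
    by a few pixels so it is clearly visible (step 7013), and written into a
    point label with value 1 on the point and 0 elsewhere (step 7014)."""
    regions, num = ndi.label(object_mask)            # step 7011: regions of sub-images
    centers = ndi.center_of_mass(object_mask, regions, range(1, num + 1))
    point_label = np.zeros(object_mask.shape, dtype=np.int8)
    for cy, cx in centers:
        point_label[int(round(cy)), int(round(cx))] = 1
    # Step 7013: expand each reference point (exemplary 3-pixel expansion).
    expanded = ndi.binary_dilation(point_label == 1, iterations=expand_px)
    return expanded.astype(np.int8)                  # step 7014: point label image
```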
The point label corresponding to the sample image can provide partial position information of the sub-image of the reference object. Illustratively, the point labels corresponding to the sample images are called weak labels of the sample images, and the process of performing model training based on the point labels is a process of performing model training based on a weak supervised learning manner.
Exemplarily, the reference object label in the form of an image corresponding to the sample image is as shown in (a) in fig. 8, a pixel point located in an area where the sub-image of the reference object is located is represented by white, and a pixel point located outside the area where the sub-image of the reference object is located is represented by black. Based on the reference object label shown in (a) in fig. 8, a point label in the form of an image as shown in (b) in fig. 8 can be obtained. In fig. 8 (b), the pixel points located on the reference point are represented by white, and the pixel points not located on the reference point are represented by black.
2. Acquisition process of first auxiliary tag
The first auxiliary label is derived based on the point label. In one possible implementation, the process of acquiring the first auxiliary tag based on the point tag includes the following steps 701A to 701C:
step 701A: based on the point labels, a reference point within the area in which the sub-image is located in the sample image is determined.
Since the point label corresponding to the sample image is determined based on the reference point within the area where the sub-image of the reference object is located in the sample image, the reference point within the area where the sub-image is located in the sample image can be determined based on the point label. Illustratively, for the case where the dot labels are in the form of images, the images represent the pixel points located on the reference point in white and the other pixel points in black. In this case, the reference point can be determined from the white area appearing in the image.
Step 701B: a thiessen polygon corresponding to the reference point is generated in the sample image.
A Thiessen polygon (Voronoi polygon) is a continuous polygon formed by the perpendicular bisectors of the straight lines connecting adjacent points. After the reference points are determined, adjacent reference points can be connected, and the Thiessen polygons corresponding to the reference points are then obtained based on the perpendicular bisectors of the lines connecting the adjacent reference points. The Thiessen polygons divide the plane in which the sample image lies into polygonal blocks. In the embodiment of the present application, pixel points located on the Thiessen polygon boundaries are considered to be pixel points that do not belong to a sub-image of the reference object; that is, the Thiessen polygons provide reliable negative samples, which helps prevent densely aggregated sub-images of the reference object from overlapping when they are segmented.
Step 701C: and acquiring a first auxiliary label based on the reference point and the Thiessen polygon, wherein the first auxiliary label comprises a sub-label for indicating that the pixel point positioned on the reference point belongs to the sub-image, a sub-label for indicating that the pixel point positioned on the Thiessen polygon does not belong to the sub-image, and a sub-label for indicating that the pixel point positioned outside the reference point and the Thiessen polygon belongs to the uncertain pixel point.
After the reference point is determined and the Thiessen polygon corresponding to the reference point is generated, a first auxiliary label is obtained based on the reference point and the Thiessen polygon. According to the first auxiliary label, which pixel points in the sample image belong to the sub-image of the reference object, which pixel points do not belong to the sub-image of the reference object, and which pixel points belong to uncertain pixel points can be determined, so that powerful supervision information can be provided for the training process of the model.
The manner in which the first auxiliary label is obtained is related to the form of the first auxiliary label, based on the reference point and the Thiessen polygon. Illustratively, the first auxiliary label is in the form of an image, and then the image which has the same size as the sample image and presents the pixel points located on the reference point, the pixel points located on the thiessen polygon and other pixel points by using different presentation modes is taken as the first auxiliary label. For example, as shown in fig. 9, in the image shown in fig. 9, the pixel points located on the reference point are represented by white, the pixel points located on the thieson polygon are represented by gray, and the pixel points located outside the reference point and the thieson polygon are represented by black. In this case, the sub-label indicating that the pixel point located on the reference point belongs to the sub-image is in the form of an image represented by white; the sub-label used for indicating that the pixel point positioned on the Thiessen polygon does not belong to the sub-image is in the form of an image presented by gray; the sub-labels indicating that the pixel points located outside the reference point and the Thiessen polygon belong to uncertain pixel points are in the form of images rendered in black.
Illustratively, the first auxiliary label is in the form of a value pair, and based on the reference point and the Thiessen polygon, the first auxiliary label is obtained by: and giving a first value to the pixel point positioned on the reference point, giving a second value to the pixel point positioned on the Thiessen polygon, giving a third value to the pixel point positioned outside the reference point and the Thiessen polygon, and taking the position coordinate-label value pair of each pixel point as a first auxiliary label in a numerical value pair form. In this case, the sub-label indicating that the pixel point located on the reference point belongs to the sub-image is in the form of a numerical value pair including a label value as a first value; the sub-label used for indicating that the pixel point positioned on the Thiessen polygon does not belong to the sub-image is in the form of a numerical value pair which comprises a label value as a second value; the sub-label indicating that the pixel points located outside the reference point and the Thiessen polygon belong to the uncertain pixel points is in the form of a numerical value pair including a label value as a third value. The first value, the second value and the third value are set empirically or flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application. For example, the first value is 1, the second value is 0, and the third value is-1.
In an exemplary embodiment, the first auxiliary label acquired based on the reference point and the Thiessen polygon may also be referred to as a point-edge label.
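Illustratively, a minimal sketch of steps 701A to 701C is given below under the assumption that the reference points are given as a binary mask; the discrete Thiessen polygon boundaries are approximated by assigning every pixel to its nearest reference point and marking where adjacent cells meet, and the helper names and the exemplary values 1 / 0 / -1 are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import find_boundaries

def point_edge_label(point_mask):
    """Sketch of the first auxiliary ("point-edge") label in value form:
    1 on reference points, 0 on the discrete Thiessen polygon boundaries,
    -1 for the remaining (uncertain) pixel points."""
    seeds, _ = ndi.label(point_mask)                       # step 701A: reference points
    # Nearest-seed assignment yields a discrete Voronoi partition of the plane.
    _, idx = ndi.distance_transform_edt(seeds == 0, return_indices=True)
    voronoi = seeds[tuple(idx)]
    edges = find_boundaries(voronoi, mode="thick")         # step 701B: polygon edges

    label = np.full(point_mask.shape, -1, dtype=np.int8)   # uncertain pixel points
    label[edges] = 0                                        # negative samples on edges
    label[point_mask.astype(bool)] = 1                      # positive samples on points
    return label                                            # step 701C
```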
3. Acquisition procedure of second auxiliary tag
The second auxiliary label is derived based on the point label. In one possible implementation, the process of acquiring the second auxiliary tag based on the point tag includes the following steps 701a to 701 d:
step 701 a: based on the point labels, a reference point within the area in which the sub-image is located in the sample image is determined.
For the implementation of step 701a, refer to step 701A above, which is not repeated here.
Step 701 b: and acquiring reference characteristics corresponding to all pixel points in the sample image respectively based on the reference points.
The reference feature corresponding to a pixel point is the feature based on which the pixel points are clustered; it is determined based on the reference points and is therefore related to the reference points. The principle of obtaining the reference feature is the same for every pixel point in the sample image, so the way of obtaining a reference feature based on the reference points is described by taking any one of the pixel points (referred to as the first pixel point) as an example.
In one possible implementation manner, based on the reference point, the manner of obtaining the reference feature corresponding to the first pixel point is as follows: taking the distance between the first pixel point and a target reference point as a distance characteristic corresponding to the first pixel point, wherein the target reference point is a reference point closest to the first pixel point; and acquiring a reference characteristic corresponding to the first pixel point based on the distance characteristic corresponding to the first pixel point and the color characteristic of the first pixel point.
By calculating the distance between the first pixel point and each reference point, the target reference point closest to the first pixel point can be determined. The embodiment of the present application does not limit the manner of calculating the distance between two points, and exemplarily, calculates the euclidean distance between two points. And after the target reference point is determined, taking the distance between the first pixel point and the target reference point as the distance characteristic corresponding to the first pixel point.
Illustratively, the sample image is an image stained by a stain, and different pixel points in the sample image may be stained with different colors. For example, in the case where the reference object is a cell nucleus and the stain is an HE stain, pixel points belonging to sub-images of cell nuclei in the sample image are stained blue, while pixel points belonging to the extracellular matrix and cytoplasm are stained pink. Therefore, in the process of obtaining the reference feature corresponding to the first pixel point, the color feature of the first pixel point is considered in addition to the distance feature. Exemplarily, the color feature of the first pixel point refers to the color values of the first pixel point under the color components of the first color space. For example, if the first color space is the RGB color space, the color feature of the first pixel point can be represented as (r_i, g_i, b_i).
The reference feature corresponding to the first pixel point is obtained based on the distance feature corresponding to the first pixel point and the color feature of the first pixel point. Illustratively, if the distance feature corresponding to the first pixel point x_i is denoted d_i and the color feature of x_i is (r_i, g_i, b_i), then the reference feature corresponding to x_i can be denoted as f_{x_i} = (d_i, r_i, g_i, b_i).
The color difference between pixel points belonging to a sub-image of the reference object and pixel points not belonging to a sub-image of the reference object is large. Considering the color features when acquiring the reference features therefore helps, to a certain extent, to divide pixel points of different categories into different clusters. In addition to the color features, distance features are also considered, so as to avoid the adverse effect of uneven staining on the clustering result. Under the premise of similar colors, pixel points belonging to the sub-image of the same reference object should be close enough to the reference point corresponding to that sub-image, whereas pixel points not belonging to a sub-image of the reference object differ more in color and are far enough from the reference point. Taking the distance between a pixel point and its nearest reference point as the distance feature therefore also helps, to a certain extent, to classify pixel points of different categories into different clusters.
By referring to the manner of obtaining the reference features corresponding to the first pixel points, the reference features corresponding to the respective pixel points in the sample image can be obtained.
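Illustratively, a minimal sketch of step 701b is given below, assuming the reference points are given as a binary mask and the sample image is an RGB array; the function name is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage as ndi

def pixel_reference_features(image_rgb, point_mask):
    """Builds f_{x_i} = (d_i, r_i, g_i, b_i) per pixel point: the Euclidean
    distance to the nearest reference point plus the pixel's color feature."""
    # Distance from every pixel point to the nearest reference-point pixel.
    dist = ndi.distance_transform_edt(~point_mask.astype(bool))
    h, w, _ = image_rgb.shape
    feats = np.concatenate(
        [dist.reshape(h, w, 1), image_rgb.astype(np.float32)], axis=-1
    )
    return feats.reshape(-1, 4)   # one (d, r, g, b) row per pixel point
```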
Step 701 c: and clustering the pixel points based on the reference characteristics corresponding to the pixel points respectively to obtain a clustering result, wherein the clustering result comprises a first clustering cluster and a second clustering cluster.
After the reference characteristics corresponding to the pixel points are obtained, clustering is carried out on the pixel points based on the reference characteristics corresponding to the pixel points.
In an exemplary embodiment, before the pixel points are clustered based on their respective reference features, the number K (K is an integer not less than 1) of clusters to be finally obtained is specified, and the clustering of the pixel points is then implemented based on the K-Means (K-average) clustering method. In an exemplary embodiment, the process of clustering the pixel points based on the K-Means clustering method is as follows: for N (N is an integer not less than 1) given pixel points (x_1, x_2, …, x_N), the N pixel points are divided into K clusters (C_1, C_2, …, C_K) according to the reference features f_{x_i} (i = 1, 2, …, N), such that the difference of the reference features among pixel points of the same class is as small as possible and the difference of the reference features among pixel points of different classes is as large as possible. The objective of the K-Means clustering method is to minimize the square error E, which is calculated based on Equation 4:

E = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert f_{x_i} - \mu_j \rVert^2    (Equation 4)

where \mu_j represents the mean vector (centroid) of the cluster C_j, and is calculated as:

\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} f_{x_i}
in the embodiment of the present application, K is an integer not less than 2, so that the pixel points are divided into at least two types of pixel points, namely, pixel points of the sub-image belonging to the reference object and pixel points of the sub-image not belonging to the reference object through clustering, and thus a clustering result including the first clustering cluster and the second clustering cluster is obtained. The first cluster is the cluster which is closest to the reference point and contains the pixel point in each cluster in the clustering result, and the second cluster is the cluster which is farthest from the reference point in each cluster in the clustering result. Based on the mode, the first cluster is considered to be a cluster formed by pixel points of the subimages belonging to the reference object, and the second cluster is considered to be a cluster formed by pixel points of the subimages not belonging to the reference object. Illustratively, the way of calculating the distance between the pixel point contained in the cluster and the reference point is: and taking the average value of the distance characteristics corresponding to each pixel point in the cluster as the distance between the pixel point contained in the cluster and the reference point.
Step 701 d: and acquiring a second auxiliary label based on the clustering result, wherein the second auxiliary label comprises a sub-label for indicating that the pixel points in the first clustering cluster belong to the subimages, a sub-label for indicating that the pixel points in the second clustering cluster do not belong to the subimages, and a sub-label for indicating that the pixel points except the pixel points in the first clustering cluster and the second clustering cluster belong to uncertain pixel points.
The second auxiliary label is obtained based on the clustering result, which pixel points in the sample image belong to the subimage of the reference object, which pixel points do not belong to the subimage of the reference object, and which pixel points belong to uncertain pixel points can be determined according to the second auxiliary label, so that powerful supervision information can be provided for the training process of the model. Exemplarily, a region where a pixel point of the sub-image not belonging to the reference object is located is referred to as a background, and a region where a pixel point belonging to the uncertain pixel point is located is referred to as an uncertain region at a boundary between the region where the sub-image of the reference object is located and the background.
Based on the clustering result, the manner in which the second auxiliary label is obtained is related to the form of the second auxiliary label. Exemplarily, the second auxiliary label is in the form of an image, and then the image which has the same size as the sample image and is presented by using different presentation manners as pixel points in the first cluster, pixel points in the second cluster, and pixel points except for the pixel points in the first cluster and the second cluster is taken as the second auxiliary label.
In an exemplary embodiment, when presenting the pixel points in the first cluster, the pixel points in the second cluster, and the remaining pixel points in different presentation manners, these three groups of pixel points may be presented directly; alternatively, a morphological opening operation may first be applied to each group, and the pixel points obtained after the opening operation are then presented in the different presentation manners.
For example, a second auxiliary label in the form of an image is shown in fig. 10. In fig. 10, white is used to represent the pixel points in the first cluster, gray is used to represent the pixel points in the second cluster, and black is used to represent the pixel points except the pixel points in the first cluster and the second cluster. Based on this, it is considered that the pixel points in the white area in fig. 10 belong to the sub-image of the reference object, the pixel points in the gray area do not belong to the sub-image of the reference object, and the pixel points in the black area belong to the uncertain pixel points. In this case, the sub-label used for indicating that the pixel point in the first cluster belongs to the sub-image is in a form of an image presented by white; the sub-label used for indicating that the pixel points in the second cluster do not belong to the sub-image is in the form of an image presented by gray; the sub-label used for indicating that the pixel points except the pixel points in the first cluster and the second cluster belong to uncertain pixel points is in the form of an image presented by black.
Illustratively, the second auxiliary label is in the form of a value pair, and based on the clustering result, the manner of obtaining the second auxiliary label is as follows: and assigning a fourth value to the pixel points in the first cluster, assigning a fifth value to the pixel points in the second cluster, assigning a sixth value to the pixel points except the pixel points in the first cluster and the second cluster, and taking the position coordinate-label value pair of each pixel point as a second auxiliary label in a numerical value pair form. In this case, the sub-label used for indicating that the pixel point in the first cluster belongs to the sub-image is in a form of a numerical value pair including a label value of a fourth value; the sub-label used for indicating that the pixel points in the second cluster do not belong to the sub-image is in a numerical value pair with a label value of a fifth value; and the sub-label used for indicating that the pixel points except the pixel points in the first cluster and the second cluster belong to the uncertain pixel points is in a numerical value pair with a sixth value as a label value.
The fourth value, the fifth value and the sixth value are set empirically or flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application. Illustratively, the fourth value is the same as the first value (e.g., 1), the fifth value is the same as the second value (e.g., 0), and the sixth value is the same as the third value (e.g., -1).
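Illustratively, the following sketch assembles the second auxiliary label in image form from the clustering result, applying the optional morphological opening mentioned above and the exemplary values 1 / 0 / -1; the function name and the single opening iteration are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage as ndi

def cluster_label_image(labels, first, second, shape, open_iters=1):
    """Step 701d: 1 for pixel points in the first cluster, 0 for pixel points
    in the second cluster, -1 for the remaining (uncertain) pixel points, with
    an optional morphological opening to remove isolated pixels."""
    lbl = labels.reshape(shape)
    fg = ndi.binary_opening(lbl == first, iterations=open_iters)
    bg = ndi.binary_opening(lbl == second, iterations=open_iters)
    out = np.full(shape, -1, dtype=np.int8)
    out[bg] = 0
    out[fg] = 1
    return out
```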
Clustering based on the K-Means clustering method belongs to unsupervised clustering, and a second auxiliary label is obtained according to color feature difference and distance feature difference between pixel points of sub-images belonging to a reference object and pixel points of sub-images not belonging to the reference object in the embodiment of the application. Exemplarily, the second auxiliary label may also be referred to as a cluster label.
The label information of the sample image includes at least one of a point label, a first auxiliary label, or a second auxiliary label, and the label information of the sample image can be acquired according to the point label acquisition process, the first auxiliary label acquisition process, and the second auxiliary label acquisition process.
The initial image processing model is the image processing model to be trained, and is likewise used for processing an image to segment the sub-images of the reference object in the image and obtain a segmentation result of the image. The initial image processing model includes an initial coding model and an initial decoding model, and the initial coding model includes an initial attention model and an initial convolution model. For a description of the initial image processing model, refer to the description of the image processing model in the embodiment shown in fig. 2, which is not repeated here.
In step 702, an initial attention model and an initial convolution model are called to encode the sample image based on the global information and the local information, and the sample encoding characteristics are obtained.
The implementation of this step 702 refers to step 202 in the embodiment shown in fig. 2, and is not described here again.
In step 703, an initial decoding model is called to decode the sample coding features, so as to obtain sample image features.
The implementation of this step 703 refers to step 203 in the embodiment shown in fig. 2, and is not described here again.
In step 704, based on the sample image features, a segmentation result of the sample image is obtained.
The implementation of this step 704 is referred to step 204 in the embodiment shown in fig. 2, and is not described here again.
In step 705, an initial image processing model is trained based on the segmentation result of the sample image and the label information of the sample image, so as to obtain an image processing model.
After the segmentation result of the sample image is obtained, the initial image processing model is trained based on the segmentation result of the sample image and the label information of the sample image, so that a trained image processing model is obtained. In one possible implementation manner, based on the segmentation result of the sample image and the label information of the sample image, the process of training the initial image processing model is as follows: acquiring a target loss function based on the segmentation result of the sample image and the label information of the sample image; and training the initial image processing model by using the target loss function.
The number of sample images is one or more. In the case where the number of sample images is one, the number of segmentation results of the sample image is also one, and the target loss function is directly obtained based on the segmentation result of the sample image and the label information of the sample image. In the case where the number of sample images is plural, the number of segmentation results of the sample images is also plural. In this case, the target loss function is obtained as follows: a sub-loss function is obtained based on the segmentation result of each sample image and the label information of that sample image, and the average of all obtained sub-loss functions is taken as the target loss function. The manner of obtaining one sub-loss function in the case of plural sample images is the same as the manner of obtaining the target loss function in the case of one sample image. The present embodiment takes the case where the number of sample images is one as an example for explanation.
The target loss function is used for reflecting the difference between the segmentation result of the sample image and the label information of the sample image. For the case where the label information of the sample image includes only one label (e.g., a pixel-level label, a point label, a first auxiliary label, a second auxiliary label), the objective loss function is obtained directly based on the segmentation result of the sample image and the one label. Illustratively, a cross entropy loss function between the segmentation result of the sample image and the one label is taken as the target loss function. Illustratively, the cross entropy loss function between the segmentation result of the sample image and the one label is calculated based on equation 5:
L_{ce} = -\sum_{i} \big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big]    (Equation 5)

where L_{ce} represents the cross entropy loss function between the segmentation result of the sample image and the one label, y_i represents the value of the one label at pixel point i, and \hat{y}_i represents the segmentation result of the sample image at pixel point i.
In an exemplary embodiment, the label information of the sample image includes a plurality of labels, and in this case, based on the segmentation result of the sample image and the label information of the sample image, the manner of obtaining the target loss function is as follows: respectively obtaining a segmentation result of the sample image and a cross entropy loss function of each label; and acquiring a target loss function based on the segmentation result of the sample image and the cross entropy loss function of each label.
Exemplarily, if the label information of the sample image includes a point label, a first auxiliary label and a second auxiliary label, a cross entropy loss function of the segmentation result of the sample image and the point label, a cross entropy loss function of the segmentation result of the sample image and the first auxiliary label, and a cross entropy loss function of the segmentation result of the sample image and the second auxiliary label need to be obtained respectively. And then, acquiring a target loss function based on the three cross entropy loss functions, and constraining the training process of the model by using the target loss function.
In an exemplary embodiment, in the process of obtaining a cross entropy loss function between a segmentation result of a sample image and a certain label, the cross entropy loss function is calculated based on valid pixel points in the sample image, and invalid pixel points in the sample image are ignored. Illustratively, the valid pixel points in the sample image refer to pixel points (also referred to as pixel points representing a positive sample) of the sub-image belonging to the reference object and pixel points (also referred to as pixel points representing a negative sample) of the sub-image not belonging to the reference object, which are indicated by the label, and the invalid pixel points in the sample image refer to pixel points belonging to uncertain pixel points, which are indicated by the label. Exemplarily, taking a certain label as the first auxiliary label shown in fig. 9 or the second auxiliary label shown in fig. 10 as an example, the effective pixel points in the sample image are the pixel points in the white area and the gray area in fig. 9 and 10, and the ineffective pixel points in the sample image are the pixel points in the black area in fig. 9 and 10.
The embodiment of the application does not limit the specific implementation manner of obtaining the target loss function based on the segmentation result of the sample image and the cross entropy loss function of each label. Illustratively, a weighted sum of the segmentation results of the sample image and the cross-entropy loss functions of the respective labels is taken as the target loss function. In the process of calculating the weighted sum, weights corresponding to the segmentation result of the sample image and the cross entropy loss function of each label are set according to experience, or are flexibly adjusted according to an application scenario, which is not limited in the embodiment of the present application. For example, if the weights corresponding to the segmentation result of the sample image and the cross entropy loss function of each label are all 1, the target loss function is the sum of the segmentation result of the sample image and the cross entropy loss function of each label.
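Illustratively, a minimal PyTorch sketch of the target loss described above is given below, assuming labels that use 1 / 0 / -1 for positive, negative and uncertain pixel points and equal weights of 1; uncertain pixel points are ignored as invalid pixel points. The function and argument names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def target_loss(logits, labels_list, weights=None):
    """Weighted sum of cross entropy losses between the segmentation result and
    each label (e.g. point label, point-edge label, cluster label); pixel
    points labeled -1 (uncertain) are excluded from every term."""
    weights = weights or [1.0] * len(labels_list)
    total = logits.new_zeros(())
    for w, label in zip(weights, labels_list):
        # ignore_index drops the invalid (uncertain) pixel points from the loss.
        total = total + w * F.cross_entropy(logits, label.long(), ignore_index=-1)
    return total
```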
After the target loss function is obtained, the initial image processing model is trained by using the target loss function, and the image processing model is obtained. In an exemplary embodiment, the process of training the initial image processing model with the objective loss function is an iterative process: reversely updating parameters of each model in the initial image processing model by using the target loss function; judging whether a training process meets a training termination condition every time the parameters are updated; and if the training process meets the training termination condition, stopping the iterative process, and taking the model obtained by training as a well-trained image processing model.
If the training process does not meet the training termination condition, acquiring a new target loss function according to the modes from the step 701 to the step 705, and reversely updating the parameters of the image processing model by using the new target loss function. And repeating the steps until the training process meets the training termination condition to obtain the trained image processing model. It should be noted that, in the process of obtaining a new target loss function according to the manner from step 701 to step 705, the sample image may be changed or not, which is not limited in the embodiment of the present application.
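Illustratively, the iterative training process of steps 701 to 705 can be sketched as follows in PyTorch, reusing the target_loss sketch above and assuming a simple maximum-iteration termination condition; the data loader and model objects are placeholders.

```python
import torch

def train(model, loader, max_steps=10000, lr=1e-4):
    """Reversely updates the parameters of the initial image processing model
    with the target loss until the (assumed) termination condition is met."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:                           # training termination condition
        for image, labels_list in loader:             # sample image + label information
            logits = model(image)                     # steps 702-704: segmentation result
            loss = target_loss(logits, labels_list)   # step 705: target loss function
            optimizer.zero_grad()
            loss.backward()                           # reversely update the parameters
            optimizer.step()
            step += 1
            if step >= max_steps:
                break
    return model
```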
Illustratively, the comparison result of model performance between the image processing model obtained by training based on the method provided by the embodiment of the present application and the image processing model composed of the convolution model and the decoding model in the related art is shown in table 1:
TABLE 1
Image processing model      Accuracy    F1 value
Prior art                   0.8989      0.7473
This application            0.9007      0.9399
As can be seen from table 1, the image processing model trained based on the method provided in the embodiment of the present application can obtain higher accuracy and higher F1 value (an index for measuring model performance) than the image processing model in the related art.
Based on the technical scheme provided by the embodiment of the application, the image processing model comprising the attention model and the convolution model in the coding model can be trained, a foundation is laid for calling the attention model and the convolution model to obtain the target coding features based on the global information and the local information, the target image features can be obtained by comprehensively paying attention to the global information and the local information, the information concerned is rich, the reliability of the target image features is high, and the accuracy of the obtained segmentation results is improved.
Next, an exemplary application of the embodiment of the present application in a practical application scenario is described.
In an exemplary embodiment, the image processing method provided in the embodiment of the present application can be applied to an application scenario in which a tissue pathology image is processed to segment a sub-image of a cell nucleus in the tissue pathology image, where in the application scenario, an image to be processed is the tissue pathology image, and the tissue pathology image is an image obtained by image-capturing a certain area of a field of view in a pathology slide obtained after staining with an HE stain. The sub-image of the reference object included in the image to be processed refers to a sub-image of a cell nucleus stained by the H-stain component in the histopathological image.
Referring to fig. 11, the method of processing a tissue pathology image includes the following steps 1101 to 1104.
In step 1101, a histopathology image to be processed and an image processing model are obtained, the image processing model includes a coding model and a decoding model, the coding model includes an attention model and a convolution model, the attention model is coded based on global information, and the convolution model is coded based on local information.
The image processing model is obtained by training the initial image processing model based on the sample tissue pathological image and the label information of the sample tissue pathological image. Illustratively, the label information of the sample tissue pathology image includes at least one of a point label, a first auxiliary label, and a second auxiliary label.
In step 1102, an attention model and a convolution model are called to encode the histopathology image to be processed based on the global information and the local information to obtain target encoding characteristics.
In step 1103, a decoding model is called to decode the target coding features, so as to obtain target image features.
In step 1104, based on the target image feature, a segmentation result of the histopathological image to be processed is obtained.
The segmentation result of the histopathological image to be processed is used for indicating the area where the sub-image of the cell nucleus in the image to be processed is located.
The implementation of the above steps 1101 to 1104 is shown in fig. 2, and is not described herein again.
The method provided by the embodiment of the application can be applied to automatic analysis of histopathology images, and the characteristics of average size, density, arrangement and the like of cell nuclei can be obtained through subsequent calculation of segmentation results of sub-images of the cell nuclei, so that clinical diagnosis and treatment of cancers of different types, cancer grading, patient risk stratification and the like are realized. In addition, besides the segmentation of the sub-images of the cell nuclei, the method provided by the embodiment of the application can also be applied to the segmentation of the sub-images of the cells, or the segmentation of the sub-images of tissues which are small in size, large in number and dense in arrangement. The principle of segmentation of sub-images of cells and of sub-images of other tissues is the same as that of segmentation of sub-images of nuclei, and will not be described herein again.
Referring to fig. 12, an embodiment of the present application provides an image processing apparatus, including:
the image processing system comprises a first obtaining unit 1201 and a second obtaining unit, wherein the first obtaining unit 1201 is used for obtaining an image to be processed and an image processing model, the image processing model comprises a coding model and a decoding model, the coding model comprises an attention model and a convolution model, the attention model is coded based on global information, and the convolution model is coded based on local information;
a second obtaining unit 1202, configured to invoke an attention model and a convolution model to encode the image to be processed based on the global information and the local information, so as to obtain a target encoding feature;
a third obtaining unit 1203, configured to invoke a decoding model to decode the target coding feature, so as to obtain a target image feature;
a fourth obtaining unit 1204, configured to obtain a segmentation result of the image to be processed based on the target image feature.
In a possible implementation manner, the number of the coding models is at least one, and the second obtaining unit 1202 is configured to invoke an attention model and a convolution model in the first coding model to code the image to be processed based on the global information and the local information, so as to obtain a basic feature and a connection feature output by the first coding model; starting from a second coding model, calling an attention model and a convolution model in a next coding model to code the basic features output by the last coding model based on global information and local information to obtain the basic features and connection features output by the next coding model until the basic features and connection features output by a penultimate coding model are obtained, wherein the connection features output by each coding model from the first coding model to the penultimate coding model are used for providing data support for the step of executing calling a decoding model to decode the target coding features to obtain the target image features; and calling an attention model and a convolution model in the last coding model to code the basic features output by the penultimate coding model based on the global information and the local information to obtain the connection features output by the last coding model, and taking the connection features output by the last coding model as target coding features.
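Illustratively, the chaining of coding models described above can be sketched as follows in a PyTorch style, under the assumption that each coding model returns a (basic feature, connection feature) pair and that the last coding model returns only a connection feature; the module and variable names are illustrative and do not describe the disclosed network structure.

```python
import torch.nn as nn

class EncoderStack(nn.Module):
    """Chains the coding models: the connection features of all but the last
    coding model are kept as data support for the decoding step, and the
    connection feature of the last coding model is the target coding feature."""
    def __init__(self, coding_models):
        super().__init__()
        self.coding_models = nn.ModuleList(coding_models)

    def forward(self, image):
        skips, x = [], image
        for coding_model in self.coding_models[:-1]:
            x, connection = coding_model(x)                 # basic + connection features
            skips.append(connection)
        target_coding_feature = self.coding_models[-1](x)   # last model: connection only
        return target_coding_feature, skips
```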
In a possible implementation manner, the attention model in the first coding model includes a first attention model, and the second obtaining unit 1202 is further configured to invoke a convolution model and the first attention model in the first coding model to code the image to be processed based on the local information and the global information, so as to obtain a basic feature output by the first coding model; and acquiring the connection characteristics output by the first coding model based on the basic characteristics output by the first coding model.
In a possible implementation manner, the second obtaining unit 1202 is further configured to invoke a convolution model in the first coding model to code the image to be processed based on the local information, so as to obtain a first coding feature; calling a first attention model to encode the image to be processed based on the global information to obtain a second encoding characteristic; fusing the first coding feature and the second coding feature to obtain a fused feature; and acquiring the basic characteristics output by the first coding model based on the fusion characteristics.
In a possible implementation manner, the second obtaining unit 1202 is further configured to obtain block features of each image block of the image to be processed, and map the block features of each image block to obtain mapping features of each image block; acquiring the position characteristics of an image block; acquiring reference characteristics of the image to be processed based on the mapping characteristics and the image block position characteristics of each image block; and calling the first attention model to encode the reference features of the image to be processed based on the global information to obtain second encoding features.
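Illustratively, the block-feature mapping and image-block position features described above can be sketched as a patch embedding in PyTorch; the patch size, embedding dimension, number of image blocks and the use of a learned position embedding are illustrative assumptions, and the resulting reference features would then be fed to the first attention model to obtain the second coding features.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits the image to be processed into image blocks, maps each block's
    features, and adds image-block position features to form the reference
    features of the image to be processed."""
    def __init__(self, in_channels=3, embed_dim=256, patch_size=16, num_patches=196):
        super().__init__()
        # A strided convolution both extracts and maps the block features.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, image):
        mapped = self.proj(image)                    # mapping features per image block
        mapped = mapped.flatten(2).transpose(1, 2)   # (batch, num_blocks, embed_dim)
        return mapped + self.pos                     # reference features
```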
In a possible implementation manner, the first attention model includes an attention module and a non-linear processing module, and the second obtaining unit 1202 is further configured to invoke the attention module to process the reference feature, so as to obtain a first intermediate feature; splicing the first intermediate characteristic and the reference characteristic to obtain a characteristic to be processed; calling a nonlinear processing module to process the feature to be processed to obtain a second intermediate feature; and splicing the second intermediate characteristic and the characteristic to be processed to obtain a second coding characteristic.
In a possible implementation manner, the second obtaining unit 1202 is further configured to invoke a convolution model in the first coding model to code the image to be processed based on the local information, so as to obtain a first coding feature; calling a first attention model to encode the first encoding characteristic based on the global information to obtain a third encoding characteristic; and acquiring the basic characteristics output by the first coding model based on the third coding characteristics.
In a possible implementation manner, the second obtaining unit 1202 is further configured to invoke the first attention model to encode the image to be processed based on the global information, so as to obtain a second encoding feature; calling a convolution model in the first coding model to code the second coding feature based on the local information to obtain a fourth coding feature; and acquiring the basic characteristics output by the first coding model based on the fourth coding characteristics.
In a possible implementation manner, the attention model in the first coding model includes a second attention model, and the second obtaining unit 1202 is further configured to call a convolution model in the first coding model to code the image to be processed based on the local information, so as to obtain a basic feature output by the first coding model; and calling a second attention model to encode the basic features output by the first coding model based on the global information to obtain the connection features output by the first coding model.
According to the technical scheme provided by the embodiment of the application, the attention model and the convolution model are called firstly to obtain the target coding features based on the global information and the local information, and then the target image features are obtained according to the target coding features. The target coding features are obtained based on the global information and the local information, so that the target image features obtained according to the target coding features are obtained by comprehensively focusing on the global information and the local information, the focused information is rich, the reliability of the target image features is high, and the accuracy of the obtained segmentation results is improved.
Referring to fig. 13, an embodiment of the present application provides an apparatus for training an image processing model, including:
a first obtaining unit 1301, configured to obtain a sample image, label information of the sample image, and an initial image processing model, where the initial image processing model includes an initial coding model and an initial decoding model, and the initial coding model includes an initial attention model and an initial convolution model;
a second obtaining unit 1302, configured to invoke an initial attention model and an initial convolution model to encode the sample image based on the global information and the local information, so as to obtain a sample encoding characteristic;
a third obtaining unit 1303, configured to invoke the initial decoding model to decode the sample coding features, so as to obtain sample image features;
a fourth obtaining unit 1304, configured to obtain a segmentation result of the sample image based on the sample image feature;
the training unit 1305 is configured to train the initial image processing model based on the segmentation result of the sample image and the label information of the sample image, so as to obtain the image processing model.
In one possible implementation, the sample image includes a sub-image of the reference object, the label information of the sample image includes at least one of a point label, a first auxiliary label, or a second auxiliary label, the first auxiliary label and the second auxiliary label are both derived based on the point label, and the point label is determined based on a reference point within an area where the sub-image is located in the sample image.
Based on the technical scheme provided by the embodiment of the application, the image processing model comprising the attention model and the convolution model in the coding model can be trained, a foundation is laid for calling the attention model and the convolution model to obtain the target coding features based on the global information and the local information, the target image features can be obtained by comprehensively paying attention to the global information and the local information, the information concerned is rich, the reliability of the target image features is high, and the accuracy of the obtained segmentation results is improved.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional units is illustrated, and in practical applications, the above functions may be distributed by different functional units according to needs, that is, the internal structure of the apparatus may be divided into different functional units to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
In an exemplary embodiment, a computer device is also provided, the computer device comprising a processor and a memory, the memory having at least one computer program stored therein. The at least one computer program is loaded and executed by one or more processors to cause the computer apparatus to implement any one of the image processing methods or the training method of the image processing model described above. The computer device may be a server or a terminal, which is not limited in this embodiment of the present application. Next, the structures of the server and the terminal will be described, respectively.
Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the one or more memories 1402 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 1401, so as to enable the server to implement the image processing method or the training method of the image processing model provided in the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
Fig. 15 is a schematic structural diagram of a terminal according to an embodiment of the present application. Illustratively, the terminal may be: a smartphone, a tablet, a laptop, or a desktop computer. A terminal may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
Generally, a terminal includes: a processor 1501 and memory 1502.
Processor 1501 may include one or more processing cores. Processor 1501 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1501 may be integrated with a GPU that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 1501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 1502 may include one or more computer-readable storage media, which may be non-transitory. The memory 1502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1502 is configured to store at least one instruction for execution by the processor 1501 to cause the terminal to implement the image processing method or the training method of the image processing model provided in the method embodiments of the present application.
In some embodiments, the terminal may further include: a peripheral interface 1503 and at least one peripheral. The processor 1501, memory 1502, and peripheral interface 1503 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1503 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1504, a display 1505, a camera assembly 1506, an audio circuit 1507, a positioning assembly 1508, and a power supply 1509.
The peripheral interface 1503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1501 and the memory 1502. The Radio Frequency circuit 1504 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 1504 communicates with communication networks and other communication devices via electromagnetic signals. The display screen 1505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. The camera assembly 1506 is used to capture images or video.
The audio circuitry 1507 may include a microphone and speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1501 for processing or inputting the electric signals to the radio frequency circuit 1504 to realize voice communication. The speaker is used to convert electrical signals from the processor 1501 or the radio frequency circuit 1504 into sound waves. The positioning component 1508 is used to locate the current geographic Location of the terminal to implement navigation or LBS (Location Based Service). A power supply 1509 is used to supply power to the various components in the terminal. The power supply 1509 may be alternating current, direct current, disposable or rechargeable.
In some embodiments, the terminal also includes one or more sensors 1510. The one or more sensors 1510 include, but are not limited to: acceleration sensor 1511, gyro sensor 1512, pressure sensor 1513, fingerprint sensor 1514, optical sensor 1515, and proximity sensor 1516.
The acceleration sensor 1511 may detect the magnitude of acceleration on three coordinate axes of a coordinate system established with the terminal. The gyroscope sensor 1512 can detect the body direction and the rotation angle of the terminal, and the gyroscope sensor 1512 and the acceleration sensor 1511 can cooperate to collect the 3D motion of the user on the terminal. The pressure sensor 1513 may be provided on a side frame of the terminal and/or under the display 1505. When the pressure sensor 1513 is disposed on the side frame of the terminal, the holding signal of the user to the terminal can be detected, and the processor 1501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1513. When the pressure sensor 1513 is disposed at a lower layer of the display screen 1505, the processor 1501 controls the operability control on the UI interface in accordance with the pressure operation of the user on the display screen 1505.
The fingerprint sensor 1514 is configured to capture a fingerprint of the user, and the processor 1501 identifies the user based on the fingerprint captured by the fingerprint sensor 1514, or the fingerprint sensor 1514 identifies the user based on the captured fingerprint. The optical sensor 1515 is used to collect ambient light intensity. A proximity sensor 1516, also known as a distance sensor, is typically provided on the front panel of the terminal. The proximity sensor 1516 is used to collect a distance between the user and the front surface of the terminal.
Those skilled in the art will appreciate that the structure shown in fig. 15 does not constitute a limitation on the terminal, which may include more or fewer components than those shown, combine some components, or adopt a different component arrangement.
In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor of a computer device to cause the computer device to implement any one of the image processing methods or the training methods of the image processing model described above.
In one possible implementation, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any one of the image processing methods or the training method of the image processing model described above.
It should be noted that the terms "first," "second," and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. The implementations described in the above exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed and an image processing model, wherein the image processing model comprises a coding model and a decoding model, the coding model comprises an attention model and a convolution model, the attention model performs coding based on global information, and the convolution model performs coding based on local information;
calling the attention model and the convolution model to encode the image to be processed based on global information and local information to obtain target coding features;
calling the decoding model to decode the target coding features to obtain target image features;
and acquiring a segmentation result of the image to be processed based on the target image features.
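As an illustration accompanying claim 1 (not part of the claims), the following PyTorch-style sketch shows one way an encoder combining an attention model (global information) and a convolution model (local information) could feed a decoder and a segmentation head. All class names, dimensions, the patch size, and the fusion by addition are assumptions of this sketch rather than the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEncoder(nn.Module):
    """Encodes an image with a convolution branch (local information) and a
    self-attention branch (global information); illustrative assumption only."""
    def __init__(self, in_ch=3, dim=64, patch=8):
        super().__init__()
        self.conv = nn.Sequential(                        # convolution model (local)
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.embed = nn.Conv2d(in_ch, dim, patch, stride=patch)  # patch embedding
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.patch = patch

    def forward(self, x):
        local = self.conv(x)                               # B x dim x H x W
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # B x N x dim
        glob, _ = self.attn(tokens, tokens, tokens)        # attention model (global)
        h, w = x.shape[2] // self.patch, x.shape[3] // self.patch
        glob = glob.transpose(1, 2).reshape(x.size(0), -1, h, w)
        glob = F.interpolate(glob, size=local.shape[2:], mode='bilinear',
                             align_corners=False)
        return local + glob                                # target coding features

class SegModel(nn.Module):
    def __init__(self, num_classes=2, dim=64):
        super().__init__()
        self.encoder = HybridEncoder(dim=dim)
        self.decoder = nn.Sequential(                      # decoding model
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(dim, num_classes, 1)         # per-pixel segmentation result

    def forward(self, x):
        feat = self.decoder(self.encoder(x))               # target image features
        return self.head(feat)
```

Under these assumptions, `SegModel()(torch.randn(1, 3, 64, 64))` returns a 1 x 2 x 64 x 64 map of per-pixel class scores.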
2. The method according to claim 1, wherein the number of the coding models is at least one, and the calling the attention model and the convolution model to encode the image to be processed based on global information and local information to obtain target coding features comprises:
calling an attention model and a convolution model in a first coding model to code the image to be processed based on global information and local information to obtain basic features and connection features output by the first coding model;
starting from a second coding model, calling an attention model and a convolution model in a next coding model to code the basic features output by a previous coding model based on global information and local information to obtain the basic features and connection features output by the next coding model until the basic features and connection features output by a penultimate coding model are obtained, wherein the connection features output by each coding model from the first coding model to the penultimate coding model are used for providing data support for executing the step of calling the decoding model to decode the target coding features to obtain target image features;
and calling an attention model and a convolution model in the last coding model to code the basic features output by the penultimate coding model based on global information and local information to obtain the connection features output by the last coding model, and taking the connection features output by the last coding model as the target coding features.
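The cascade in claim 2 resembles a U-Net-style encoder: each coding model hands a basic feature to the next stage and keeps a connection feature for the decoder. The sketch below (an illustration, not part of the claims) shows only that data flow; the down-sampling step, channel widths, and class names are assumptions, and the attention/convolution internals of each stage are omitted.

```python
import torch.nn as nn

class CodingModel(nn.Module):
    """One coding stage: returns a basic feature (passed to the next stage) and a
    connection feature (kept as data support for decoding). Internals are placeholders."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)          # assumed down-sampling between stages

    def forward(self, x):
        connection = self.block(x)           # connection feature for the decoder
        basic = self.down(connection)        # basic feature for the next coding model
        return basic, connection

class CascadedEncoder(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList(
            [CodingModel(channels[i], channels[i + 1]) for i in range(len(channels) - 1)])

    def forward(self, x):
        skips = []
        for stage in self.stages[:-1]:       # first to penultimate coding models
            x, conn = stage(x)
            skips.append(conn)               # connection features kept for decoding
        _, target = self.stages[-1](x)       # last coding model
        return target, skips                 # target coding features + skip features
```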
3. The method according to claim 2, wherein the attention model in the first coding model comprises a first attention model, and the calling the attention model and the convolution model in the first coding model to code the image to be processed based on the global information and the local information to obtain the basic feature and the connection feature output by the first coding model comprises:
calling a convolution model in the first coding model and the first attention model to code the image to be processed based on local information and global information to obtain basic features output by the first coding model;
and acquiring the connection features output by the first coding model based on the basic features output by the first coding model.
4. The method according to claim 3, wherein said invoking a convolution model in the first coding model and the first attention model to code the image to be processed based on local information and global information to obtain a basic feature output by the first coding model comprises:
calling a convolution model in the first coding model to code the image to be processed based on local information to obtain a first coding feature;
calling the first attention model to encode the image to be processed based on global information to obtain a second coding feature;
fusing the first coding feature and the second coding feature to obtain a fused feature;
and acquiring the basic features output by the first coding model based on the fused feature.
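For claim 4, the convolution branch and the attention branch process the same input in parallel and their outputs are fused. The claim does not fix the fusion operator; the sketch below (illustration only) assumes channel concatenation followed by a 1x1 convolution, with element-wise addition being an equally plausible choice.

```python
import torch
import torch.nn as nn

class ParallelFusionBlock(nn.Module):
    """Parallel local (convolution) and global (attention) encoding of one input,
    followed by fusion. Dimensions and the fusion operator are assumptions."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)                    # local branch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global branch
        self.fuse = nn.Conv2d(2 * dim, dim, 1)                           # fusion layer

    def forward(self, x):                                   # x: B x dim x H x W
        first = self.conv(x)                                # first coding feature
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # B x (H*W) x dim
        second, _ = self.attn(tokens, tokens, tokens)       # second coding feature
        second = second.transpose(1, 2).reshape(b, c, h, w)
        fused = self.fuse(torch.cat([first, second], dim=1))  # fused feature
        return fused                                        # basis for the basic feature
```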
5. The method of claim 4, wherein the invoking the first attention model to encode the image to be processed based on global information to obtain a second coding feature comprises:
obtaining the block features of each image block of the image to be processed, and mapping the block features of each image block to obtain the mapping features of each image block;
acquiring position features of the image blocks;
acquiring reference features of the image to be processed based on the mapping features of each image block and the position features of the image blocks;
and calling the first attention model to encode the reference features of the image to be processed based on global information to obtain the second coding feature.
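Claim 5 describes a patch-embedding step of the kind used in vision transformers: block features are extracted per image block, mapped to a fixed dimension, and combined with position features to give the reference features. The sketch below (illustration only) assumes a learned position embedding and fixed image/patch sizes; neither is required by the claim.

```python
import torch
import torch.nn as nn

class PatchReferenceFeatures(nn.Module):
    """Builds reference features from block (patch) features, a linear mapping,
    and position features. Patch size, dimension, and learned positions are assumptions."""
    def __init__(self, in_ch=3, dim=64, patch=16, img_size=224):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.to_blocks = nn.Unfold(kernel_size=patch, stride=patch)   # block features
        self.mapping = nn.Linear(in_ch * patch * patch, dim)          # mapping features
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))     # position features

    def forward(self, x):                                   # x: B x C x H x W
        blocks = self.to_blocks(x).transpose(1, 2)          # B x N x (C*patch*patch)
        mapped = self.mapping(blocks)                       # B x N x dim
        return mapped + self.pos                            # reference features
```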
6. The method of claim 5, wherein the first attention model comprises an attention module and a non-linear processing module, and the invoking the first attention model to encode the reference features of the image to be processed based on global information to obtain the second coding feature comprises:
calling the attention module to process the reference feature to obtain a first intermediate feature;
splicing the first intermediate feature and the reference feature to obtain a feature to be processed;
calling the nonlinear processing module to process the feature to be processed to obtain a second intermediate feature;
and splicing the second intermediate feature and the feature to be processed to obtain the second coding feature.
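The structure in claim 6 is close to a standard transformer block: an attention module, a non-linear processing module, and a combination of each module's output with its input. The claim's "splicing" could denote channel concatenation or residual addition; the sketch below (illustration only) assumes residual addition and adds layer normalization, both of which go beyond what the claim specifies.

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Attention module plus non-linear processing module, each combined with its
    input. Residual addition and layer normalization are assumptions of this sketch."""
    def __init__(self, dim=64, heads=4, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                          # non-linear processing module
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, ref):                                # ref: B x N x dim (reference features)
        normed = self.norm1(ref)
        attn_out, _ = self.attn(normed, normed, normed)    # first intermediate feature
        to_process = ref + attn_out                        # feature to be processed
        mlp_out = self.mlp(self.norm2(to_process))         # second intermediate feature
        return to_process + mlp_out                        # second coding feature
```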
7. The method according to claim 3, wherein said invoking a convolution model in the first coding model and the first attention model to code the image to be processed based on local information and global information to obtain a basic feature output by the first coding model comprises:
calling a convolution model in the first coding model to code the image to be processed based on local information to obtain a first coding feature;
calling the first attention model to encode the first coding feature based on global information to obtain a third coding feature;
and acquiring the basic features output by the first coding model based on the third coding feature.
8. The method according to claim 3, wherein said invoking a convolution model in the first coding model and the first attention model to code the image to be processed based on local information and global information to obtain a basic feature output by the first coding model comprises:
calling the first attention model to encode the image to be processed based on global information to obtain a second coding feature;
calling a convolution model in the first coding model to code the second coding feature based on local information to obtain a fourth coding feature;
and acquiring the basic features output by the first coding model based on the fourth coding feature.
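Claims 7 and 8 differ from claim 4 only in running the two encoders serially rather than in parallel: convolution first and then attention (claim 7), or attention first and then convolution (claim 8). The helper below (illustration only) makes that ordering explicit; it assumes both models accept and return features of the same shape, with any reshaping between feature maps and token sequences left out.

```python
def serial_encode(x, conv_model, attention_model, conv_first=True):
    """Serial combination of a convolution model (local information) and an
    attention model (global information); the flag selects the claim-7 or
    claim-8 ordering. Shape compatibility of the two models is assumed."""
    if conv_first:
        first = conv_model(x)              # first coding feature (claim 7)
        return attention_model(first)      # third coding feature
    second = attention_model(x)            # second coding feature (claim 8)
    return conv_model(second)              # fourth coding feature
```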
9. The method of claim 2, wherein the attention model in the first coding model comprises a second attention model, and the calling the attention model and the convolution model in the first coding model to encode the image to be processed based on the global information and the local information to obtain the basic feature and the connection feature output by the first coding model comprises:
calling a convolution model in the first coding model to code the image to be processed based on local information to obtain basic features output by the first coding model;
and calling the second attention model to encode the basic features output by the first coding model based on global information to obtain the connection features output by the first coding model.
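In the claim 9 variant, the convolution model alone produces the basic feature, and a second attention model re-encodes that basic feature globally to produce the connection feature handed to the decoder. The sketch below (illustration only) assumes particular shapes and dimensions; only the split of roles between the two models follows the claim.

```python
import torch.nn as nn

class ConvThenAttentionStage(nn.Module):
    """Convolution model yields the basic feature; a second attention model encodes
    it globally to yield the connection feature. Dimensions are assumptions."""
    def __init__(self, in_ch=3, dim=64, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU())               # convolution model
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # second attention model

    def forward(self, x):
        basic = self.conv(x)                                # basic feature (local information)
        b, c, h, w = basic.shape
        tokens = basic.flatten(2).transpose(1, 2)           # B x (H*W) x dim
        conn, _ = self.attn(tokens, tokens, tokens)         # global re-encoding
        connection = conn.transpose(1, 2).reshape(b, c, h, w)
        return basic, connection                            # basic + connection features
```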
10. A method of training an image processing model, the method comprising:
acquiring a sample image, label information of the sample image and an initial image processing model, wherein the initial image processing model comprises an initial coding model and an initial decoding model, and the initial coding model comprises an initial attention model and an initial convolution model;
calling the initial attention model and the initial convolution model to encode the sample image based on global information and local information to obtain sample coding features;
calling the initial decoding model to decode the sample coding features to obtain sample image features;
obtaining a segmentation result of the sample image based on the sample image features;
and training the initial image processing model based on the segmentation result of the sample image and the label information of the sample image to obtain an image processing model.
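Claim 10 is a conventional supervised training loop over sample images and their label information. The sketch below (illustration only) assumes an Adam optimizer, a cross-entropy segmentation loss, and label information given as per-pixel class indices; the claim itself only requires training on the segmentation result and the label information.

```python
import torch
import torch.nn as nn

def train_image_processing_model(model, loader, epochs=10, lr=1e-4):
    """Trains an initial image processing model on (sample image, label information)
    pairs. Optimizer, loss, and label format are assumptions of this sketch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()             # assumed segmentation loss
    model.train()
    for _ in range(epochs):
        for sample_image, label_info in loader:   # label_info: B x H x W class indices
            logits = model(sample_image)          # encode, decode, segment
            loss = criterion(logits, label_info)  # segmentation result vs. label information
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                                  # trained image processing model
```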
11. The method of claim 10, wherein the sample image comprises a sub-image of a reference object, wherein the label information of the sample image comprises at least one of a point label, a first auxiliary label, or a second auxiliary label, wherein the first auxiliary label and the second auxiliary label are derived based on the point label, and wherein the point label is determined based on a reference point within a region in which the sub-image is located in the sample image.
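Claim 11 only fixes what a point label is: a single annotated reference point inside the region where the sub-image of the reference object lies. The sketch below (illustration only) builds such a point label as a sparse mask; how the first and second auxiliary labels are derived from it is defined in the patent description and is not reproduced here.

```python
import numpy as np

def point_label_from_reference_point(shape, point):
    """Builds a point label: a mask that marks only the reference point chosen
    inside the object region. Auxiliary-label derivation is intentionally omitted."""
    h, w = shape
    y, x = point                        # reference point within the object region
    label = np.zeros((h, w), dtype=np.int64)
    label[y, x] = 1                     # single annotated pixel = point label
    return label
```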
12. An image processing apparatus, characterized in that the apparatus comprises:
a first obtaining unit, configured to obtain an image to be processed and an image processing model, wherein the image processing model comprises a coding model and a decoding model, the coding model comprises an attention model and a convolution model, the attention model performs coding based on global information, and the convolution model performs coding based on local information;
a second obtaining unit, configured to call the attention model and the convolution model to encode the image to be processed based on global information and local information to obtain target coding features;
a third obtaining unit, configured to call the decoding model to decode the target coding features to obtain target image features;
and a fourth obtaining unit, configured to obtain a segmentation result of the image to be processed based on the target image features.
13. An apparatus for training an image processing model, the apparatus comprising:
a first obtaining unit, configured to obtain a sample image, label information of the sample image, and an initial image processing model, where the initial image processing model includes an initial coding model and an initial decoding model, and the initial coding model includes an initial attention model and an initial convolution model;
a second obtaining unit, configured to call the initial attention model and the initial convolution model to encode the sample image based on global information and local information to obtain sample coding features;
a third obtaining unit, configured to call the initial decoding model to decode the sample coding features to obtain sample image features;
a fourth obtaining unit, configured to obtain a segmentation result of the sample image based on the sample image features;
and a training unit, configured to train the initial image processing model based on the segmentation result of the sample image and the label information of the sample image to obtain the image processing model.
14. A computer device, characterized in that the computer device comprises a processor and a memory, in which at least one computer program is stored, which is loaded and executed by the processor, to cause the computer device to implement the image processing method according to any one of claims 1 to 9, or the training method of the image processing model according to any one of claims 10 to 11.
15. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor, to cause a computer to implement the image processing method according to any one of claims 1 to 9, or the training method of an image processing model according to any one of claims 10 to 11.
CN202110951040.4A 2021-08-18 2021-08-18 Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium Pending CN114283152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951040.4A CN114283152A (en) 2021-08-18 2021-08-18 Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110951040.4A CN114283152A (en) 2021-08-18 2021-08-18 Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium

Publications (1)

Publication Number Publication Date
CN114283152A true CN114283152A (en) 2022-04-05

Family

ID=80868469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951040.4A Pending CN114283152A (en) 2021-08-18 2021-08-18 Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium

Country Status (1)

Country Link
CN (1) CN114283152A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082490A (en) * 2022-08-23 2022-09-20 腾讯科技(深圳)有限公司 Anomaly prediction method, and training method, device and equipment of anomaly prediction model
CN115082490B (en) * 2022-08-23 2022-11-15 腾讯科技(深圳)有限公司 Abnormity prediction method, and abnormity prediction model training method, device and equipment
CN115311504A (en) * 2022-10-10 2022-11-08 之江实验室 Weak supervision positioning method and device based on attention repositioning
CN115311504B (en) * 2022-10-10 2023-01-31 之江实验室 Weak supervision positioning method and device based on attention relocation

Similar Documents

Publication Publication Date Title
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
WO2020133636A1 (en) Method and system for intelligent envelope detection and warning in prostate surgery
CN111091166B (en) Image processing model training method, image processing device, and storage medium
CN111932529B (en) Image classification and segmentation method, device and system
CN107301643B (en) Well-marked target detection method based on robust rarefaction representation Yu Laplce's regular terms
CN114283152A (en) Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium
CN112598780A (en) Instance object model construction method and device, readable medium and electronic equipment
CN108876716A (en) Super resolution ratio reconstruction method and device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN116630514A (en) Image processing method, device, computer readable storage medium and electronic equipment
CN112037305B (en) Method, device and storage medium for reconstructing tree-like organization in image
CN113781387A (en) Model training method, image processing method, device, equipment and storage medium
CN113706562B (en) Image segmentation method, device and system and cell segmentation method
Liu et al. SiSL-Net: Saliency-guided self-supervised learning network for image classification
CN113822282A (en) Image semantic segmentation method and device, computer equipment and storage medium
CN111598904B (en) Image segmentation method, device, equipment and storage medium
CN111369564B (en) Image processing method, model training method and model training device
CN117036658A (en) Image processing method and related equipment
CN116029912A (en) Training of image processing model, image processing method, device, equipment and medium
CN113723164A (en) Method, device and equipment for acquiring edge difference information and storage medium
CN113673567A (en) Panorama emotion recognition method and system based on multi-angle subregion self-adaption
Li et al. V-ShadowGAN: generative adversarial networks for removing and generating shadows associated with vehicles based on unpaired data
CN110991229A (en) Three-dimensional object identification method based on DSP chip and quantitative model
CN113822903A (en) Segmentation model training method, image processing method, device, equipment and medium
CN117094895B (en) Image panorama stitching method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination