CN116109655A - Image encoder processing method and device and image segmentation method

Image encoder processing method and device and image segmentation method

Info

Publication number
CN116109655A
Authority
CN
China
Prior art keywords
image
feature
image feature
encoder
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310089066.1A
Other languages
Chinese (zh)
Other versions
CN116109655B (en)
Inventor
Jiang Yankai
Sun Mingze
Guo Heng
Bai Xiaoyu
Yan Ke
Xu Minfeng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202310089066.1A
Publication of CN116109655A
Application granted
Publication of CN116109655B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • G16H30/40ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10081Computed x-ray tomography [CT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30048Heart; Cardiac
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30056Liver; Hepatic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30061Lung

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the specification provides an image encoder processing method, an image encoder processing device and an image segmentation method. The image encoder processing method mines a first clipping image and a second clipping image with the same semantics from different first and second initial images, generating two segmented images that describe the same part, so that class-specific invariance is modeled by using the structural similarity of different target objects of the same type in the process of obtaining the target image encoder; and the global semantics of the first mask image and the second mask image are supplemented according to the first enhanced image and the second enhanced image, so that more accurate feature representations are learned by way of contrastive views, and downstream task processing can then be performed based on these more accurate feature representations.

Description

Image encoder processing method and device and image segmentation method
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to an image encoder processing method and device and an image segmentation method.
Background
With the development of artificial intelligence technology, more and more artificial intelligence algorithms are used in intelligent reading of medical images, helping doctors find potential lesions more quickly and assisting them in reading images, which can greatly reduce their burden. However, labeling data in the field of medical image segmentation is very time-consuming and labor-intensive, requires specialized medical professionals, and is therefore costly; thus, the lack of high-quality annotation data has been a troublesome challenge in the field of medical image segmentation since the advent of deep learning, especially in 3D (three-dimensional) tasks.
Faced with a large amount of unlabeled medical image data, how to use that data for deep learning and obtain accurate feature representations of it, so that subsequent tasks such as segmenting human tissue data in the medical image data can be performed accurately, has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, the embodiments of the present specification provide an image encoder processing method. One or more embodiments of the present specification also relate to an image encoder processing apparatus, an image segmentation method, an image segmentation apparatus, a computing device, a computer-readable storage medium, and a computer program, so as to solve the technical defects existing in the prior art.
According to a first aspect of embodiments of the present specification, there is provided an image encoder processing method, including:
determining a first clipping image and a second clipping image with the same semantics from a first initial image and a second initial image, wherein the first initial image and the second initial image are images with the same type and containing different target objects;
Performing mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image;
performing data enhancement on the first clipping image and the second clipping image to obtain a first enhancement image and a second enhancement image;
and adjusting network parameters of a first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder.
According to a second aspect of embodiments of the present specification, there is provided an image encoder processing apparatus comprising:
a segmented image determination module configured to determine a first cropped image and a second cropped image that have the same semantics from a first initial image and a second initial image, wherein the first initial image and the second initial image are images of the same type that contain different target objects;
the mask processing module is configured to perform mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image;
the data enhancement module is configured to perform data enhancement on the first clipping image and the second clipping image to obtain a first enhancement image and a second enhancement image;
And the encoder determining module is configured to adjust network parameters of the first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder.
According to a third aspect of embodiments of the present specification, there is provided an image segmentation method including:
determining a three-dimensional image of a target type of a target object, and inputting the three-dimensional image into a target image encoder to obtain coded image features corresponding to the three-dimensional image, wherein the target image encoder is the target image encoder in the image encoder processing method;
and inputting the characteristic of the coded image into an image segmentation model to obtain a segmented image of the target part.
According to a fourth aspect of embodiments of the present specification, there is provided an image segmentation apparatus comprising:
the characteristic obtaining module is configured to determine a three-dimensional image of a target type of a target object, and input the three-dimensional image into a target image encoder to obtain an encoded image characteristic corresponding to the three-dimensional image, wherein the target image encoder is a target image encoder in the image encoder processing method;
And the image segmentation module is configured to input the coded image features into an image segmentation model to obtain a segmented image of the target part.
According to a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer executable instructions that, when executed by the processor, implement the steps of the image encoder processing method or the image segmentation method described above.
According to a sixth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above-described image encoder processing method or image segmentation method.
According to a seventh aspect of the embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described image encoder processing method or image segmentation method.
An embodiment of the present disclosure provides an image encoder processing method, including determining a first cropped image and a second cropped image with the same semantics from a first initial image and a second initial image, where the first initial image and the second initial image are images of the same type including different target objects; performing mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image; performing data enhancement on the first clipping image and the second clipping image to obtain a first enhancement image and a second enhancement image; and adjusting network parameters of a first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder.
Specifically, the image encoder processing method first mines a first clipping image and a second clipping image with the same semantics from different first and second initial images, generating two segmented images that describe the same part, so that class-specific invariance is modeled by using the structural similarity of different target objects of the same type in the process of obtaining the target image encoder; and the global semantics of the first mask image and the second mask image are supplemented according to the first enhanced image and the second enhanced image, so that more accurate feature representations are learned by way of contrastive views, and downstream task processing can then be performed based on these more accurate feature representations.
Drawings
Fig. 1 is a schematic view of a specific implementation scenario of an image encoder processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of an image encoder processing method provided in one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of image processing of a segmented image in an image encoder processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a specific processing procedure of an image encoder processing method according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of an image segmentation method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an image encoder processing device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural view of an image segmentation apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many other forms than described herein and similarly generalized by those skilled in the art to whom this disclosure pertains without departing from the spirit of the disclosure and, therefore, this disclosure is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same kind from one another. For example, without departing from the scope of one or more embodiments of the present specification, a first may also be referred to as a second, and similarly, a second may also be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, terms related to one or more embodiments of the present specification will be explained.
SSL: self-Supervised Learning, self-supervised learning.
MIM: masked Image Modeling mask image modeling is a commonly used self-supervising model.
SAM Self-supervised learning of pixel-wise anatomical embeddings in radiological images, a Self-monitoring based medical image registration framework.
CL: contrastive Learning, comparative learning.
Since the advent of deep learning, the lack of high-quality annotation data has been a troublesome challenge for medical image segmentation, particularly in 3D tasks. Recent research results based on self-supervised learning (SSL) demonstrate that strong visual representations can be obtained in an unsupervised manner. However, current SSL-based methods still follow self-supervised paradigms designed for general computer vision scenes, which may be less suitable or even unreasonable when applied to medical images. In order to learn a representation that serves downstream tasks well (such as organ segmentation), an SSL method for medical images should align features of the same anatomical structure. However, some existing approaches maximize the similarity between positive examples (randomly cropped views from the same volume) while strengthening the dissimilarity between negative examples (views from different volumes).
In three-dimensional medical segmentation tasks, common Computed Tomography (CT) and Magnetic Resonance (MR) images present human anatomy with intrinsic structure. Randomly cropped views from the same CT (so-called positive examples) may describe completely different anatomical information, while views from different CTs (negative examples) may share content presenting the same object (organ) type; that is, existing SSL methods for medical images ignore the inherent similarity of anatomical structures across different volumes. Body organs have an intrinsic structure, so their appearance and layout have an intrinsic consistency in CT. Consequently, views from different volumes may contain consistent spatial information and objects of the same type due to the inherent structure of the human anatomy. Simply treating these views as negative examples ignores semantically consistent anatomical features across different CTs and forces false constraints of instance invariance.
Based on this, in the present specification, an image encoder processing method is provided. One or more embodiments of the present specification relate to an image encoder processing apparatus, an image segmentation method, an image segmentation apparatus, a computing device, a computer-readable storage medium, and a computer program, which are described in detail in the following embodiments one by one.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a specific implementation scenario of an image encoder processing method according to an embodiment of the present disclosure.
Fig. 1 includes a cloud-side device 102 and an end-side device 104, where the cloud-side device 102 may be understood as a cloud server, and of course, in another implementation, the cloud-side device 102 may be replaced by a physical server; the end side devices 104 include, but are not limited to, desktop computers, notebook computers, and the like; for easy understanding, in the embodiments of the present disclosure, the processing method of the image encoder provided in the embodiments of the present disclosure is described in detail by taking the cloud side device 102 as a cloud server and the end side device 104 as a notebook computer as an example.
As shown in fig. 1, in the implementation, the target image encoder training is performed on the cloud-side device 102; first, a first initial image and a second initial image are obtained, for example, a target image encoder obtained through training of an image encoder processing method is subsequently applied to a scene for feature encoding of a lung CT image, and then the first initial image and the second initial image in the image encoder processing method can be understood as human CT images of the same size or different sizes of two different users.
Specifically, image blocks with the same semantics are respectively determined from two human body CT images, such as a human body CT image 1 and a human body CT image 2, namely a first clipping image and a second clipping image, such as a lung CT image 1 in the human body CT image 1 and a lung CT image 2 in the human body CT image 2; performing mask processing on the first clipping image to obtain a first mask image, and performing mask processing on the second clipping image to obtain a second mask image; meanwhile, performing data enhancement (such as rotation, color noise and the like) on the first clipping image to obtain a first enhanced image, and performing data enhancement on the second clipping image to obtain a second enhanced image; and then training to obtain the target image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image and performing mask image modeling and contrast learning.
The cloud-side device 102 may send the target image encoder to the end-side device 104, where the end-side device 104 completes the downstream task processing in combination with other task processing models of the downstream task, such as an image segmentation model, etc.
Or when the end-side device 104 needs to use the target image encoder, the target image encoder obtained after training of the cloud-side device 102 can be called for functional use; in addition, if the computing resources and computing power of the end-side device 104 are sufficient, the target image encoder trained in the cloud-side device 102 may be deployed in the end-side device 104. The deployment implementation is specifically implemented according to practical application, and is not limited in any way herein.
According to the image encoder processing method provided by the embodiments of the specification, initial images with consistent semantics are mined from three-dimensional images of different target objects (such as CT images of different human bodies); then self-supervised registration is performed with a pre-trained SAM model to locate the same parts (such as lungs, liver and heart) in initial images of different volumes, generating two segmented images, namely a first clipping image and a second clipping image, that describe the same body part; then, a mask view and an enhanced view are generated from each of the two segmented images for masked image modeling and contrastive learning respectively, and class-specific invariance is modeled by using the same internal structure shared across different image volumes. This helps the target image encoder learn to adapt to changes in the size, shape and texture of a target part, completes the training of the target image encoder, and yields a more accurate target image encoder.
Referring to fig. 2, fig. 2 shows a flowchart of an image encoder processing method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 202: from the first initial image and the second initial image, a first clipping image and a second clipping image with the same semantics are determined.
Wherein the first initial image and the second initial image are the same type of image containing different target objects.
Specifically, the application scenarios of the image encoder processing method provided in the embodiments of the present disclosure are different, and the corresponding target objects are also different; for example, if the image encoder processing method is applied to a human medical scene, the target object may be understood as a target human body, and then the first initial image and the second initial image may be understood as images of different real human bodies of the same type, for example, the first initial image is a CT image of the real human body 1, the second initial image is a CT image of the real human body 2, or the first initial image is an MR (Magnetic Resonance, magnetic resonance imaging) image of the real human body 1, the second initial image is an MR image of the real human body 2, or the like; if the image encoder processing method is applied to an animal medical scene, the target object can be understood as a target animal body, and then the first initial image and the second initial image can be understood as images of different real animals of the same kind and of the same type, for example, the first initial image is a CT image of a puppy a, the second initial image is a CT image of a puppy b, or the first initial image is an MR image of a puppy a, the second initial image is an MR image of a puppy b, and the like.
I.e. in a human medical scene, the target object may be understood as a real human body, and the type may be understood as an imaging type of the real human body.
For easy understanding, the embodiment of the present disclosure applies the image encoder processing method to a human medical scene, where the target object is a target human body, and the first initial image and the second initial image are images of the same human body part and different real human bodies.
Specifically, the determining, from the first initial image and the second initial image, the first clipping image and the second clipping image with the same semantics includes:
determining a first initial image and a second initial image of the same type containing different target objects, wherein the first initial image and the second initial image are different in size;
inputting the first initial image and the second initial image into a semantic detection model to obtain a first clipping image and a second clipping image with the same semantic meaning in the first initial image and the second initial image.
The first initial image and the second initial image are CT images of the same type and containing different target human bodies; as described above, the first initial image may be a CT image including the real human body 1, and the second initial image may be a CT image including the real human body 2.
Also, the semantic detection model may be understood as a pre-trained SAM model that can localize semantically identical sites in the first and second initial images.
Still take as an example that the image encoder processing method is applied to a human medical scene.
First, a first initial image and a second initial image of the same imaging type containing different target human bodies are determined, wherein the first initial image and the second initial image are different in size; for example, when the two images are generated by scans of different CT machines, their sizes may differ. Of course, in practical applications, the target image encoder can also be obtained by the image encoder processing method provided in the embodiments of the present disclosure when the first initial image and the second initial image are images of the same size.
Then, the first initial image and the second initial image are input into a pre-trained SAM model, which performs self-supervised object detection to locate the body part with the same semantics in the differently sized first and second initial images, such as the heart or the liver in the CT image containing the target person 1 and the CT image containing the target person 2; the body parts with the same semantics in the two initial images are cropped, and the first clipping image and the second clipping image are generated and output, wherein the first clipping image can be understood as a liver image segmented from the CT image containing the target person 1, and the second clipping image can be understood as a liver image segmented from the CT image containing the target person 2.
In another possible embodiment, the first initial image and the second initial image may also be understood as images of smaller parts, for example the first initial image and the second initial image may be understood as: a cardiac CT image comprising a real human body 1 and a cardiac CT image comprising a real human body 2; then after inputting the first initial image and the second initial image into the semantic detection model, a first clipping image and a second clipping image are obtained, the first clipping image may be understood as a segmented image of the left atrium clipped from the cardiac CT image containing the real human body 1, the second clipping image may be understood as a segmented image of the left atrium clipped from the cardiac CT image containing the real human body 2, etc.
According to the image encoder processing method provided by the embodiment of the specification, according to the pre-trained semantic detection model, a first clipping image and a second clipping image with the same semantic meaning can be accurately obtained from a first initial image and a second initial image which contain different target objects and are of the same type; the accurate training of the target image encoder can be completed by learning similar features, dissimilar features and the like of the first cut image and the second cut image with the same semantic.
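By way of illustration only, the mining step described above can be sketched as follows in Python, assuming a pre-trained SAM-style network that produces dense per-voxel anatomical embeddings is available; the function names, the choice of reference voxel and the matching strategy are assumptions made for illustration and do not limit the embodiments.

```python
import torch
import torch.nn.functional as F

def mine_matching_crops(vol_a, vol_b, embed_net):
    """Hypothetical sketch: locate the voxel in vol_b whose anatomical
    embedding best matches a reference voxel in vol_a, so that
    same-semantic clipping images can be cropped around both centres.

    vol_a, vol_b: (1, 1, D, H, W) CT volumes, possibly of different sizes.
    embed_net:    assumed pre-trained SAM-style embedding network mapping
                  a volume to dense per-voxel features (1, C, d, h, w).
    """
    emb_a = embed_net(vol_a)
    emb_b = embed_net(vol_b)

    # Reference embedding: centre voxel of vol_a (illustrative choice).
    d, h, w = emb_a.shape[2:]
    query = emb_a[:, :, d // 2, h // 2, w // 2]          # (1, C)

    # Cosine similarity of the reference against every voxel of vol_b.
    flat_b = emb_b.flatten(2)                            # (1, C, d*h*w)
    sim = F.cosine_similarity(query.unsqueeze(-1), flat_b, dim=1)
    best = int(sim.argmax(dim=-1))                       # best-matching voxel

    # Convert the flat index back to (z, y, x) coordinates in vol_b.
    db, hb, wb = emb_b.shape[2:]
    bz, by, bx = best // (hb * wb), (best % (hb * wb)) // wb, best % wb
    return (d // 2, h // 2, w // 2), (bz, by, bx)
```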
Step 204: and performing mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image.
The masking process may be understood as masking certain feature areas in the first cropped image and the second cropped image.
Specifically, after the first clipping image and the second clipping image with the same semantic meaning are determined, masking processing is performed on the first clipping image and the second clipping image respectively, for example, masking is performed on some image blocks in the first clipping image, masking is performed on some image blocks in the second clipping image, that is, random masking is performed on the first clipping image and the second clipping image respectively, so as to obtain a first mask image and a second mask image.
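A minimal sketch of this random masking on a patchified clipping image follows; the 75% mask ratio and the patch-sequence representation are assumptions borrowed from common masked-image-modeling practice rather than values fixed by this embodiment, and the returned boolean mask corresponds to the mask-position vectors used by the online decoder in later embodiments.

```python
import torch

def random_mask(patches, mask_ratio=0.75):
    """Minimal sketch of random patch masking for one clipping image.

    patches: (B, N, D) sequence of patch embeddings.
    Returns the visible patches and a boolean mask whose True entries
    mark masked positions (usable as mask-position vectors).
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    # Independent random permutation per sample; keep the first n_keep.
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]

    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, dtype=torch.bool)   # True = masked position
    mask.scatter_(1, ids_keep, False)           # visible positions -> False
    return visible, mask
```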
Step 206: and carrying out data enhancement on the first clipping image and the second clipping image to obtain a first enhancement image and a second enhancement image.
The data enhancement includes, but is not limited to, randomly resizing the segmented image, rotating the segmented image, flipping, scaling, or adding color noise, etc.
Specifically, while the first clipping image and the second clipping image are subjected to mask processing, data enhancement is also performed on each of them respectively, so as to obtain a first enhanced image and a second enhanced image.
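For illustration, the data enhancement step might look as follows for a three-dimensional clipping image; the specific operations (flip, 90-degree rotation, intensity noise, random rescaling) and their parameters are assumptions consistent with the examples listed above.

```python
import torch
import torch.nn.functional as F

def augment(crop):
    """Sketch of a data enhancement pipeline for a (1, 1, D, H, W) crop;
    operations and magnitudes are illustrative assumptions."""
    x = crop
    if torch.rand(1) < 0.5:                        # random flip
        x = torch.flip(x, dims=[-1])
    k = int(torch.randint(0, 4, (1,)))             # random 90-degree rotation
    x = torch.rot90(x, k, dims=[-2, -1])
    x = x + 0.01 * torch.randn_like(x)             # additive intensity noise
    scale = 0.8 + 0.4 * torch.rand(1).item()       # random rescale, then
    size = [max(1, int(s * scale)) for s in x.shape[2:]]
    x = F.interpolate(x, size=size, mode="trilinear", align_corners=False)
    x = F.interpolate(x, size=crop.shape[2:],      # restore the original size
                      mode="trilinear", align_corners=False)
    return x
```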
Referring to fig. 3, fig. 3 is a schematic diagram illustrating image processing of a split image in an image encoder processing method according to an embodiment of the present disclosure.
Fig. 3 includes the first initial image and the second initial image. Specifically, the first initial image and the second initial image are input into a SAM model (the SAM model can directly locate the parts with the same semantics in the two initial images, such as the parts in the square frames of the two initial images in fig. 3, and crop and output those parts), so as to obtain the first clipping image Q and the second clipping image K with the same semantics cropped and output by the SAM model. Then, mask processing is performed on the first clipping image Q to obtain the first mask image, and mask processing is performed on the second clipping image K to obtain the second mask image; furthermore, data enhancement is performed on the first clipping image Q to obtain the first enhanced image, and on the second clipping image K to obtain the second enhanced image. Here, u and r in fig. 3 represent the random masks applied to the first clipping image and the second clipping image respectively, while w and v represent different data enhancement modes.
Step 208: and adjusting network parameters of a first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder.
Specifically, after the first mask image, the second mask image, the first enhanced image, and the second enhanced image are obtained according to the above steps, the network parameters of the first image encoder may be adjusted according to the first mask image, the second mask image, the first enhanced image, and the second enhanced image, to obtain the target image encoder.
According to the image encoder processing method provided by the embodiments of the specification, first, a first clipping image and a second clipping image with the same semantics are mined from different first and second initial images, and two segmented images describing the same part are generated, so that class-specific invariance is modeled by using the structural similarity of different target objects of the same type in the process of obtaining the target image encoder; and the global semantics of the first mask image and the second mask image are supplemented according to the first enhanced image and the second enhanced image, so that more accurate feature representations are learned by way of contrastive views, and downstream task processing can then be performed based on these more accurate feature representations.
According to the above embodiments, the image encoder processing method provided in the embodiments of the present disclosure completes training acquisition of the target image encoder by performing mask image modeling and contrast learning; therefore, in practical application, the decoding loss function and the semantic loss function need to be calculated through the first mask image, the second mask image, the first enhanced image and the second enhanced image, so as to realize subsequent training of the target image encoder. The specific implementation mode is as follows:
the adjusting network parameters of the first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder includes:
inputting the first mask image and the second mask image into a first image encoder to obtain a first image feature and a second image feature, and determining a first mask position in the first image feature and a second mask position in the second image feature;
inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image feature and a fourth image feature;
Determining a decoding loss function and a semantic loss function according to the first image feature, the first mask position, the second image feature, the second mask position, the third image feature and the fourth image feature;
and adjusting network parameters of the first image encoder according to the decoding loss function and the semantic loss function to obtain a target image encoder.
Wherein the first image Encoder may be understood as an Online Encoder including, but not limited to, an Encoder of Vision Transformer architecture, and the second image Encoder may be understood as a Target Encoder.
Specifically, inputting a first mask image and a second mask image into an online encoder to obtain a first image feature corresponding to the first mask image output by the online encoder, a first mask position in the first image feature, a second image feature corresponding to the second mask image, and a second mask position in the second image feature, wherein the first image feature can be understood as an image feature of a part of the first mask image which is not blocked by a mask; similarly, the second image feature may be understood as an image feature of a portion of the second mask image that is not blocked by the mask.
In practical application, the first mask position in the first image feature may be understood as a feature position in the first image feature, which is shielded (i.e. blocked) after being processed by the mask; the second mask position in the second image feature is understood to be the position of the feature in the second image feature that is masked out after the masking process. In particular, the first mask position in the first image feature and the second mask position in the second image feature may be determined directly from the first mask image and the second mask image, that is, two sets of vectors (for representing the first mask position and the second mask position) generated directly from the first mask image and the second mask image may be used to tell which positions in the subsequent online decoder view are blocked by the mask.
According to the image encoder processing method provided by the embodiment of the specification, training of the first image encoder is realized by calculating the decoding loss function and the semantic loss function, in mask image modeling, the network (the first image encoder) recovers global overall information from local image block information, so that the relation of characteristics between the local and the overall is learned, and the understanding of the network on the visual information and scenes is facilitated; training of the first image encoder is achieved through calculation of the semantic loss function, in contrast learning, the network can learn common characteristic information and invariance of local semantic characteristics in different images, and robustness of the network is greatly improved.
And simultaneously, inputting the first enhanced image and the second enhanced image into a target encoder to obtain a third image characteristic corresponding to the first enhanced image and a fourth image characteristic corresponding to the second enhanced image output by the target encoder. In specific implementation, after the first enhanced image and the second enhanced image are input into a target encoder to perform feature encoding, the encoding features are subjected to linear processing through an MLP (Multilayer Perceptron, multi-layer perceptron) respectively, and then a third image feature corresponding to the first enhanced image and a fourth image feature corresponding to the second enhanced image are obtained. The specific implementation mode is as follows:
the inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image feature and a fourth image feature, including:
and inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image characteristic and a fourth image characteristic output by a multi-layer perception network of the second image encoder.
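As an illustrative sketch, the multi-layer perception network of the second image encoder can be pictured as a small projection head; the 3-layer shape follows the later description of the linear mapping heads, while the widths and activation function are assumptions.

```python
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Sketch of the multi-layer perception network attached to the
    second (target) image encoder; dimensions are assumptions."""
    def __init__(self, dim=768, hidden=2048, out=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out))

    def forward(self, x):                    # x: (B, N, dim) token features
        return self.net(x)
```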
After the first image feature, the first mask position, the second image feature, the second mask position, the third image feature and the fourth image feature are determined, a decoding loss function and a semantic loss function can be calculated according to the parameters; and finally, according to the decoding loss function and the semantic loss function, adjusting network parameters of the first image encoder to train the first image encoder so as to obtain a more accurate target image encoder.
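Putting these steps together, one training iteration might be sketched as follows; random_mask_image, augment, decode_loss and semantic_loss are hypothetical stand-ins for the operations defined by the surrounding embodiments, and treating the target encoder as a gradient-free source of supervision is likewise an assumption.

```python
import torch

def train_step(Q, K, online_enc, target_enc, decoder, optimizer,
               random_mask_image, augment, decode_loss, semantic_loss):
    """One hedged training iteration over a pair of same-semantic
    clipping images Q and K; the helper callables are hypothetical."""
    Qm, pos_q = random_mask_image(Q)        # first mask image + mask positions
    Km, pos_k = random_mask_image(K)        # second mask image + mask positions
    Qa, Ka = augment(Q), augment(K)         # first / second enhanced images

    f1 = online_enc(Qm)                     # first image feature
    f2 = online_enc(Km)                     # second image feature
    with torch.no_grad():                   # assumption: target encoder only
        f3 = target_enc(Qa)                 # supervises; third image feature
        f4 = target_enc(Ka)                 # fourth image feature

    dec_q = decoder(f1, pos_q)              # first decoded image
    dec_k = decoder(f2, pos_k)              # second decoded image

    # Decoding losses against the original clipping images, plus the
    # semantic loss over the four image features.
    loss = (decode_loss(dec_q, Q) + decode_loss(dec_k, K)
            + semantic_loss(f1, f2, f3, f4))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # steps the online encoder/decoder
    return float(loss)
```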
Then, the specific implementation of calculating the decoding loss function and the semantic loss function according to the first image feature, the first mask position, the second image feature, the second mask position, the third image feature and the fourth image feature is as follows:
the determining a decoding loss function and a semantic loss function according to the first image feature, the first mask position, the second image feature, the second mask position, the third image feature, and the fourth image feature includes:
inputting the first image feature and the first mask position into an image decoder to obtain a first decoded image, and inputting the second image feature and the second mask position into the image decoder to obtain a second decoded image;
obtaining a first decoding loss function according to the first clipping image and the first decoding image, and obtaining a second decoding loss function according to the second clipping image and the second decoding image;
and obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature and the fourth image feature.
Wherein an image decoder may be understood as an online decoder corresponding to the first image encoder.
Specifically, after the first image feature, the first mask position, the second image feature and the second mask position are obtained, respectively inputting the first image feature and the first mask position into an online decoder for decoding to obtain a first decoded image; and inputting the second image characteristics and the second mask positions into an online decoder for decoding to obtain a second decoded image.
Then, according to the first clipping image and the first decoding image, a first decoding loss function is obtained; obtaining a second decoding loss function according to the second clipping image and the second decoding image; i.e. the decoding loss function comprises a first decoding loss function and a second decoding loss function.
And simultaneously, calculating to obtain a semantic loss function according to the first image feature, the second image feature, the third image feature and the fourth image feature.
In the image encoder processing method provided in the embodiment of the present disclosure, a first image feature, a first mask position, a second image feature, and a second mask position are input into an online decoder corresponding to an online encoder, and the online decoder learns to reconstruct a first cropping image and a second cropping image, that is, a first decoding image and a second decoding image, according to the first image feature, the first mask position, the second image feature, and the second mask position; calculating a loss function according to the original first clipping image and the second clipping image and the first clipping image and the second clipping image which are learned and reconstructed, so as to realize mask image modeling; meanwhile, according to the first image feature, the second image feature, the third image feature and the fourth image feature, a semantic loss function is obtained through calculation, contrast learning is achieved, and training of the online encoder can be completed according to the first decoding loss function, the second decoding loss function and the semantic loss function to obtain an accurate online encoder.
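For illustration, one of the decoding loss functions might be realized as a per-patch regression between the decoded image and the original clipping image; restricting the loss to the masked positions is an assumption borrowed from standard masked image modeling, as this embodiment only requires comparing the decoded image with the clipping image.

```python
import torch.nn.functional as F

def decode_loss(pred, target, mask):
    """Sketch of a decoding loss on patchified images.

    pred, target: (B, N, D) patchified decoded / original clipping images.
    mask:         (B, N) boolean, True where a patch was masked out.
    """
    per_patch = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
    # Average only over masked patches (assumed MIM-style convention).
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```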
The semantic loss function also includes a first semantic loss function and a second semantic loss function, specifically, the obtaining the semantic loss function according to the first image feature, the second image feature, the third image feature and the fourth image feature includes:
and obtaining a first semantic loss function according to the first image feature and the third image feature, and obtaining a second semantic loss function according to the second image feature and the fourth image feature.
The network parameters of the first image encoder can be adjusted according to the first decoding loss function, the second decoding loss function, the first semantic loss function and the second semantic loss function, so that the first image encoder with the adjusted network parameters is determined to be an accurate target image encoder.
According to the image encoder processing method provided by the embodiments of the specification, the training of the first image encoder is realized by calculating the first decoding loss function, the second decoding loss function, the first semantic loss function and the second semantic loss function. In masked image modeling (i.e., the calculation process of the decoding loss functions), the network recovers global information from local image block information, thereby learning the relationship between local and global features, which benefits the network's understanding of visual information and scenes. In contrastive learning (i.e., the calculation process of the semantic loss functions), the network learns the feature information common to different images through an inter-volume loss function (i.e., the global semantic loss function in the following embodiments), such as the similar body structures and organ positions in human CT scans, which helps the network understand the anatomical structure of human CT and enables downstream tasks; the network also learns the invariance of local semantic features through an intra-volume loss function (i.e., the local semantic loss function in the following embodiments); for example, within each human CT scan, the semantic information of local parts such as organs and tissues remains unchanged after a certain amount of deformation. Learning such invariance features helps enhance the robustness of the network and allows foreground parts such as different organs and tissues to be understood accurately in downstream tasks.
In practical application, because the first image feature and the second image feature are image features after mask processing, they carry less semantic information; if the distance between the masked image features and the data-enhanced image features were shortened directly, the resulting feature representation would not be very accurate. Therefore, in the image encoder processing method provided in the embodiments of the present disclosure, the image feature of a certain decoding layer in the online decoder may instead be pulled closer to the data-enhanced feature, so as to realize contrastive learning. The specific implementation is as follows:
the specific implementation manner of obtaining the first semantic loss function according to the first image feature and the third image feature, and obtaining the second semantic loss function according to the second image feature and the fourth image feature is as follows:
the determining a decoding loss function and a semantic loss function according to the first image feature, the first mask position, the second image feature, the second mask position, the third image feature, and the fourth image feature includes:
inputting the first image feature and the first mask position into an image decoder to obtain a first decoded image, a fifth image feature output by a multi-layer perception network of the image decoder, and inputting the second image feature and the second mask position into the image decoder to obtain a second decoded image and a sixth image feature output by the multi-layer perception network of the image decoder;
Obtaining a first decoding loss function according to the first clipping image and the first decoding image, and obtaining a second decoding loss function according to the second clipping image and the second decoding image;
and obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature.
Specifically, after the first image feature, the first mask position, the second image feature and the second mask position are obtained, the first image feature and the first mask position are input into the online decoder for decoding, so as to obtain a first decoded image output by the online decoder and a fifth image feature output by a multi-layer perception network (MLP) of the online decoder, where the multi-layer perception network may be connected to any decoding layer of the online decoder, such as a shallow or a deep decoding layer; the embodiments of the present disclosure place no limitation on this.
Similarly, the second image feature and the second mask position are input into the online decoder for decoding, so as to obtain a second decoded image output by the online decoder and a sixth image feature output by the multi-layer perception network of the online decoder.
Then, according to the first clipping image and the first decoding image, a first decoding loss function is obtained; obtaining a second decoding loss function according to the second clipping image and the second decoding image; i.e. the decoding loss function comprises a first decoding loss function and a second decoding loss function.
Meanwhile, according to the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature, a semantic loss function is obtained through calculation.
In the image encoder processing method provided in the embodiment of the present disclosure, a first image feature, a first mask position, a second image feature, and a second mask position are input into an online decoder corresponding to an online encoder, and the online decoder learns to reconstruct a first cropping image and a second cropping image, that is, a first decoding image and a second decoding image, according to the first image feature, the first mask position, the second image feature, and the second mask position; calculating a loss function according to the original first clipping image and the second clipping image and the first clipping image and the second clipping image which are learned and reconstructed, so as to realize mask image modeling; meanwhile, according to the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature, a semantic loss function is obtained through calculation, contrast learning is achieved, and training of an online encoder can be completed according to the first decoding loss function, the second decoding loss function and the semantic loss function in the follow-up process, so that an accurate online encoder is obtained.
In a specific implementation, the obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature, and the sixth image feature includes:
global average pooling processing is carried out on the third image feature, the fourth image feature, the fifth image feature and the sixth image feature, and a global semantic loss function is obtained according to the processed third image feature, fourth image feature, fifth image feature and sixth image feature;
and inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature into a semantic alignment network for processing to obtain a local semantic loss function.
The semantic alignment network can be understood as CASA (Conditional Anatomical Semantic Alignment), which can adaptively learn the semantic anatomical features most similar to an occluded view within the feature embedding of the original volume data in the field of human medicine, thereby aligning the contrastively learned feature sample pairs and producing more specific and consistent positive pairs for self-distillation.
In practical applications, a linear mapping head is added to the target encoder and the online decoder respectively to generate positive feature pair 1: the fourth image feature and the fifth image feature; and positive feature pair 2: the third image feature and the sixth image feature. Each of the two linear mapping heads can be formed by a 3-layer multi-layer perceptron. After the positive feature pairs are obtained, global average pooling may be used to pool the fourth, fifth, third and sixth image features and obtain their global visual semantics, namely the globally average-pooled fourth, fifth, third and sixth image features. Finally, a model learning task is defined by constraining the high-level feature representations of the positive sample pair (i.e., the first initial image and the second initial image make up the positive sample pair) to be closer in distance in their corresponding feature space; in this process, a global semantic loss function is calculated from the fourth image feature, the fifth image feature, the third image feature and the sixth image feature.
Thus, by optimizing the relationships among different human body data in a human medical scene, the subsequent online encoder explicitly learns universally consistent features from the internal body structure and models anatomical invariance, which is robust to the diversity in size, shape, intensity and texture of body parts caused by inter-subject variation, organ deformation and pathological changes.
To further exploit the semantic information learning capability obtained from self-distillation, local semantic relationships inside the data can be optimized by maximizing the similarity between views from the same human data; inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature into a semantic alignment network for processing to obtain a local semantic loss function; therefore, the subsequent online encoder can adaptively learn semantic anatomical features most similar to the occlusion view in feature embedding of the original human body data in the field of human body medicine.
Specifically, the obtaining the global semantic loss function according to the processed third image feature, fourth image feature, fifth image feature and sixth image feature includes:
according to the processed third image feature and the processed sixth image feature, a first global semantic loss function is obtained, and according to the processed fourth image feature and the processed fifth image feature, a second global semantic loss function is obtained.
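Following the pairing just described (third with sixth, fourth with fifth), the global semantic loss might be sketched as below; pooling over tokens by global averaging and using a negative cosine similarity as the distance are assumptions, since the embodiment only requires that the positive pairs be pulled closer.

```python
import torch.nn.functional as F

def global_semantic_loss(f3, f4, f5, f6):
    """Sketch of the global semantic term over (B, N, D) token features
    from the two mapping heads; the cosine distance is assumed."""
    g3, g4, g5, g6 = (f.mean(dim=1) for f in (f3, f4, f5, f6))  # GAP

    def pull(a, b):          # pull one positive pair together
        return -F.cosine_similarity(a, b, dim=-1).mean()

    # First global loss: (third, sixth); second: (fourth, fifth).
    return pull(g3, g6) + pull(g4, g5)
```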
Similarly, the inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature into a semantic alignment network for processing to obtain a local semantic loss function includes:
Inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature, and the sixth image feature into a semantic alignment network;
in the semantic alignment network, respectively taking the first image feature and the second image feature as query features, and performing self-attention mechanism learning on the third image feature and the fourth image feature to obtain a seventh image feature and an eighth image feature;
in the semantic alignment network, respectively taking the first image feature and the second image feature as query features, and performing self-attention mechanism learning on the fifth image feature and the sixth image feature to obtain a ninth image feature and a tenth image feature;
obtaining a first local semantic loss function according to the seventh image feature and the ninth image feature, and obtaining a second local semantic loss function according to the eighth image feature and the tenth image feature.
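A minimal sketch of this conditional alignment step is given below; a single-head scaled dot-product attention without learned projections is assumed, since the embodiment does not specify the attention parametrization, and all tensor names are illustrative:

    import torch
    import torch.nn.functional as F

    def casa_attend(query_feat, context_feat):
        # query_feat: (B, Nq, D), e.g. the first or second image feature
        # context_feat: (B, Nk, D), e.g. one of the third to sixth image features
        d = query_feat.shape[-1]
        attn = torch.softmax(
            query_feat @ context_feat.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ context_feat  # aligned feature (seventh to tenth)

    def local_semantic_loss(feat_a, feat_b):
        # pull the two aligned token sets together (negative cosine, assumed)
        a = F.normalize(feat_a, dim=-1)
        b = F.normalize(feat_b, dim=-1)
        return -(a * b).sum(dim=-1).mean()

    # f7 = casa_attend(f1, f3); f8 = casa_attend(f2, f4)
    # f9 = casa_attend(f1, f5); f10 = casa_attend(f2, f6)
    # loss_local_1 = local_semantic_loss(f7, f9)
    # loss_local_2 = local_semantic_loss(f8, f10)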
Then, after the first decoding loss function, the second decoding loss function, the first global semantic loss function, the second global semantic loss function, the first local semantic loss function and the second local semantic loss function are obtained, the network parameters of the first image encoder may be adjusted according to these loss functions, so as to obtain the target image encoder.
In the image encoder processing method provided in the embodiment of the present disclosure, an anatomical semantic alignment and category invariance model framework (i.e., a framework including the on-line encoder, the target encoder and the on-line decoder of the embodiments of the present disclosure) is used in a medical scene. Two registered image blocks (i.e., segmented images with the same semantics) are first acquired from CT images of different human bodies using a SAM model, and four different views of the image blocks (enhanced views and mask views) are then input to the on-line encoder and the target encoder, respectively. The on-line encoder randomly masks an image and operates only on the remaining visible image content, the target encoder operates on the whole view, and the on-line decoder learns to reconstruct the input, thereby performing mask image modeling; the scheme further provides a Conditional Anatomical Semantic Alignment (CASA) module to correct the contrastive pairs and obtain better contrast, so that the coding accuracy of the target image encoder is improved.
That is, the image encoder processing method provided by the embodiment of the present specification models the invariance of specific semantics by exploiting anatomical similarity across medical images. A conditional anatomical feature alignment module (i.e., CASA) is provided in the anatomical semantic alignment and category invariance model, and the most relevant high-level semantics contrasted between the on-line encoder and the target encoder are matched by supplementing the globally aligned anatomical semantics and the inter-patch topology of the mask views, so that the coding accuracy of the trained target image encoder is improved.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a specific processing procedure of an image encoder processing method according to an embodiment of the present disclosure.
As can be seen in connection with FIG. 3, the formula symbols in FIG. 4 denote, respectively, the first mask image, the second mask image, the first enhanced image and the second enhanced image.
For the generation manners of the first mask image, the second mask image, the first enhanced image and the second enhanced image, reference may be made to the description of the above embodiments, and the description thereof will not be repeated here.
In practice, after the first mask image, the second mask image, the first enhanced image and the second enhanced image are obtained, the first mask image and the second mask image are input into the on-line encoder (i.e., the first image encoder of the above embodiment) to obtain the first image feature, the second image feature, and the mask positions (Mask Tokens) in the first image feature and in the second image feature. The first image feature, the second image feature and the Mask Tokens are then input into the on-line decoder (i.e., the image decoder of the above embodiment) for decoding, so as to obtain the first decoded image and the second decoded image. The first decoding loss function and the second decoding loss function are then calculated from the first decoded image and the second decoded image together with Q (the first initial image) and K (the second initial image), and the on-line encoder is trained based on the first decoding loss function and the second decoding loss function.
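For illustration, the masked-encoding and decoding flow just described can be sketched as follows; the encoder and decoder modules, the token bookkeeping (assuming the same number of visible patches per sample) and the squared-error reconstruction term are assumptions in the style of masked autoencoders, not details fixed by the embodiment:

    import torch
    import torch.nn as nn

    class MaskedBranch(nn.Module):
        # on-line encoder plus on-line decoder operating on a randomly masked view
        def __init__(self, encoder, decoder, dim):
            super().__init__()
            self.encoder = encoder
            self.decoder = decoder
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # learned Mask Token

        def forward(self, visible_patches, masked):
            # visible_patches: (B, N_visible, dim); masked: (B, N) boolean positions
            feats = self.encoder(visible_patches)  # operate on visible content only
            batch, num_patches = masked.shape
            full = self.mask_token.expand(batch, num_patches, feats.shape[-1]).clone()
            full[~masked] = feats.reshape(-1, feats.shape[-1])  # restore visible tokens
            return self.decoder(full)  # reconstruct the full view

    def decoding_loss(decoded, target, masked):
        # reconstruction error, evaluated on the masked patches
        return ((decoded - target) ** 2)[masked].mean()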
The first enhanced image and the second enhanced image described above are input into the target encoder (i.e., the second image encoder of the above embodiment) for encoding, and the target encoder outputs the third image feature and the fourth image feature; at the same time, the fifth image feature and the sixth image feature output by the multi-layer perceptron added to the on-line decoder are acquired. Positive feature pairs are then generated from the third, fourth, fifth and sixth image features, and global average pooling is applied to these features to obtain their global visual semantic representations, from which the global semantic loss functions, namely the first global semantic loss function and the second global semantic loss function described above, are calculated.
Meanwhile, in CASA, the first image feature and the second image feature are respectively taken as query features: self-attention mechanism learning is performed on the third image feature and the fourth image feature to obtain the seventh image feature and the eighth image feature, and self-attention mechanism learning is performed on the fifth image feature and the sixth image feature to obtain the ninth image feature and the tenth image feature. The local semantic loss functions, namely the first local semantic loss function and the second local semantic loss function described above, can subsequently be calculated from the seventh, eighth, ninth and tenth image features.
Finally, the on-line encoder can be trained according to the global semantic loss function (the inter-volume loss function) and the local semantic loss function (the intra-volume loss function). The two loss functions in the oval frame at the lower right corner of FIG. 4 differ mainly in their inputs: the input of the inter-volume loss function comes from different initial images, which serves to learn the commonality among different initial images, while the input of the intra-volume loss function comes from different views of the same initial image (i.e., different data-enhanced views), which serves to learn the commonality between local and whole information within the same image, thereby greatly improving the subsequent feature extraction effect of the on-line encoder.
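For exposition, the combination of the six losses may be written as below; the weighting coefficients are assumptions, as the embodiment does not state how the terms are balanced:

    def total_loss(decoding_losses, inter_volume_losses, intra_volume_losses,
                   w_inter=1.0, w_intra=1.0):
        # decoding_losses: the first and second decoding loss (mask image modeling)
        # inter_volume_losses: the first and second global semantic loss (across bodies)
        # intra_volume_losses: the first and second local semantic loss (within a body)
        return (sum(decoding_losses)
                + w_inter * sum(inter_volume_losses)
                + w_intra * sum(intra_volume_losses))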
Referring to fig. 5, fig. 5 shows a flow chart of an image segmentation method according to an embodiment of the present disclosure.
Step 502: determining a three-dimensional image of a target type of a target object, and inputting the three-dimensional image into a target image encoder to obtain the coded image features corresponding to the three-dimensional image.
The target image encoder is the target image encoder in the image encoder processing method.
Specifically, in a human medical scene, the three-dimensional image of the target type of the target object includes a CT image of the target human body.
Step 504: inputting the coded image features into an image segmentation model to obtain a segmented image of the target part.
The image segmentation model can be understood as a segmentation model of any network structure.
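A hypothetical inference flow for steps 502 and 504 is sketched below; the module names and the tensor shape are placeholders, not elements fixed by the embodiment:

    import torch

    @torch.no_grad()
    def segment_ct(volume, target_encoder, seg_model):
        # volume: a CT tensor of shape (1, 1, D, H, W) for the target human body
        target_encoder.eval()
        seg_model.eval()
        coded_features = target_encoder(volume)  # coded image features (step 502)
        return seg_model(coded_features)  # segmented image of the target part (step 504)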
According to the image segmentation method provided by the embodiment of the present specification, feature coding is performed on the three-dimensional image of the target type of the target object by using the target image encoder of the above embodiments, so that more accurate coded image features are obtained, and the subsequent image segmentation model can perform more accurate image segmentation on the three-dimensional image according to these coded image features.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of an image encoder processing apparatus, and fig. 6 shows a schematic structural diagram of an image encoder processing apparatus provided in one embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
a segmented image determination module 602 configured to determine a first cropped image and a second cropped image that have the same semantics from a first initial image and a second initial image, wherein the first initial image and the second initial image are the same type of image that contains different target objects;
A mask processing module 604 configured to perform mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image;
a data enhancement module 606 configured to perform data enhancement on the first cropped image and the second cropped image to obtain a first enhanced image and a second enhanced image;
an encoder determination module 608 configured to adjust network parameters of the first image encoder to obtain a target image encoder based on the first mask image, the second mask image, the first enhancement image, and the second enhancement image.
Optionally, the encoder determination module 608 is further configured to:
inputting the first mask image and the second mask image into a first image encoder to obtain a first image feature and a second image feature, and determining a first mask position in the first image feature and a second mask position in the second image feature;
inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image feature and a fourth image feature;
Determining a decoding loss function and a semantic loss function according to the first image feature, the first mask position, the second image feature, the second mask position, the third image feature and the fourth image feature;
and adjusting network parameters of the first image encoder according to the decoding loss function and the semantic loss function to obtain a target image encoder.
Optionally, the encoder determination module 608 is further configured to:
inputting the first image feature and the first mask position into an image decoder to obtain a first decoded image, and inputting the second image feature and the second mask position into the image decoder to obtain a second decoded image;
obtaining a first decoding loss function according to the first clipping image and the first decoding image, and obtaining a second decoding loss function according to the second clipping image and the second decoding image;
and obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature and the fourth image feature.
Optionally, the encoder determination module 608 is further configured to:
And obtaining a first semantic loss function according to the first image feature and the third image feature, and obtaining a second semantic loss function according to the second image feature and the fourth image feature.
Optionally, the encoder determination module 608 is further configured to:
inputting the first image feature and the first mask position into an image decoder to obtain a first decoded image, a fifth image feature output by a multi-layer perception network of the image decoder, and inputting the second image feature and the second mask position into the image decoder to obtain a second decoded image and a sixth image feature output by the multi-layer perception network of the image decoder;
obtaining a first decoding loss function according to the first clipping image and the first decoding image, and obtaining a second decoding loss function according to the second clipping image and the second decoding image;
and obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature.
Optionally, the encoder determination module 608 is further configured to:
global average pooling processing is carried out on the third image feature, the fourth image feature, the fifth image feature and the sixth image feature, and a global semantic loss function is obtained according to the processed third image feature, fourth image feature, fifth image feature and sixth image feature;
and inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature into a semantic alignment network for processing to obtain a local semantic loss function.
Optionally, the encoder determination module 608 is further configured to:
according to the processed third image feature and the processed sixth image feature, a first global semantic loss function is obtained, and according to the processed fourth image feature and the processed fifth image feature, a second global semantic loss function is obtained.
Optionally, the encoder determination module 608 is further configured to:
inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature, and the sixth image feature into a semantic alignment network;
In the semantic alignment network, respectively taking the first image feature and the second image feature as query features, and performing self-attention mechanism learning on the third image feature and the fourth image feature to obtain a seventh image feature and an eighth image feature;
in the semantic alignment network, respectively taking the first image feature and the second image feature as query features, and performing self-attention mechanism learning on the fifth image feature and the sixth image feature to obtain a ninth image feature and a tenth image feature;
obtaining a first local semantic loss function according to the seventh image feature and the ninth image feature, and obtaining a second local semantic loss function according to the eighth image feature and the tenth image feature.
Optionally, the encoder determination module 608 is further configured to:
and inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image characteristic and a fourth image characteristic output by a multi-layer perception network of the second image encoder.
Optionally, the encoder determination module 608 is further configured to:
Determining a first initial image and a second initial image of the same type containing different target objects, wherein the first initial image and the second initial image are different in size;
inputting the first initial image and the second initial image into a semantic detection model to obtain a first clipping image and a second clipping image with the same semantic meaning in the first initial image and the second initial image.
Optionally, the first initial image and the second initial image are CT images including different target human bodies.
According to the image encoder processing apparatus provided by the embodiment of the present specification, a first clipping image and a second clipping image with the same semantics are first mined from different first and second initial images, generating two segmentation images describing the same part, so that in the process of obtaining the target image encoder, type-specific invariance is modeled by utilizing the structural similarity of the same type across different target objects; and the global semantics of the first mask image and the second mask image are supplemented according to the first enhanced image and the second enhanced image, so that more accurate feature representations are learned by contrasting views, and downstream task processing can be performed based on these more accurate feature representations.
The above is a schematic solution of an image encoder processing apparatus of the present embodiment. It should be noted that the technical solution of the image encoder processing apparatus and the technical solution of the image encoder processing method belong to the same concept; for details of the technical solution of the image encoder processing apparatus that are not described in detail, reference may be made to the description of the technical solution of the image encoder processing method.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of an image segmentation apparatus, and fig. 7 shows a schematic structural diagram of an image segmentation apparatus according to one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
the feature obtaining module 702 is configured to determine a three-dimensional image of a target type of a target object, and input the three-dimensional image into a target image encoder to obtain an encoded image feature corresponding to the three-dimensional image, wherein the target image encoder is a target image encoder in the image encoder processing method;
an image segmentation module 704 configured to input the coded image features into an image segmentation model to obtain a segmented image of the target part.
Optionally, the three-dimensional image of the target type of the target object comprises a CT image of the target human body.
According to the image segmentation apparatus provided by the embodiment of the present specification, feature coding is performed on the three-dimensional image of the target type of the target object by using the target image encoder of the above embodiments, so that more accurate coded image features are obtained, and the subsequent image segmentation model can perform more accurate image segmentation on the three-dimensional image according to these coded image features.
The above is a schematic solution of an image segmentation apparatus of the present embodiment. It should be noted that the technical solution of the image segmentation apparatus and the technical solution of the image segmentation method belong to the same concept; for details of the technical solution of the image segmentation apparatus that are not described in detail, reference may be made to the description of the technical solution of the image segmentation method.
Fig. 8 illustrates a block diagram of a computing device 800 provided in accordance with one embodiment of the present description. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830 and database 850 is used to hold data.
Computing device 800 also includes an access device 840 that enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of network interface, wired or wireless, such as a Network Interface Controller (NIC), e.g., an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smart phone), a wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC). Computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the image encoder processing method or the image segmentation method described above. The foregoing is a schematic illustration of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solutions of the image encoder processing method and the image segmentation method belong to the same concept; for details of the technical solution of the computing device that are not described in detail, reference may be made to the description of the technical solution of the image encoder processing method or the image segmentation method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described image encoder processing method or image segmentation method.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the image encoder processing method or the image segmentation method described above belong to the same concept; for details of the technical solution of the storage medium that are not described in detail, reference may be made to the description of the technical solution of the image encoder processing method or the image segmentation method described above.
An embodiment of the present disclosure also provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the above-described image encoder processing method or image segmentation method.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the image encoder processing method or the image segmentation method belong to the same concept; for details of the technical solution of the computer program that are not described in detail, reference may be made to the description of the technical solution of the image encoder processing method or the image segmentation method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely intended to help clarify the present specification. The alternative embodiments are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to understand and utilize the invention. This specification is to be limited only by the claims and their full scope and equivalents.

Claims (14)

1. An image encoder processing method, comprising:
determining a first clipping image and a second clipping image with the same semantics from a first initial image and a second initial image, wherein the first initial image and the second initial image are images of the same type containing different target objects;
performing mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image;
performing data enhancement on the first clipping image and the second clipping image to obtain a first enhancement image and a second enhancement image;
and adjusting network parameters of a first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder.
2. The image encoder processing method of claim 1, wherein the adjusting network parameters of the first image encoder to obtain the target image encoder according to the first mask image, the second mask image, the first enhanced image, and the second enhanced image comprises:
inputting the first mask image and the second mask image into a first image encoder to obtain a first image feature and a second image feature, and determining a first mask position in the first image feature and a second mask position in the second image feature;
Inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image feature and a fourth image feature;
determining a decoding loss function and a semantic loss function according to the first image feature, the first mask position, the second image feature, the second mask position, the third image feature and the fourth image feature;
and adjusting network parameters of the first image encoder according to the decoding loss function and the semantic loss function to obtain a target image encoder.
3. The image encoder processing method of claim 2, the determining a decoding loss function and a semantic loss function from the first image feature, the first mask position, the second image feature, the second mask position, the third image feature, and the fourth image feature, comprising:
inputting the first image feature and the first mask position into an image decoder to obtain a first decoded image, and inputting the second image feature and the second mask position into the image decoder to obtain a second decoded image;
obtaining a first decoding loss function according to the first clipping image and the first decoding image, and obtaining a second decoding loss function according to the second clipping image and the second decoding image;
And obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature and the fourth image feature.
4. The image encoder processing method of claim 3, the obtaining a semantic loss function from the first image feature, the second image feature, the third image feature, and the fourth image feature, comprising:
and obtaining a first semantic loss function according to the first image feature and the third image feature, and obtaining a second semantic loss function according to the second image feature and the fourth image feature.
5. The image encoder processing method of claim 2, the determining a decoding loss function and a semantic loss function from the first image feature, the first mask position, the second image feature, the second mask position, the third image feature, and the fourth image feature, comprising:
inputting the first image feature and the first mask position into an image decoder to obtain a first decoded image, a fifth image feature output by a multi-layer perception network of the image decoder, and inputting the second image feature and the second mask position into the image decoder to obtain a second decoded image and a sixth image feature output by the multi-layer perception network of the image decoder;
Obtaining a first decoding loss function according to the first clipping image and the first decoding image, and obtaining a second decoding loss function according to the second clipping image and the second decoding image;
and obtaining a semantic loss function according to the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature.
6. The image encoder processing method of claim 5, the obtaining a semantic loss function from the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature, the sixth image feature, comprising:
global average pooling processing is carried out on the third image feature, the fourth image feature, the fifth image feature and the sixth image feature, and a global semantic loss function is obtained according to the processed third image feature, fourth image feature, fifth image feature and sixth image feature;
and inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature into a semantic alignment network for processing to obtain a local semantic loss function.
7. The image encoder processing method of claim 6, wherein the obtaining the global semantic loss function according to the processed third image feature, fourth image feature, fifth image feature, and sixth image feature comprises:
according to the processed third image feature and the processed sixth image feature, a first global semantic loss function is obtained, and according to the processed fourth image feature and the processed fifth image feature, a second global semantic loss function is obtained.
8. The image encoder processing method of claim 6, the inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature and the sixth image feature into a semantic alignment network for processing to obtain a local semantic loss function, comprising:
inputting the first image feature, the second image feature, the third image feature, the fourth image feature, the fifth image feature, and the sixth image feature into a semantic alignment network;
in the semantic alignment network, respectively taking the first image feature and the second image feature as query features, and performing self-attention mechanism learning on the third image feature and the fourth image feature to obtain a seventh image feature and an eighth image feature;
In the semantic alignment network, respectively taking the first image feature and the second image feature as query features, and performing self-attention mechanism learning on the fifth image feature and the sixth image feature to obtain a ninth image feature and a tenth image feature;
obtaining a first local semantic loss function according to the seventh image feature and the ninth image feature, and obtaining a second local semantic loss function according to the eighth image feature and the tenth image feature.
9. The image encoder processing method of claim 2, the inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image feature and a fourth image feature, comprising:
and inputting the first enhanced image and the second enhanced image into a second image encoder to obtain a third image characteristic and a fourth image characteristic output by a multi-layer perception network of the second image encoder.
10. The image encoder processing method of claim 1, wherein determining the first cropped image and the second cropped image having the same semantics from the first initial image and the second initial image comprises:
Determining a first initial image and a second initial image of the same type containing different target objects, wherein the first initial image and the second initial image are different in size;
inputting the first initial image and the second initial image into a semantic detection model to obtain a first clipping image and a second clipping image with the same semantic meaning in the first initial image and the second initial image.
11. The image encoder processing method of any of claims 1-10, the first initial image and the second initial image being CT images comprising different target human bodies.
12. An image encoder processing apparatus comprising:
a segmented image determination module configured to determine a first cropped image and a second cropped image that have the same semantics from a first initial image and a second initial image, wherein the first initial image and the second initial image are images of the same type that contain different target objects;
the mask processing module is configured to perform mask processing on the first clipping image and the second clipping image to obtain a first mask image and a second mask image;
The data enhancement module is configured to perform data enhancement on the first clipping image and the second clipping image to obtain a first enhancement image and a second enhancement image;
and the encoder determining module is configured to adjust network parameters of the first image encoder according to the first mask image, the second mask image, the first enhanced image and the second enhanced image to obtain a target image encoder.
13. An image segmentation method, comprising:
determining a three-dimensional image of a target type of a target object, and inputting the three-dimensional image into a target image encoder to obtain coded image features corresponding to the three-dimensional image, wherein the target image encoder is a target image encoder in the image encoder processing method according to any one of claims 1-11;
and inputting the coded image features into an image segmentation model to obtain a segmented image of the target part.
14. The image segmentation method according to claim 13, wherein the three-dimensional image of the target type of the target object comprises a CT image of a target human body.
CN202310089066.1A 2023-01-16 2023-01-16 Image encoder processing method and device and image segmentation method Active CN116109655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310089066.1A CN116109655B (en) 2023-01-16 2023-01-16 Image encoder processing method and device and image segmentation method

Publications (2)

Publication Number Publication Date
CN116109655A true CN116109655A (en) 2023-05-12
CN116109655B CN116109655B (en) 2024-06-25

Family

ID=86259479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310089066.1A Active CN116109655B (en) 2023-01-16 2023-01-16 Image encoder processing method and device and image segmentation method

Country Status (1)

Country Link
CN (1) CN116109655B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308689A (en) * 2018-10-15 2019-02-05 聚时科技(上海)有限公司 The unsupervised image repair method of confrontation network migration study is generated based on mask
US20210082118A1 (en) * 2019-09-18 2021-03-18 Adobe Inc. Enhanced semantic segmentation of images
CN111612807A (en) * 2020-05-15 2020-09-01 北京工业大学 Small target image segmentation method based on scale and edge information
CN111667469A (en) * 2020-06-03 2020-09-15 北京小白世纪网络科技有限公司 Lung disease classification method, device and equipment
US20210397966A1 (en) * 2020-06-18 2021-12-23 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image segmentation
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning
CN112837234A (en) * 2021-01-25 2021-05-25 重庆师范大学 Human face image restoration method based on multi-column gating convolution network
CN113409329A (en) * 2021-06-03 2021-09-17 Oppo广东移动通信有限公司 Image processing method, image processing apparatus, terminal, and readable storage medium
CN114549840A (en) * 2022-02-23 2022-05-27 北京百度网讯科技有限公司 Training method of semantic segmentation model and semantic segmentation method and device
CN114511576A (en) * 2022-04-19 2022-05-17 山东建筑大学 Image segmentation method and system for scale self-adaptive feature enhanced deep neural network
CN115018727A (en) * 2022-06-14 2022-09-06 中国地质大学(武汉) Multi-scale image restoration method, storage medium and terminal
CN114972313A (en) * 2022-06-22 2022-08-30 北京航空航天大学 Image segmentation network pre-training method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AARON VAN DEN OORD et al.: "Representation Learning with Contrastive Predictive Coding", arXiv:1807.03748v2, 22 January 2019 *
RANJEET RANJAN JHA et al.: "PixISegNet: pixel-level iris segmentation network using convolutional encoder–decoder with stacked hourglass bottleneck", IET, 13 November 2019 *
WANG Dalei et al.: "Corrosion image segmentation and quantitative analysis based on deep neural networks", Journal of South China University of Technology (Natural Science Edition), 31 December 2018 *

Also Published As

Publication number Publication date
CN116109655B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
Hu et al. An overview of underwater vision enhancement: From traditional methods to recent deep learning
Zhang et al. Facial: Synthesizing dynamic talking face with implicit attribute learning
US10593021B1 (en) Motion deblurring using neural network architectures
CN111597946B (en) Processing method of image generator, image generation method and device
Xin et al. Facial attribute capsules for noise face super resolution
Sun et al. Super resolution reconstruction of images based on interpolation and full convolutional neural network and application in medical fields
Yao et al. A weighted feature transfer gan for medical image synthesis
Zuo et al. Minimum spanning forest with embedded edge inconsistency measurement model for guided depth map enhancement
CN115131289A (en) Training method of image processing model
CN115115772A (en) Key structure reconstruction method and device based on three-dimensional image and computer equipment
Jia et al. Saliency detection via a unified generative and discriminative model
CN115880720A (en) Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
Wang et al. Accurate lung nodule segmentation with detailed representation transfer and soft mask supervision
Chatterjee et al. A survey on techniques used in medical imaging processing
CN116704084B (en) Training method of facial animation generation network, facial animation generation method and device
Ling et al. Human object inpainting using manifold learning-based posture sequence estimation
Mansoor et al. Marginal shape deep learning: applications to pediatric lung field segmentation
Qin et al. Self-supervised single-image 3D face reconstruction method based on attention mechanism and attribute refinement
CN116109655B (en) Image encoder processing method and device and image segmentation method
Sun et al. Silp-autoencoder for face de-occlusion
Li et al. A review of advances in image inpainting research
Molnár et al. Variational autoencoders for 3D data processing
CN114118203A (en) Image feature extraction and matching method and device and electronic equipment
Gao et al. Data-driven image completion for complex objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant