CN114549556A - Image segmentation method, related device, equipment and storage medium - Google Patents

Image segmentation method, related device, equipment and storage medium

Info

Publication number
CN114549556A
Authority
CN
China
Prior art keywords
image
module
feature
target
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210181434.0A
Other languages
Chinese (zh)
Inventor
江铖
庞建业
姚建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210181434.0A priority Critical patent/CN114549556A/en
Publication of CN114549556A publication Critical patent/CN114549556A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an image segmentation method relating to artificial intelligence technology, which comprises the following steps: acquiring original image features of an original image; acquiring a first image feature through an encoder in an image segmentation model, wherein a local vision module comprises a spatial enhancement module, the spatial enhancement module generates a spatial scaling factor based on the input image feature, and the spatial scaling factor is used to correct the spatial feature of the input image feature; acquiring a second image feature through a decoder in the image segmentation model; splicing the second image feature with a third image feature output by a first local vision module in the encoder to obtain a target image feature; and generating a target segmentation image according to the target image feature. The application also provides a related apparatus, a device and a storage medium. The method adjusts the spatial feature of the image feature with the spatial scaling factor, and because the spatial scaling factor is generated adaptively based on the input image feature, the difficulty of model configuration is reduced.

Description

Image segmentation method, related device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to an image segmentation method, a related apparatus, a device, and a storage medium.
Background
With the continuous development of computer technology and Artificial Intelligence (AI) technology, AI has achieved significant results in tasks such as image segmentation and image recognition. With the advent of increasingly efficient network architectures, computer vision and natural language processing are converging, and using Transformers for vision tasks has become a new direction for reducing architectural complexity and exploring scalability and training efficiency.
Visual Transformers capable of solving computer vision problems have been proposed. To enhance the characterization capability of a visual Transformer, existing schemes add additional branches to it; introducing a linear transformation on each branch enhances feature diversity and thereby the feature characterization capability.
However, the inventors have found that existing schemes suffer from at least the following problem: how many branches to add to the visual Transformer, and which linear transformation to use on each, are difficult to decide in advance, and finding the optimal configuration often requires multiple attempts. The model configuration process is therefore complicated, and model configuration efficiency is reduced.
Disclosure of Invention
The embodiment of the application provides an image segmentation method, a related device, equipment and a storage medium. The method adjusts the spatial features of image features with a spatial scaling factor, thereby enhancing the feature characterization capability. Because the spatial scaling factor is generated adaptively from the input image features, the difficulty of model configuration is reduced and the efficiency of model configuration is improved.
In view of the above, an aspect of the present application provides an image segmentation method, including:
acquiring original image characteristics corresponding to an original image;
based on original image characteristics, acquiring first image characteristics through an encoder in an image segmentation model, wherein the encoder comprises a plurality of local visual modules and a plurality of down-sampling layers, each local visual module comprises a space enhancement module, each space enhancement module is used for generating a space scaling factor based on input image characteristics, and the space scaling factor is used for correcting the space characteristics corresponding to the input image characteristics;
acquiring second image characteristics through a decoder in an image segmentation model based on the first image characteristics, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
splicing the second image characteristic and a third image characteristic output by a first local visual module in an encoder to obtain a target image characteristic, wherein the second image characteristic is an image characteristic output by a last local visual module in a decoder;
and generating a target segmentation image according to the characteristics of the target image.
Another aspect of the present application provides an image segmentation method, including:
acquiring original image characteristics corresponding to an original image;
based on original image characteristics, acquiring first image characteristics through an encoder in an image segmentation model, wherein the encoder comprises a plurality of local visual modules and a plurality of down-sampling layers, each local visual module comprises a channel enhancement module, each channel enhancement module is used for generating a channel scaling factor based on the input image characteristics, and each channel scaling factor is used for correcting the channel characteristics corresponding to the input image characteristics;
acquiring second image characteristics through a decoder in an image segmentation model based on the first image characteristics, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
splicing the second image characteristic and a third image characteristic output by a first local visual module in an encoder to obtain a target image characteristic, wherein the second image characteristic is an image characteristic output by a last local visual module in a decoder;
and generating a target segmentation image according to the characteristics of the target image.
Another aspect of the present application provides an image segmentation apparatus, including:
the acquisition module is used for acquiring the original image characteristics corresponding to the original image;
the acquisition module is further used for acquiring a first image feature through an encoder in an image segmentation model based on the original image features, wherein the encoder comprises a plurality of local vision modules and a plurality of down-sampling layers, each local vision module comprises a spatial enhancement module, each spatial enhancement module is used for generating a spatial scaling factor based on the input image features, and the spatial scaling factor is used for correcting the spatial features corresponding to the input image features;
the acquisition module is further used for acquiring second image characteristics through a decoder in the image segmentation model based on the first image characteristics, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
the processing module is used for splicing the second image characteristic and a third image characteristic output by a first local vision module in the encoder to obtain a target image characteristic, wherein the second image characteristic is an image characteristic output by a last local vision module in the decoder;
and the segmentation module is used for generating a target segmentation image according to the characteristics of the target image.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the local vision module further comprises a first normalization layer and an attention module;
the first normalization layer is used for performing normalization processing on the first input features to obtain first normalized features;
the attention module is used for extracting the features of the first normalized features to obtain the spatial features of the first normalized features;
the space enhancement module is used for performing maximum pooling operation and average pooling operation on the first normalized features to obtain a target merging result, wherein the target merging result comprises a maximum pooling result and an average pooling result;
the space enhancement module is also used for carrying out convolution operation on the target combination result to obtain a convolution result;
the space enhancement module is also used for calculating the convolution result by adopting the activation function to obtain a target space scaling factor.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the local vision module further comprises a first normalization layer and an attention module;
the attention module is used for carrying out feature extraction on the first input features to obtain spatial features of the first input features;
the space enhancement module is used for performing maximum pooling operation and average pooling operation on the space features of the first input features to obtain a target merging result, wherein the target merging result comprises a maximum pooling result and an average pooling result;
the space enhancement module is also used for carrying out convolution operation on the target combination result to obtain a convolution result;
the space enhancement module is also used for calculating the convolution result by adopting an activation function to obtain a target space scaling factor;
the first normalization layer is used for normalizing the corrected spatial features, wherein the corrected spatial features are obtained by correcting the spatial features of the first input features by adopting a target spatial scaling factor.
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the local vision module further includes a channel enhancement module, and the channel enhancement module generates a channel scaling factor based on the input image feature, where the channel scaling factor is used to modify a channel feature corresponding to the input image feature.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the local vision module further comprises a second normalization layer and a multilayer perceptron;
the second normalization layer is used for performing normalization processing on the second input features to obtain second normalized features;
the multilayer perceptron is used for carrying out feature extraction on the second normalized feature to obtain a channel feature of the second normalized feature;
the channel enhancement module is used for performing maximum pooling operation on the channel characteristics of the second normalized characteristics to obtain a maximum pooling result, and performing average pooling operation on the channel characteristics of the second normalized characteristics to obtain an average pooling result;
the channel enhancement module is also used for carrying out convolution operation on the maximum pooling result to obtain a first convolution result, and carrying out convolution operation on the average pooling result to obtain a second convolution result;
the channel enhancement module is also used for determining a target convolution result according to the first convolution result and the second convolution result;
the channel enhancement module is also used for calculating the target convolution result by adopting the activation function to obtain a target channel scaling factor.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the local vision module further comprises a second normalization layer and a multilayer perceptron;
the multilayer perceptron is used for carrying out feature extraction on the second input features to obtain channel features of the second input features;
the channel enhancement module is used for performing maximum pooling operation on the channel characteristics of the second input characteristics to obtain a maximum pooling result, and performing average pooling operation on the channel characteristics of the second input characteristics to obtain an average pooling result;
the channel enhancement module is also used for carrying out convolution operation on the maximum pooling result to obtain a first convolution result, and carrying out convolution operation on the average pooling result to obtain a second convolution result;
the channel enhancement module is also used for determining a target convolution result according to the first convolution result and the second convolution result;
the channel enhancement module is also used for calculating a target convolution result by adopting an activation function to obtain a target channel scaling factor;
and the second normalization layer is used for performing normalization processing on the corrected channel characteristics, wherein the corrected channel characteristics are obtained by correcting the channel characteristics of the second input characteristics by adopting the target channel scaling factor.
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the encoder includes M local vision modules and N downsampled layers, the decoder includes (M-1) local vision modules and N upsampled layers, the (M-1) local vision modules included in the decoder are in skip connection with corresponding (M-1) local vision modules in the encoder, where M is an integer greater than 1, and N is an integer greater than or equal to 1;
the image characteristics output by a first local vision module in the encoder are used as the image characteristics input by a first downsampling layer in the encoder;
the image characteristics output by the last downsampling layer in the encoder are used as the image characteristics input by the last local visual module in the encoder;
the image feature output by the last local vision module in the encoder is used as the image feature input by the first up-sampling layer in the decoder;
the image features output by a first up-sampling layer in the decoder are used as the image features input by a first local visual module in the decoder;
the image feature output by the last up-sampling layer in the decoder is used as the image feature input by the last local vision module in the decoder.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring an original image;
carrying out blocking operation on an original image to obtain K image blocks, wherein K is an integer greater than 1;
performing convolution operation on each image block in the K image blocks to obtain K characteristic vectors;
and generating original image features according to the K feature vectors.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
an obtaining module, configured to obtain, based on an original image feature, a first intermediate image feature through a local vision module included in an encoder, where the encoder belongs to an image segmentation model;
acquiring a second intermediate image feature through a down-sampling layer included in an encoder based on the first intermediate image feature;
and acquiring the first image characteristic through a residual module included by the encoder based on the second intermediate image characteristic, wherein the residual module included by the encoder comprises at least one local visual module, or the residual module included by the encoder comprises at least one local visual module and at least one down-sampling layer.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
an obtaining module, configured to obtain, based on the first image feature, a third intermediate image feature through an upsampling layer included in a decoder, where the decoder belongs to an image segmentation model;
acquiring a fourth intermediate image feature by a local vision module included in the decoder based on the third intermediate image feature;
and acquiring the second image characteristics through a residual module included by the decoder based on the fourth intermediate image characteristics, wherein the residual module included by the decoder comprises at least one local visual module, or the residual module included by the decoder comprises at least one local visual module and at least one upsampling layer.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the segmentation module is specifically used for performing convolution operation on the original image to obtain a first image feature to be processed;
performing convolution operation on the target image characteristic to obtain a second image characteristic to be processed;
adding the first image feature to be processed and the second image feature to be processed to obtain a third image feature to be processed;
and performing convolution operation on the third image feature to be processed to obtain a target segmentation image.
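For ease of understanding, the following is a minimal, illustrative sketch of this design in PyTorch-style Python. The class name, channel numbers and kernel sizes are assumptions made for the example and do not limit the application; the sketch assumes the target image feature has already been mapped to the resolution of the original image so that the addition is shape-compatible.

```python
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Fuses the original image with the target image feature to produce the segmentation."""
    def __init__(self, image_channels=3, feature_channels=48, num_classes=2):
        super().__init__()
        self.conv_image = nn.Conv2d(image_channels, feature_channels, kernel_size=3, padding=1)
        self.conv_feature = nn.Conv2d(feature_channels, feature_channels, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(feature_channels, num_classes, kernel_size=1)

    def forward(self, original_image, target_image_feature):
        first = self.conv_image(original_image)            # first image feature to be processed
        second = self.conv_feature(target_image_feature)   # second image feature to be processed
        third = first + second                             # third image feature to be processed
        return self.conv_out(third)                        # target segmentation image (logits)
```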
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the image segmentation apparatus further includes a training module;
the acquisition module is further used for acquiring sample image features corresponding to a sample image, wherein the sample image is an image with a labeled region;
the acquisition module is further used for acquiring a first sample image characteristic through an encoder in the image segmentation model based on the sample image characteristic;
the acquisition module is further used for acquiring second sample image characteristics through a decoder in the image segmentation model based on the first sample image characteristics;
the processing module is also used for splicing the second sample image characteristic and a third sample image characteristic output by a first local vision module in the encoder to obtain a target sample image characteristic;
the segmentation module is further used for generating a target segmentation sample image according to the sample image and the target sample image characteristics;
and the training module is used for updating the model parameters of the image segmentation model according to the target segmentation sample image and the labeled area of the sample image.
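For ease of understanding, a hedged sketch of one training update consistent with this design is given below. The loss function (cross-entropy between the predicted segmentation and the labeled region) and the optimizer interface are assumptions for illustration only; the application does not fix them.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_image, labeled_region):
    """One update of the image segmentation model from a region-annotated sample."""
    optimizer.zero_grad()
    target_sample_segmentation = model(sample_image)    # forward pass through encoder and decoder
    loss = F.cross_entropy(target_sample_segmentation, labeled_region)
    loss.backward()                                     # gradients of the loss w.r.t. model parameters
    optimizer.step()                                    # update the model parameters of the segmentation model
    return loss.item()
```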
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring original medical image features corresponding to an original medical image, wherein the original medical image is a two-dimensional image or a three-dimensional image;
and the segmentation module is specifically used for generating and displaying a target segmentation medical image according to the target image characteristics.
Another aspect of the present application provides an image segmentation apparatus, including:
the acquisition module is used for acquiring the original image characteristics corresponding to the original image;
the acquisition module is further used for acquiring a first image feature through an encoder in the image segmentation model based on the original image features, wherein the encoder comprises a plurality of local vision modules and a plurality of down-sampling layers, each local vision module comprises a channel enhancement module, each channel enhancement module is used for generating a channel scaling factor based on the input image features, and each channel scaling factor is used for correcting the channel features corresponding to the input image features;
the acquisition module is further used for acquiring second image characteristics through a decoder in the image segmentation model based on the first image characteristics, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
the processing module is used for splicing the second image characteristic and a third image characteristic output by a first local vision module in the encoder to obtain a target image characteristic, wherein the second image characteristic is an image characteristic output by a last local vision module in the decoder;
and the segmentation module is used for generating a target segmentation image according to the characteristics of the target image.
In one possible design, in another implementation of another aspect of the embodiment of the present application, the local vision module further includes a spatial enhancement module, and the spatial enhancement module generates a spatial scaling factor based on the input image feature, and the spatial scaling factor is used to modify a spatial feature corresponding to the input image feature.
Another aspect of the present application provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the method of the above aspects when executing the computer program.
Another aspect of the application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the above-described aspects.
In another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiment of the application, an image segmentation method is provided, and first, an original image feature corresponding to an original image is obtained, and then, the original image feature is used as an input of an encoder in an image segmentation model, so as to obtain a first image feature. Next, the first image feature is taken as an input of a decoder in the image segmentation model, thereby obtaining a second image feature. And then, splicing the second image characteristic and the third image characteristic output by the first local vision module in the encoder to obtain the target image characteristic. And finally, combining the original image and the target image characteristics to generate a target segmentation image. Through the method, the spatial enhancement module is introduced into the local visual module, the spatial enhancement module can generate a spatial scaling factor based on the input image characteristics, and the spatial scaling factor can be used for adjusting the spatial characteristics corresponding to the image characteristics, so that the purpose of enhancing the characteristic representation capability is achieved. Therefore, the spatial scaling factor is generated in a self-adaptive mode based on the characteristics of the input image, the configuration difficulty of the image segmentation model is reduced, and the configuration efficiency of the image segmentation model is improved.
Drawings
FIG. 1 is a block diagram of an image segmentation system according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an application scenario of an image segmentation task in an embodiment of the present application;
FIG. 3 is a flowchart illustrating an image segmentation method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a local vision module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of modifying spatial features based on a local vision module according to an embodiment of the present application;
FIG. 6 is another schematic diagram of a local vision module in an embodiment of the present application;
FIG. 7 is another illustration of a local vision module-based spatial feature modification in an embodiment of the present application;
FIG. 8 is another schematic diagram of a local vision module in an embodiment of the present application;
FIG. 9 is a diagram illustrating the modification of channel characteristics based on a local vision module according to an embodiment of the present application;
FIG. 10 is a schematic view of another embodiment of a local vision module of the present application;
FIG. 11 is another illustration of a local vision module based modification of channel characteristics in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an image segmentation model in an embodiment of the present application;
FIG. 13 is a graph showing a comparison of experimental effects in examples of the present application;
FIG. 14 is a schematic flow chart illustrating an image segmentation method according to an embodiment of the present application;
FIG. 15 is another schematic diagram of a local vision module according to an embodiment of the present application;
FIG. 16 is a diagram illustrating an image segmentation apparatus according to an embodiment of the present application;
FIG. 17 is another schematic diagram of an image segmentation apparatus in an embodiment of the present application;
FIG. 18 is a schematic structural diagram of a server in an embodiment of the present application;
fig. 19 is a schematic structural diagram of a terminal device in the embodiment of the present application.
Detailed Description
The embodiment of the application provides an image segmentation method, a related device, equipment and a storage medium. The method adjusts the spatial features of image features with a spatial scaling factor, thereby enhancing the feature characterization capability. Because the spatial scaling factor is generated adaptively from the input image features, the difficulty of model configuration is reduced and the efficiency of model configuration is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Computer Vision (CV) technology based on Artificial Intelligence (AI) and deep learning has made significant progress. Many CV tasks require intelligent segmentation of an image, dividing its pixels into semantically interpretable object classes that correspond to real-world categories, so that the content of the image can be understood and each part analyzed more easily. CV is the science of how to make machines "see": it uses cameras and computers in place of human eyes to recognize and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, CV studies related theories and techniques in an attempt to build AI systems that can acquire information from images or multidimensional data. CV technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, AI is an integrated technique of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence. AI is to study the design principles and implementation methods of various intelligent machines, so that the machine has the functions of perception, reasoning and decision making. The AI technology is a comprehensive subject, and relates to the field of extensive technology, both hardware level technology and software level technology. The AI base technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operating/interactive systems, mechatronics, and the like. The AI software technology mainly includes CV technology, speech processing technology, natural language processing technology, machine learning/deep learning, and the like.
The image segmentation method provided by the application is suitable for the following scenes:
firstly, identifying a target;
a target (e.g., a license plate or a face) is identified from a video by comparing object features extracted from the input image with objects in a database.
Secondly, automatic driving;
autonomous vehicles require perception and understanding of the surrounding environment in order to drive safely. Objects of the relevant category include other vehicles, buildings, pedestrians, and the like. Image segmentation enables the autonomous vehicle to identify which regions in the image are safe to drive.
Thirdly, medical imaging;
clinically relevant information is extracted from the medical image. For example, a radiologist may use machine learning to enhance analysis by segmenting an image into different organs, tissue types, or disease symptoms, thereby reducing the time required to run diagnostic tests.
Fourthly, retail image identification;
image segmentation gives the retailer knowledge of the layout of goods on the shelves: product data is processed in real time to detect whether goods are present on a shelf. If a product is out of stock, the relevant personnel can be notified and a solution can be recommended for the supply chain.
In order to increase the accuracy of image segmentation and improve the efficiency of model configuration in the above scenarios, the present application proposes an image segmentation method, which is applied to the image segmentation system shown in fig. 1. The image segmentation system shown in the figure includes at least one of a server and a terminal device. A client is deployed on the terminal device; the client may run on the terminal device in the form of a browser or as an independent application (APP), and the specific presentation form of the client is not limited herein. The server related to the application may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data and artificial intelligence platforms. The terminal device may be a smart phone, a tablet computer, a notebook computer, a palm computer, a personal computer, a smart television, a smart watch, a medical imaging device, a vehicle-mounted device, a wearable device, and the like, but is not limited thereto. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The numbers of servers and terminal devices are not limited. The scheme provided by the application may be completed independently by the terminal device, independently by the server, or by the terminal device and the server in cooperation, which is not specifically limited.
Illustratively, in one case, the image segmentation system includes a server and a terminal device. The terminal device sends the acquired image to the server, and the server stores the trained image segmentation model locally. Based on this, the server calls the local image segmentation model to process the image, thereby obtaining the image segmentation result.
Illustratively, in another case, the image segmentation system includes a server. The server locally stores the images and the trained image segmentation model. Based on this, the server calls the local image segmentation model to process the local image, thereby obtaining the image segmentation result.
Illustratively, in yet another case, the image segmentation system includes a terminal device. The terminal device locally stores the trained image segmentation model. Based on this, the terminal device calls the local image segmentation model to process the acquired image, thereby obtaining the image segmentation result.
For convenience of understanding, please refer to fig. 2, and fig. 2 is a schematic view of an application scenario of an image segmentation task in an embodiment of the present application, where as shown in the figure, an original image to be segmented is obtained, the original image is used as an input of an image segmentation model, and a target segmentation image is output through the image segmentation model. The target-segmented image includes a background portion (i.e., an area made up of black pixels) and a foreground portion, wherein the foreground portion includes a human-type segmented area (i.e., an area made up of white pixels) and a box-type segmented area (i.e., an area made up of gray pixels).
With reference to fig. 3, the image segmentation method in the present application may be executed by a computer device, where the computer device may be a terminal or a server, and the method includes:
110. acquiring original image characteristics corresponding to an original image;
in one or more embodiments, an original image is acquired, where the original image may be a two-dimensional image or a three-dimensional image. The two-dimensional image may be represented as C × H × W, and the three-dimensional image may be represented as C × H × W × D, where C denotes the number of channels, H denotes the image height, W denotes the image width, and D denotes the image depth.
Specifically, the original image is used as an input of a feature extraction module in the image segmentation model, and the features of the original image are output through the feature extraction module. It is understood that the feature extraction module may obtain the original image features by block embedding (patch embedding) based on a convolution stem (convolution stem) or a patch stem (patch stem).
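For ease of understanding, a minimal sketch of block embedding (patch embedding) with a convolution stem for a two-dimensional input is given below in PyTorch-style Python. The patch size of 4 and the embedding dimension of 96 are illustrative assumptions, not values taken from the application; for a three-dimensional image, a 3D convolution would play the same role.

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels=3, embed_dim=96, patch_size=4):
        super().__init__()
        # A strided convolution splits the image into non-overlapping blocks and
        # projects each block into a feature vector in a single step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, image):            # image: (B, C, H, W)
        return self.proj(image)          # original image features: (B, embed_dim, H/4, W/4)
```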
120. Based on original image characteristics, acquiring first image characteristics through an encoder in an image segmentation model, wherein the encoder comprises a plurality of local visual modules and a plurality of down-sampling layers, each local visual module comprises a space enhancement module, each space enhancement module is used for generating a space scaling factor based on input image characteristics, and the space scaling factor is used for correcting the space characteristics corresponding to the input image characteristics;
in one or more embodiments, the original image features are used as the input of an encoder in the image segmentation model, and the first image feature is output by the encoder. The encoder includes a number of local vision modules and a number of downsampling layers, each downsampling layer being used to perform a reduction process on the image features. Each local vision module includes two parts: a basic Local Vision Transformer (Local Vision-TR) module, and an adaptively scaled enhancement shortcut (ASES). The ASES includes at least a spatial enhancement module.
In particular, the spatial enhancement module included in each of the local vision modules may output a corresponding spatial scaling factor based on the input image characteristics. Based on this, the spatial feature of the input image feature is modified with the spatial scaling factor. Therefore, the purpose of enhancing the spatial characteristics is achieved.
It can be understood that the image segmentation model includes an encoder and a decoder, and the original image features are processed by the encoder and then by the decoder, and finally the image segmentation is realized. Wherein, the encoder can make the image segmentation model understand the content of the image, and as the network layer deepens, the size of the original image features is reduced, and the channels are increased.
It should be noted that the Local Vision-TR module in the present application may use a windowed vision Transformer, such as a Cross-window Transformer (CSWin-Transformer), a Shuffle Transformer, or Pyramid Vision Transformer (PVT) v1, or other forms of Transformer, which is not limited herein.
130. Acquiring second image characteristics through a decoder in an image segmentation model based on the first image characteristics, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
in one or more embodiments, the first image feature is used as an input to a decoder in the image segmentation model, and the second image feature is output by the decoder. The decoder comprises a plurality of local vision modules and a plurality of up-sampling layers, wherein each up-sampling layer is used for amplifying the image. Each Local Vision module includes a Local Vision-TR module and an ASES. The ASES includes at least a spatial enhancement module.
It will be appreciated that the decoder, in conjunction with the encoder's understanding of the image content, recovers the position information of the image, with the first image feature increasing in size and the channel decreasing as the network layer deepens.
140. Splicing the second image characteristic and a third image characteristic output by a first local visual module in an encoder to obtain a target image characteristic, wherein the second image characteristic is an image characteristic output by a last local visual module in a decoder;
in one or more embodiments, the decoder includes a plurality of local vision modules, and the image feature output by the last local vision module is the second image feature. Based on this, the second image feature and the third image feature output by the first local vision module in the encoder can be stitched (concat) to obtain the target image feature. The concat operation fuses the image features at corresponding positions of the up-sampling and down-sampling processes, so that the decoder can acquire more high-resolution information during up-sampling; the detail information in the original image is thus restored more completely, and the segmentation precision is improved.
Specifically, a first local visual module in the encoder and a last local visual module in the decoder are in skip-connection (skip-connection), and feature information on a corresponding scale can be introduced into an up-sampling or deconvolution process by adopting skip-connection, so that multi-scale and multi-level information is provided for subsequent image segmentation, and a finer segmentation effect can be obtained.
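For ease of understanding, a simplified sketch of this U-shaped encoder–decoder wiring with skip connections is given below in PyTorch-style Python. It is only indicative: in the architecture of fig. 12 the numbers of local vision modules in the encoder and decoder differ by one, and the final stitching with the output of the first encoder module happens after the decoder, so the exact pairing of modules here is an assumption.

```python
import torch

def encode_decode(x, encoder_blocks, down_layers, decoder_blocks, up_layers):
    """Simplified U-shaped forward pass: encoder outputs are kept so the decoder
    can fuse features at the corresponding scale through skip connections."""
    skips = []
    for block, down in zip(encoder_blocks, down_layers):
        x = block(x)                 # local vision module at this scale
        skips.append(x)              # keep the feature for the matching decoder stage
        x = down(x)                  # downsampling layer shrinks the feature map
    for block, up, skip in zip(decoder_blocks, up_layers, reversed(skips)):
        x = up(x)                                 # upsampling layer enlarges the feature map
        x = block(torch.cat([x, skip], dim=1))    # stitch (concat) with the encoder feature
    return x                                      # fused feature used to produce the segmentation
```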
150. And generating a target segmentation image according to the characteristics of the target image.
In one or more embodiments, the target image features are mapped to obtain a target segmented image.
Specifically, in one implementation, the mapping may be implemented using transposed convolution, and in another implementation, the mapping may be implemented using interpolation and convolution. Taking the example of mapping implemented by using the transposed convolution, if the original image is a two-dimensional image, the two-dimensional transposed convolution with convolution kernel size (kernel size) of 2 × 2 and 1 × 1 may be used to map the target image feature to the size of the original image. If the original image is a three-dimensional image, the target image features can be mapped to the size of the original image using a three-dimensional transpose convolution with kernel size of 2 × 2 and 1 × 1.
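For ease of understanding, a minimal sketch of the transposed-convolution mapping for a two-dimensional image is given below. The channel counts, input resolution and number of classes are illustrative assumptions, not values from the application.

```python
import torch
import torch.nn as nn

num_classes = 2                                       # illustrative: background + one foreground class
target_image_feature = torch.randn(1, 96, 112, 112)   # assumed channel count and resolution

# The 2 x 2 transposed convolution doubles the spatial size; the 1 x 1 transposed
# convolution then projects the feature to per-class maps at the original size.
mapping = nn.Sequential(
    nn.ConvTranspose2d(96, 48, kernel_size=2, stride=2),
    nn.ConvTranspose2d(48, num_classes, kernel_size=1),
)
target_segmentation_logits = mapping(target_image_feature)   # shape: (1, num_classes, 224, 224)
```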
In an embodiment of the application, a method for image segmentation is provided. Through the method, the spatial enhancement module is introduced into the local visual module, the spatial enhancement module can generate a spatial scaling factor based on the input image characteristics, and the spatial scaling factor can be used for adjusting the spatial characteristics corresponding to the image characteristics, so that the purpose of enhancing the characteristic representation capability is achieved. Therefore, the spatial scaling factor is generated in a self-adaptive manner based on the characteristics of the input image, the configuration difficulty of the image segmentation model is reduced, and the configuration efficiency of the image segmentation model is improved.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiments of the present application, the local vision module further includes a first normalization layer and an attention module;
the first normalization layer is used for performing normalization processing on the first input features to obtain first normalized features;
the attention module is used for extracting the features of the first normalized features to obtain the spatial features of the first normalized features;
the space enhancement module is used for performing maximum pooling operation and average pooling operation on the first normalized features to obtain a target merging result, wherein the target merging result comprises a maximum pooling result and an average pooling result;
the space enhancement module is also used for carrying out convolution operation on the target combination result to obtain a convolution result;
the space enhancement module is also used for calculating the convolution result by adopting the activation function to obtain a target space scaling factor.
In one or more embodiments, a manner of connection of a spatial enhancement module is presented. As can be seen from the foregoing embodiments, the Local Vision module includes a Local Vision-TR module and an ASES, wherein the ASES includes a spatial enhancement module, and the Local Vision-TR module includes a first normalization layer, an attention module, a second normalization layer, and a multi-layer perceptron (MLP).
Specifically, for ease of understanding, please refer to fig. 4, which is a schematic structural diagram of a local vision module in an embodiment of the present application. As shown in the figure, the attention module may employ window-based multi-head self-attention (W-MSA), and the first normalization layer and the second normalization layer both employ Batch Normalization (BN). The first normalization layer and the attention module mine the intrinsic relations between input data, while the second normalization layer and the MLP mine the intrinsic relations between different features of the input data. Residual connections are used for the attention module and the MLP, respectively, to maintain the identity mapping. The spatial enhancement module, attached to the attention branch, learns an adaptive scaling value (i.e., the target spatial scaling factor) from the input features, multiplies it with the output of the attention module, and adds the scaled result to the first input feature, i.e., a scale-then-bias form.
Based on this, referring to fig. 5 in conjunction with the structure shown in fig. 4, fig. 5 is a schematic diagram of the local vision module-based spatial feature modification in the embodiment of the present application, as shown in the figure, assuming that the attention module employs the W-MSA and the first normalization layer employs the BN layer, so that the overall flow of the adaptively scaled spatial enhanced shortcut can be expressed as:
Ẑ_l = BN(Z_l)

Z′_l = E_s(Ẑ_l) ⊙ W-MSA(Ẑ_l) + Z_l

where Z_l denotes the first input feature, Ẑ_l = BN(Z_l) denotes the first normalized feature obtained by normalizing the input feature, W-MSA(Ẑ_l) denotes the spatial features of the first normalized feature learned by the attention module, E_s(Ẑ_l) denotes the target spatial scaling factor (a spatially adaptive scaling factor) learned from the first normalized feature by the spatial enhancement shortcut E_s, and ⊙ denotes element-wise multiplication. The target spatial scaling factor E_s(Ẑ_l) is multiplied with the spatial features W-MSA(Ẑ_l) of the first normalized feature to scale them, and the result is added to the first input feature Z_l to obtain the corrected first input feature Z′_l.

The target spatial scaling factor is calculated as follows:

E_s(Ẑ_l) = σ( Conv_3×3( [ MaxPool(Ẑ_l) ; AvgPool(Ẑ_l) ] ) )

where σ(·) denotes the activation function, MaxPool(Ẑ_l) denotes the maximum pooling result obtained by mapping the first normalized feature with a maximum pooling operation, AvgPool(Ẑ_l) denotes the average pooling result obtained by mapping the first normalized feature with an average pooling operation, and [ · ; · ] denotes merging the maximum pooling result and the average pooling result into the target merging result. A 2D convolution with a kernel size of 3 × 3 is applied to the target merging result to obtain the convolution result, and finally the activation function is applied to the convolution result to obtain the target spatial scaling factor E_s(Ẑ_l).
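For ease of understanding, a minimal PyTorch-style sketch of this spatial enhancement shortcut and the post-norm wiring of figs. 4 and 5 is given below. The class and variable names are illustrative, the attention module and MLP are passed in as shape-preserving modules operating on (B, C, H, W) feature maps, and taking the maximum and average pooling over the channel dimension follows the common spatial-attention formulation; these choices are assumptions and do not limit the application.

```python
import torch
import torch.nn as nn

class SpatialEnhancement(nn.Module):
    """Spatial enhancement shortcut: E_s(x) = sigma(Conv3x3([MaxPool(x); AvgPool(x)]))."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)  # 3 x 3 conv over the merged pooling maps
        self.act = nn.Sigmoid()                                # activation producing the scaling factor

    def forward(self, x):                          # x: (B, C, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)    # maximum pooling result, (B, 1, H, W)
        avg_map = x.mean(dim=1, keepdim=True)      # average pooling result, (B, 1, H, W)
        merged = torch.cat([max_map, avg_map], dim=1)   # target merging result
        return self.act(self.conv(merged))         # target spatial scaling factor

class LocalVisionBlock(nn.Module):
    """Post-norm wiring of Figs. 4/5: Z' = E_s(BN(Z)) * W-MSA(BN(Z)) + Z, then the MLP branch."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)    # first normalization layer (BN assumed)
        self.attn = attn                    # attention module, e.g. W-MSA
        self.ses = SpatialEnhancement()     # spatial enhancement shortcut
        self.norm2 = nn.BatchNorm2d(dim)    # second normalization layer
        self.mlp = mlp                      # multilayer perceptron

    def forward(self, z):
        z_hat = self.norm1(z)                        # first normalized feature
        z = self.ses(z_hat) * self.attn(z_hat) + z   # scale the spatial features, then add the shortcut
        z = self.mlp(self.norm2(z)) + z              # MLP branch with its own residual connection
        return z
```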
It should be noted that, in the present application, the first normalization layer and the second normalization layer may use BN, Layer Normalization (LN), Group Normalization (GN), or other normalization forms, which is not limited herein.
It should be noted that the activation function used in the present application may be a Sigmoid function, a Rectified Linear Unit (ReLU) function, a hyperbolic tangent (tanh) function, an Exponential Linear Unit (ELU) function, or another activation function, which is not limited herein.
It should be noted that the present application adopts convolution and pooling as the main computations in the ASES; the enhancement path may also be implemented with other components, such as convolution, pooling, a fully connected layer, a nonlinear activation function, normalization, or SoftMax, which is not limited herein.
Secondly, the embodiment of the application provides a connection manner of the spatial enhancement module. In this manner, the spatial enhancement module is added to the backbone of the Local Vision-TR module, and the decoupled spatial features are adaptively corrected and enhanced, realizing adaptive scaling of the spatial representation. Therefore, compared with the original Local Vision-TR, the effect is significantly improved without introducing additional complex computation. Meanwhile, because the module structure is simple, the overfitting and degradation problems of the Transformer model are addressed.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiments of the present application, the local visual module further includes a first normalization layer and an attention module;
the attention module is used for carrying out feature extraction on the first input features to obtain spatial features of the first input features;
the space enhancement module is used for performing maximum pooling operation and average pooling operation on the space features of the first input features to obtain a target merging result, wherein the target merging result comprises a maximum pooling result and an average pooling result;
the space enhancement module is also used for carrying out convolution operation on the target combination result to obtain a convolution result;
the space enhancement module is also used for calculating the convolution result by adopting an activation function to obtain a target space scaling factor;
the first normalization layer is used for performing normalization processing on the corrected spatial features, wherein the corrected spatial features are obtained by correcting the spatial features of the first input features by adopting a target spatial scaling factor.
In one or more embodiments, another manner of connection of the spatial enhancement module is described. As can be seen from the foregoing embodiments, the Local Vision module includes a Local Vision-TR module and an ASES, wherein the ASES includes a spatial enhancement module, and the Local Vision-TR module includes a first normalization layer, an attention module, a second normalization layer, and an MLP.
Specifically, for ease of understanding, please refer to fig. 6, which is another structural diagram of the local vision module in the embodiment of the present application. As shown in the figure, the attention module may employ W-MSA, and the first normalization layer and the second normalization layer employ BN. Residual connections are used for the attention module and the MLP, respectively, to maintain the identity mapping. The spatial enhancement module, attached to the attention branch, learns an adaptive scaling value (i.e., the target spatial scaling factor) from the input features, multiplies it with the output of the attention module, normalizes the corrected spatial features through the first normalization layer, and finally adds the normalized features to the first input feature through the residual connection.
Based on this, referring to fig. 7 in conjunction with the structure shown in fig. 6, fig. 7 is another schematic diagram of spatial feature modification based on the local vision module in the embodiment of the present application. As shown, assume the attention module employs W-MSA and the first normalization layer employs a BN layer. The attention module performs feature extraction on the first input feature to obtain its spatial features. A maximum pooling operation and an average pooling operation are then performed on these spatial features to obtain a maximum pooling result and an average pooling result, which are merged into the target merging result. A 2D convolution with a kernel size of 3 × 3 may then be applied to the target merging result to obtain a convolution result. Finally, an activation function is applied to the convolution result to obtain the target spatial scaling factor.
The target spatial scaling factor is multiplied by the spatial features of the first input feature output by the attention module to obtain the corrected spatial features. The corrected spatial features are then normalized, and the normalized features are finally added to the first input feature.
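For illustration only, the following is a minimal sketch of the spatial enhancement shortcut described above, assuming a PyTorch implementation and 2D feature maps of shape (N, C, H, W); the windowed attention is stood in for by a module passed in by the caller, and the class and variable names are illustrative rather than identifiers used in the present application.

```python
import torch
import torch.nn as nn


class SpatialEnhancement(nn.Module):
    """Learns a per-position scaling map from pooled spatial statistics."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # 2-channel input: stacked max-pooled and average-pooled maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        max_map = s.amax(dim=1, keepdim=True)          # max pooling over channels
        avg_map = s.mean(dim=1, keepdim=True)          # average pooling over channels
        merged = torch.cat([max_map, avg_map], dim=1)  # target merging result
        return torch.sigmoid(self.conv(merged))        # target spatial scaling factor


class LocalVisionBlockSpatial(nn.Module):
    """Attention branch with the spatial shortcut applied before normalization."""

    def __init__(self, channels: int, attention: nn.Module):
        super().__init__()
        self.attention = attention            # stand-in for a windowed self-attention
        self.enhance = SpatialEnhancement()
        self.norm = nn.BatchNorm2d(channels)  # first normalization layer (BN assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.attention(x)                 # spatial features of the first input
        factor = self.enhance(s)              # adaptive scaling value
        corrected = s * factor                # corrected spatial features
        return self.norm(corrected) + x       # normalize, then add the residual
```

When exercising the sketch, an ordinary convolution such as nn.Conv2d(64, 64, 3, padding=1) can stand in for the windowed attention module.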
It should be noted that, in the present application, the first normalization layer and the second normalization layer may adopt BN, or, LN, or another normalization form, which is not limited herein.
It should be noted that the activation function used in the present application may be a Sigmoid function, or a ReLU function, or a tanh function, or an Elu function, or other activation function forms, which is not limited herein.
It should be noted that the present application adopts convolution and pooling as the main calculation in the ASES, and it is also possible to implement the enhancement path using other components instead of convolution and pooling, which is not limited herein.
Secondly, in the embodiment of the present application, another connection method of the spatial enhancement module is provided. In this way, the spatial enhancement module is added to the backbone of the Local Vision-TR module to adaptively modify and enhance the decoupled spatial features, realizing adaptive scaling of the spatial representation. Compared with the original Local Vision-TR, this yields a clear improvement without introducing additional complex computation. Meanwhile, because the module structure is simple, the overfitting and degradation problems of the Transformer model are alleviated.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiments of the present application, the local vision module further includes a channel enhancement module, where the channel enhancement module generates a channel scaling factor based on the input image feature, and the channel scaling factor is used to modify the channel feature corresponding to the input image feature.
In one or more embodiments, a manner of adding channel enhancements is presented. As can be seen from the foregoing embodiments, each Local Vision module includes a Local Vision-TR module and an ASES. The ASES may include not only a spatial enhancement module but also a channel enhancement module.
In particular, the channel enhancement module included in each local vision module may output a corresponding channel scaling factor based on the input image features. Based on the channel scaling factor, the channel characteristics corresponding to the input image characteristics are corrected. Thereby, the purpose of enhancing the channel characteristics is achieved.
Secondly, in the embodiment of the present application, a way of adding channel enhancement is provided. By the method, the spatial enhancement shortcut and the channel enhancement shortcut are added to the Local Vision-TR main network at the same time, so that the performance of the Local Vision-TR is improved, and a better image segmentation effect is achieved.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided in the embodiments of the present application, the local visual module further includes a second normalization layer and a multilayer perceptron;
the second normalization layer is used for performing normalization processing on the second input features to obtain second normalized features;
the multilayer perceptron is used for carrying out feature extraction on the second normalized feature to obtain a channel feature of the second normalized feature;
the channel enhancement module is used for performing maximum pooling operation on the channel characteristics of the second normalized characteristics to obtain a maximum pooling result, and performing average pooling operation on the channel characteristics of the second normalized characteristics to obtain an average pooling result;
the channel enhancement module is also used for carrying out convolution operation on the maximum pooling result to obtain a first convolution result, and carrying out convolution operation on the average pooling result to obtain a second convolution result;
the channel enhancement module is also used for determining a target convolution result according to the first convolution result and the second convolution result;
the channel enhancement module is also used for calculating the target convolution result by adopting the activation function to obtain a target channel scaling factor.
In one or more embodiments, a method of connecting channel enhancement modules is described. As can be seen from the foregoing embodiments, the Local Vision module includes a Local Vision-TR module and an ASES, wherein the ASES includes a spatial enhancement module and a channel enhancement module, and the Local Vision-TR module includes a first normalization layer, an attention module, a second normalization layer, and an MLP.
Specifically, for ease of understanding, please refer to fig. 8, where fig. 8 is another structural diagram of the local vision module in the embodiment of the present application. As shown in the figure, the attention module may employ a W-MSA, and the first normalization layer and the second normalization layer both employ BN. The first normalization layer and the attention module mine the intrinsic relations within the input data, while the second normalization layer and the MLP mine the intrinsic relations between different features of the input data. Identity mappings are maintained by residual connections around the attention module and the MLP, respectively. The spatial enhancement module acts on the attention module: it learns an adaptive scaling value (i.e., the target spatial scaling factor) from the input features, multiplies it by the output of the attention module, and adds the product to the residual connection of the attention module. The channel enhancement module acts on the MLP: it learns an adaptive scaling value (i.e., the target channel scaling factor) from the input features, multiplies it by the output of the MLP, and adds the product to the second input feature. That is, it follows a scale-first, then add-bias form.
Based on this, referring to fig. 9 in conjunction with the structure shown in fig. 8, fig. 9 is a schematic diagram of channel feature modification based on local visual module in the embodiment of the present application, as shown in the figure, it is assumed that the second normalization layer employs a BN layer, and thus the overall flow of the adaptive scaling channel enhanced shortcut can be represented as follows:
Y_l^n = BN(Y_l);
Y_{l+1} = E_c(Y_l^n) · MLP(Y_l^n) + Y_l;
wherein Y_l represents the second input feature, Y_l^n represents the second normalized feature, and BN(·) represents normalization of the input feature. E_c(Y_l^n) denotes the channel adaptive scaling factor (i.e., the target channel scaling factor) learned from the second normalized feature by the channel enhancement shortcut E_c, and MLP(Y_l^n) denotes the channel features of the second normalized feature learned by the MLP. The target channel scaling factor E_c(Y_l^n) is multiplied element-wise with the channel features MLP(Y_l^n) for scaling correction, and the result is added to the second input feature Y_l, finally yielding the corrected second input feature Y_{l+1}.
The target channel scaling factor is calculated as follows:
E_c(Y_l^n) = σ(PWConv2D(MaxPool(Y_l^n)) + PWConv2D(AvgPool(Y_l^n)));
where σ(·) represents an activation function. MaxPool(Y_l^n) represents the maximum pooling result obtained after mapping the second normalized feature with a maximum pooling operation, and AvgPool(Y_l^n) represents the average pooling result obtained after mapping the second normalized feature with an average pooling operation. PWConv2D(MaxPool(Y_l^n)) represents the first convolution result obtained by applying a 2D point-wise convolution (PWConv2D) with a kernel size of 1 to the maximum pooling result, and PWConv2D(AvgPool(Y_l^n)) represents the second convolution result obtained by applying PWConv2D with a kernel size of 1 to the average pooling result. The first convolution result and the second convolution result are added directly to obtain the target convolution result, i.e., PWConv2D(AvgPool(Y_l^n)) + PWConv2D(MaxPool(Y_l^n)). Finally, the activation function is applied to the target convolution result to obtain the target channel scaling factor E_c(Y_l^n).
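For illustration only, a minimal sketch of the two formulas above is given below, assuming a PyTorch implementation and 2D feature maps of shape (N, C, H, W); the use of two separate point-wise convolutions and the names ChannelEnhancement and channel_enhanced_shortcut are assumptions made for the sketch, not identifiers of the present application.

```python
import torch
import torch.nn as nn


class ChannelEnhancement(nn.Module):
    """E_c(Y) = sigmoid(PWConv2D(MaxPool(Y)) + PWConv2D(AvgPool(Y)))."""

    def __init__(self, channels: int):
        super().__init__()
        self.pw_max = nn.Conv2d(channels, channels, kernel_size=1)  # PWConv2D, kernel size 1
        self.pw_avg = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        max_pool = y.amax(dim=(2, 3), keepdim=True)              # max pooling over positions
        avg_pool = y.mean(dim=(2, 3), keepdim=True)              # average pooling over positions
        target = self.pw_max(max_pool) + self.pw_avg(avg_pool)   # target convolution result
        return torch.sigmoid(target)                             # target channel scaling factor


def channel_enhanced_shortcut(y_l, norm, mlp, enhance):
    """Y_l^n = BN(Y_l);  Y_{l+1} = E_c(Y_l^n) * MLP(Y_l^n) + Y_l."""
    y_n = norm(y_l)                       # second normalized feature
    return enhance(y_n) * mlp(y_n) + y_l  # scale the MLP output, add the residual
```

Here norm, mlp and enhance are callables supplied by the caller (for example nn.BatchNorm2d, a channel-preserving MLP, and a ChannelEnhancement instance).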
It should be noted that, in the present application, the first normalization layer and the second normalization layer may adopt BN, or LN, or another normalization form, and are not limited herein.
It should be noted that the activation function used in the present application may be a Sigmoid function, or a ReLU function, or a tanh function, or an Elu function, or other activation function forms, which is not limited herein.
It should be noted that the present application adopts convolution and pooling as the main calculation in the ASES, and it is also possible to implement the enhancement path using other components instead of convolution and pooling, which is not limited herein.
Thirdly, in the embodiment of the present application, a connection mode of the channel enhancement module is provided. In this way, the channel enhancement module is added to the backbone of the Local Vision-TR module to adaptively correct and enhance the decoupled channel features, realizing adaptive scaling of the channel representation. Compared with the original Local Vision-TR, this yields a clear improvement without introducing additional complex computation. Meanwhile, because the module structure is simple, the overfitting and degradation problems of the Transformer model are alleviated.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiments of the present application, the local visual module further includes a second normalization layer and a multilayer perceptron;
the multilayer perceptron is used for carrying out feature extraction on the second input features to obtain channel features of the second input features;
the channel enhancement module is used for performing maximum pooling operation on the channel characteristics of the second input characteristics to obtain a maximum pooling result and performing average pooling operation on the channel characteristics of the second input characteristics to obtain an average pooling result;
the channel enhancement module is also used for carrying out convolution operation on the maximum pooling result to obtain a first convolution result, and carrying out convolution operation on the average pooling result to obtain a second convolution result;
the channel enhancement module is also used for determining a target convolution result according to the first convolution result and the second convolution result;
the channel enhancement module is also used for calculating a target convolution result by adopting an activation function to obtain a target channel scaling factor;
and the second normalization layer is used for performing normalization processing on the corrected channel characteristics, wherein the corrected channel characteristics are obtained by correcting the channel characteristics of the second input characteristics by adopting the target channel scaling factor.
In one or more embodiments, another way of connecting channel enhancement modules is described. As can be seen from the foregoing embodiments, the Local Vision module includes a Local Vision-TR module and an ASES, wherein the ASES includes a spatial enhancement module and a channel enhancement module, and the Local Vision-TR module includes a first normalization layer, an attention module, a second normalization layer, and an MLP.
Specifically, referring to fig. 10 for ease of understanding, fig. 10 is another structural diagram of a local vision module in an embodiment of the present application. As shown in the figure, the attention module may employ a W-MSA, and the first normalization layer and the second normalization layer both employ BN. Identity mappings are maintained by residual connections around the attention module and the MLP, respectively. The spatial enhancement module acts on the attention module: it learns an adaptive scaling value (i.e., the target spatial scaling factor) from the input features, multiplies it by the output of the attention module, normalizes the corrected features, and finally adds the normalized features to the residual connection of the attention module. The channel enhancement module acts on the MLP: it learns an adaptive scaling value (i.e., the target channel scaling factor) from the input features, multiplies it by the output of the MLP, normalizes the corrected features, and adds the normalized features to the second input feature.
Based on this, referring to fig. 11 in conjunction with the structure shown in fig. 10, fig. 11 is another schematic diagram of channel feature modification based on the local vision module in the embodiment of the present application. As shown, assume the second normalization layer employs a BN layer. The MLP performs feature extraction on the second input feature to obtain its channel features. A maximum pooling operation and an average pooling operation are then performed on these channel features to obtain a maximum pooling result and an average pooling result. The maximum pooling result is convolved with PWConv2D to obtain a first convolution result, and the average pooling result is convolved with PWConv2D to obtain a second convolution result. The first convolution result and the second convolution result are added directly to obtain the target convolution result, and an activation function is finally applied to the target convolution result to obtain the target channel scaling factor.
The target channel scaling factor is multiplied by the channel features of the second input feature output by the MLP to obtain the corrected channel features. The corrected channel features are then normalized, and the normalized features are finally added to the residual connection of the MLP.
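For illustration only, a short sketch of this variant is given below, written against the same kind of MLP, channel enhancement and normalization callables as the earlier sketch; the function name is illustrative, and the placement of the scaling before the normalization follows the description above.

```python
def channel_enhanced_shortcut_post_norm(y_l, mlp, enhance, norm):
    """Scale the MLP output, normalize the corrected channel features, add the residual."""
    c = mlp(y_l)                  # channel features of the second input feature
    corrected = enhance(c) * c    # corrected channel features (factor learned from the MLP output)
    return norm(corrected) + y_l  # second normalization layer, then the residual connection
```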
It should be noted that, in the present application, the first normalization layer and the second normalization layer may adopt BN, or LN, or another normalization form, and are not limited herein.
It should be noted that the activation function used in the present application may be a Sigmoid function, or a ReLU function, or a tanh function, or an Elu function, or other activation function forms, which is not limited herein.
It should be noted that the present application adopts convolution and pooling as the main calculation in the ASES, and it is also possible to implement the enhancement path using other components instead of convolution and pooling, which is not limited herein.
In the embodiment of the present application, another connection mode of the channel enhancement module is provided. In this way, the channel enhancement module is added to the backbone of the Local Vision-TR module to adaptively correct and enhance the decoupled channel features, realizing adaptive scaling of the channel representation. Compared with the original Local Vision-TR, this yields a clear improvement without introducing additional complex computation. Meanwhile, because the module structure is simple, the overfitting and degradation problems of the Transformer model are alleviated.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, the encoder includes M local visual modules and N downsampled layers, the decoder includes (M-1) local visual modules and N upsampled layers, and the (M-1) local visual modules included in the decoder are in skip connection with the corresponding (M-1) local visual modules in the encoder, where M is an integer greater than 1, and N is an integer greater than or equal to 1;
the image characteristics output by a first local vision module in the encoder are used as the image characteristics input by a first downsampling layer in the encoder;
the image characteristics output by the last downsampling layer in the encoder are used as the image characteristics input by the last local visual module in the encoder;
the image feature output by the last local vision module in the encoder is used as the image feature input by the first up-sampling layer in the decoder;
the image features output by a first up-sampling layer in the decoder are used as the image features input by a first local visual module in the decoder;
the image feature output by the last up-sampling layer in the decoder is used as the image feature input by the last local vision module in the decoder.
In one or more embodiments, an overall structure of an image segmentation model is presented. As can be seen from the foregoing embodiments, the image segmentation model includes an encoder and a decoder, wherein the encoder includes a number of local visual modules and a number of down-sampling layers, and the decoder includes a number of local visual modules and a number of up-sampling layers. Wherein each Local Vision module comprises a Local Vision-TR module and an ASES.
Specifically, for the convenience of understanding, please refer to fig. 12, and fig. 12 is a schematic structural diagram of an image segmentation model in an embodiment of the present application, and as shown in the figure, it is assumed that an encoder includes 14 local vision modules and 3 downsampling layers, a decoder includes 13 local vision modules and 3 upsampling layers, and the 13 local vision modules included in the decoder are in skip connection with corresponding local vision modules in the encoder.
Based on this, an original image of size Ci × H × W × D is input, features are preliminarily embedded using patch embedding, and Local Vision-TR learning follows. Each stage includes a Local Vision-TR and an ASES. The image feature output by the first local vision module in the encoder is used as the image feature input to the first downsampling layer in the encoder (i.e., the third image feature), and the image feature output by the last downsampling layer in the encoder is used as the image feature input to the last local vision module in the encoder. Meanwhile, the third image feature output by the first local vision module in the encoder is spliced with the second image feature output by the last local vision module in the decoder to obtain the target image feature. When the target projection is performed, a three-dimensional transposed convolution (stride 2) with a kernel size of 2 × 2 and a 1 × 1 convolution are used to map the target image feature back to the original size, so that an image feature of C'o × H × W × D is obtained.
It will be appreciated that a three-dimensional transposed convolution with a kernel size of 2 × 2 can be used for upsampling, and that the skip connections follow the skip-connection pattern of a U-shaped network (U-Net).
It should be noted that the model structure shown in fig. 12 is only an illustration, and in practical applications, the structure and parameters may be adjusted according to specific situations, which is not limited herein.
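For illustration only, the following is a schematic sketch of the encoder-decoder wiring described above, assuming a PyTorch implementation and three-dimensional inputs of shape (N, C, D, H, W); the stage counts, channel widths and the stand-in LocalVisionStage are illustrative, the skip fusion is shown as addition for brevity (U-Net-style concatenation is equally possible), and the final concatenation with the first encoder stage plus the target projection are omitted.

```python
import torch
import torch.nn as nn


class LocalVisionStage(nn.Module):
    """Stand-in for a stack of Local Vision-TR + ASES blocks at one resolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.body(x)


class PyramidSegmenter(nn.Module):
    """Encoder with downsampling layers, decoder with upsampling layers and skips."""

    def __init__(self, channels: int = 32, depth: int = 3):
        super().__init__()
        widths = [channels * 2 ** i for i in range(depth + 1)]  # C, 2C, 4C, 8C
        self.enc_stages = nn.ModuleList(LocalVisionStage(w) for w in widths)
        self.down = nn.ModuleList(
            nn.Conv3d(widths[i], widths[i + 1], kernel_size=2, stride=2)
            for i in range(depth)
        )
        self.up = nn.ModuleList(
            nn.ConvTranspose3d(widths[i + 1], widths[i], kernel_size=2, stride=2)
            for i in reversed(range(depth))
        )
        self.dec_stages = nn.ModuleList(
            LocalVisionStage(widths[i]) for i in reversed(range(depth))
        )

    def forward(self, x):
        skips = []
        for stage, down in zip(self.enc_stages[:-1], self.down):
            x = stage(x)       # local vision stage output feeds the downsampling layer
            skips.append(x)    # kept for the skip connection at this resolution
            x = down(x)
        x = self.enc_stages[-1](x)  # deepest encoder stage (the first image feature)
        for up, stage, skip in zip(self.up, self.dec_stages, reversed(skips)):
            x = stage(up(x) + skip)  # upsample, fuse the encoder feature, refine
        return x  # finest-scale decoder output (the second image feature)
```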
Secondly, in the embodiment of the present application, an overall structure of an image segmentation model is provided. By adopting the mode, the image segmentation is carried out by adopting the image segmentation model with the pyramid structure, so that the learning capability and the feature expression capability of the visual Transformer can be enhanced, and the problem of degradation of the visual Transformer model can be prevented.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the obtaining of the original image feature corresponding to the original image may specifically include:
acquiring an original image;
carrying out blocking operation on an original image to obtain K image blocks, wherein K is an integer greater than 1;
performing convolution operation on each image block in the K image blocks to obtain K characteristic vectors;
and generating original image features according to the K feature vectors.
In one or more embodiments, a manner of generating original image features based on a convolutional stem is presented. As can be seen from the foregoing embodiments, the convolutional stem can output the original image features through patch embedding. The original image features can be represented as a feature matrix that includes K feature vectors, each corresponding to one image block.
Specifically, the original image may be divided into K (e.g., p × p) image blocks, and each image block is then mapped into a d-dimensional feature vector. Combining the K d-dimensional feature vectors gives a K × d-dimensional original image feature. Assuming the original image has a size of 224 × 224 and each image block has a size of 16 × 16, the number of patches is 196 (i.e., 14 × 14), and the image-block mapping is equivalent to a convolution with a 16 × 16 kernel and a stride of 16.
It is understood that the convolutional stem instead downsamples to 14 × 14 with 3 × 3 convolutions. Illustratively, four 3 × 3 convolutions with a stride of 2 and one 1 × 1 convolution with a stride of 1 may be used.
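For illustration only, the following sketch contrasts the two embedding options for a 2D 224 × 224 input, assuming a PyTorch implementation; the channel widths are illustrative.

```python
import torch
import torch.nn as nn

# Option 1: a single patch-embedding convolution (16 x 16 kernel, stride 16).
patchify = nn.Conv2d(3, 768, kernel_size=16, stride=16)

# Option 2: a convolutional stem, four 3 x 3 convolutions with stride 2
# followed by one 1 x 1 convolution (224 -> 112 -> 56 -> 28 -> 14).
stem = nn.Sequential(
    nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(48, 96, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(96, 192, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(192, 384, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 768, 1, stride=1),
)

x = torch.randn(1, 3, 224, 224)
print(patchify(x).shape)  # torch.Size([1, 768, 14, 14]) -> 196 patch embeddings
print(stem(x).shape)      # torch.Size([1, 768, 14, 14])
```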
Secondly, in the embodiment of the present application, a manner of generating original image features based on a convolutional stem is provided. In this way, the convolutional stem converges faster and is more stable with respect to the learning rate and weight decay, so that the model has better robustness.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the original image feature, obtaining the first image feature by an encoder in the image segmentation model may specifically include:
based on the original image characteristics, acquiring first intermediate image characteristics through a local vision module included by an encoder, wherein the encoder belongs to an image segmentation model;
acquiring a second intermediate image feature through a down-sampling layer included in an encoder based on the first intermediate image feature;
and acquiring the first image characteristic through a residual module included by the encoder based on the second intermediate image characteristic, wherein the residual module included by the encoder comprises at least one local visual module, or the residual module included by the encoder comprises at least one local visual module and at least one down-sampling layer.
In one or more embodiments, a manner of extracting first image features based on an encoder is presented. As can be seen from the foregoing embodiments, the image segmentation model includes an encoder and a decoder, wherein the encoder includes several Local Vision modules and several downsampled layers, and each Local Vision module includes a Local Vision-TR module and an ASES.
Specifically, for ease of understanding, referring to fig. 12 again, as shown in the figure, after the original image feature subjected to patch embedding is obtained, the original image feature is input to a local vision module in the encoder, and a first intermediate image feature is output by the local vision module, where the original image feature and the first intermediate image feature are both denoted as C × H/4 × W/4 × D/4. Based on this, the first intermediate image feature is input to one down-sampling layer in the encoder, from which a second intermediate image feature is output, where the second intermediate image feature is represented as 2C × H/8 × W/8 × D/8.
It is to be appreciated that the second intermediate image feature can be processed with the remaining modules (e.g., the at least one local vision module, or the at least one local vision module and the at least one downsampling layer) until the first image feature is obtained.
Secondly, in the embodiment of the present application, a manner of extracting the first image feature based on the encoder is provided. Through the method, a specific implementation mode is provided for the workflow of the image segmentation model, and therefore feasibility and operability of the scheme are improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the first image feature, obtaining the second image feature by a decoder in the image segmentation model may specifically include:
acquiring a third intermediate image characteristic through an up-sampling layer included by a decoder based on the first image characteristic, wherein the decoder belongs to an image segmentation model;
acquiring a fourth intermediate image feature by a local vision module included in the decoder based on the third intermediate image feature;
and acquiring the second image characteristics through a residual module included by the decoder based on the fourth intermediate image characteristics, wherein the residual module included by the decoder comprises at least one local visual module, or the residual module included by the decoder comprises at least one local visual module and at least one upsampling layer.
In one or more embodiments, a manner of extracting second image features based on a decoder is presented. As can be seen from the foregoing embodiments, the image segmentation model includes an encoder and a decoder, wherein the decoder includes several Local Vision modules and several upsampled layers, and each Local Vision module includes a Local Vision-TR module and an ASES.
Specifically, for ease of understanding, referring again to fig. 12, as shown, after the encoded first image feature is obtained, it is input to an upsampling layer in the decoder, and a third intermediate image feature is output by the upsampling layer, wherein the third intermediate image feature is represented as 4C × H/16 × W/16 × D/16. Based on this, the third intermediate image feature is input to a local vision module in the decoder, from which a fourth intermediate image feature is output, where the fourth intermediate image feature is represented as 4C × H/16 × W/16 × D/16.
It is to be appreciated that the fourth intermediate image feature can be processed with the remaining modules (e.g., the at least one local vision module, or the at least one local vision module and the at least one upsampling layer) until the second image feature is obtained.
Secondly, in the embodiment of the present application, a manner of extracting the second image feature based on a decoder is provided. Through the method, a specific implementation mode is provided for the workflow of the image segmentation model, and therefore feasibility and operability of the scheme are improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the generating a target segmented image according to the target image feature may specifically include:
performing convolution operation on an original image to obtain a first image feature to be processed;
performing convolution operation on the target image characteristic to obtain a second image characteristic to be processed;
adding the first image feature to be processed and the second image feature to be processed to obtain a third image feature to be processed;
and performing convolution operation on the third image feature to be processed to obtain a target segmentation image.
In one or more embodiments, a manner of generating a target segmented image based on residual joining is presented. As can be seen from the foregoing embodiments, the image segmentation model may also adopt a residual connection structure, so as to obtain the target segmentation image.
Specifically, for ease of understanding, please refer to fig. 12 again, as shown in the figure, the original image is taken as a three-dimensional image as an example, and the original image is denoted as Ci × H × W × D. And performing convolution operation on the original image by adopting one or more cascaded convolution layers to obtain a first image feature to be processed, wherein the first image feature to be processed is expressed as C' i multiplied by H multiplied by W multiplied by D. And after the target image features are subjected to target mapping, obtaining second image features to be processed, wherein the second image features to be processed are expressed as C' o multiplied by H multiplied by W multiplied by D. Based on this, the first to-be-processed image feature and the second to-be-processed image feature are subjected to addition processing, so that a third to-be-processed image feature can be obtained, and the third to-be-processed image feature is expressed as (C 'i + C' o) × H × W × D. And finally, obtaining a target segmentation image through one or more convolution layers.
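For illustration only, a short sketch of this residual output head is given below, assuming a PyTorch implementation with volumes laid out as (N, C, D, H, W); the channel fusion is written as concatenation to match the (C'i + C'o) channel count given above, and all widths and names are illustrative.

```python
import torch
import torch.nn as nn


class ResidualHead(nn.Module):
    def __init__(self, in_ch: int, feat_ch: int, num_classes: int, width: int = 16):
        super().__init__()
        self.image_branch = nn.Conv3d(in_ch, width, 3, padding=1)      # C'i channels
        self.feature_branch = nn.Conv3d(feat_ch, width, 3, padding=1)  # C'o channels
        self.head = nn.Conv3d(2 * width, num_classes, 1)               # final convolution

    def forward(self, image, mapped_features):
        a = self.image_branch(image)               # first to-be-processed image feature
        b = self.feature_branch(mapped_features)   # second to-be-processed image feature
        fused = torch.cat([a, b], dim=1)           # (C'i + C'o) channels at full resolution
        return self.head(fused)                    # target segmented image (logits)
```

Here mapped_features is assumed to be the target image feature after the target projection, i.e. already at the original spatial size.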
Secondly, in the embodiment of the present application, a method for generating a target segmented image based on residual error connection is provided. By adopting the mode and adopting residual connection for calculation, the problem of gradient dispersion can be solved on one hand, and the problem of network degradation can be solved on the other hand. Therefore, the performance of the model is improved.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiments of the present application may further include:
acquiring sample image characteristics corresponding to a sample image, wherein the sample image is an image subjected to region labeling;
acquiring a first sample image characteristic through an encoder in an image segmentation model based on the sample image characteristic;
acquiring a second sample image characteristic through a decoder in the image segmentation model based on the first sample image characteristic;
splicing the second sample image characteristic and a third sample image characteristic output by a first local vision module in an encoder to obtain a target sample image characteristic;
generating a target segmentation sample image according to the sample image and the target sample image characteristics;
and updating the model parameters of the image segmentation model according to the target segmentation sample image and the labeling area of the sample image.
In one or more embodiments, a manner of training an image segmentation model is presented. As can be seen from the foregoing embodiments, in the model training stage, similarly, the sample image is used as an input of the feature extraction module in the image segmentation model, and the feature of the sample image is output by the feature extraction module. It can be understood that the sample image is an image subjected to region labeling, that is, each pixel point in the sample image has a corresponding labeling category.
Specifically, the sample image feature is used as the input of the encoder in the image segmentation model, and the first sample image feature is output by the encoder. The first sample image feature is then used as the input of the decoder in the image segmentation model, and the second sample image feature is output by the decoder. Then, the second sample image feature and the third sample image feature output by the first local vision module in the encoder may be concatenated (concat) to obtain the target sample image feature. Finally, the target sample image feature can be mapped to obtain the target segmentation sample image. Based on this, a loss function is used to calculate the loss value between the prediction and the ground truth according to the target segmentation sample image and the labeled region of the sample image, and the model parameters of the image segmentation model are updated with this loss value. When the model training condition is met, the trained image segmentation model is obtained.
It is understood that the loss function used in the present application may be a cross-entropy loss function, or a cross-over-unity (IoU) loss function, or a focal point (focal) loss function, or other loss functions, which are not limited herein.
It should be noted that, in one case, a budget-based (exhaustion-type) criterion may be used to determine whether the model training condition is satisfied; for example, an iteration threshold is set, and the condition is satisfied when the threshold is reached. In another case, an observation-type criterion may be employed; for example, the model training condition is satisfied when the loss has converged.
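For illustration only, a sketch of such a training loop is given below, assuming a PyTorch model that outputs per-class logits; the cross-entropy loss and the iteration-budget stopping rule are only two of the options allowed above, and the function name is illustrative.

```python
import torch
import torch.nn as nn


def train(model, loader, max_iters=10_000, lr=1e-4, weight_decay=1e-5):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    step = 0
    while step < max_iters:                    # exhaustion-type stopping criterion
        for images, labels in loader:          # labels: per-voxel class indices
            logits = model(images.to(device))  # target segmentation sample image (logits)
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()                    # update the model parameters
            optimizer.step()
            step += 1
            if step >= max_iters:
                break
    return model
```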
Secondly, in the embodiment of the present application, a method for training an image segmentation model is provided. By the mode, model parameters of the image segmentation model can be trained, the end-to-end training effect is achieved, and the model parameters are learned by combining specific tasks, so that the robustness of the model is improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the obtaining of the original image feature corresponding to the original image may specifically include:
acquiring original medical image characteristics corresponding to an original medical image, wherein the original medical image is a two-dimensional image or a three-dimensional image;
generating a target segmentation image according to the target image features, which may specifically include:
and generating and displaying the target segmentation medical image according to the target image characteristics.
In one or more embodiments, a manner of performing segmentation processing on a medical image is presented. As can be seen from the foregoing embodiments, the original image may be a three-dimensional image, and the "original image" is exemplified here as an "original medical image". An original medical image is usually a three-dimensional image whose imaging mechanism differs from that of natural images; it is typically acquired by Computed Tomography (CT) or Magnetic Resonance Imaging (MRI), and has lower resolution and smaller contrast than a natural image, which makes the boundaries in the segmentation task difficult to determine.
Specifically, after the target segmentation medical image corresponding to the original medical image is obtained, the target segmentation medical image can be displayed at the front end, so that a user can conveniently view the target segmentation medical image.
In this regard, the present application uses the synapse public dataset, which consists of 30 abdominal CT scans with 13 labeled organs; each CT scan contains 80 to 225 slices with a thickness between 1 mm and 6 mm. Each three-dimensional CT image is preprocessed by normalizing the intensity values from [-1000, 1000] to [0, 1]. All images are resampled to isotropic three-dimensional images with a 1.0 mm spacing, and the CT image is scaled to 128 × 128 as the model input. Since this is a 13-class multi-organ segmentation task, the output has 14 classes (including the background). For the 30 three-dimensional CT cases, the synapse dataset can be divided into 24 cases as the training set and 6 cases as the test set according to a certain split. During training, the batch size can be set to 2, the learning rate to 1e-4, the optimizer to adaptive moment estimation with decoupled weight decay (AdamW), and the weight decay to 1e-5.
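For illustration only, a sketch of the described preprocessing is given below using NumPy and SciPy; the clipping-based normalization and the isotropic 128 × 128 × 128 input size are assumptions made for the sketch, since the text only specifies the [-1000, 1000] to [0, 1] normalization, 1.0 mm resampling and a 128 × 128 input scale.

```python
import numpy as np
from scipy.ndimage import zoom


def preprocess_ct(volume: np.ndarray, spacing) -> np.ndarray:
    # spacing: (z, y, x) voxel size in millimetres.
    # Clip and normalize intensities from [-1000, 1000] to [0, 1] (assumed linear mapping).
    volume = np.clip(volume, -1000.0, 1000.0)
    volume = (volume + 1000.0) / 2000.0
    # Resample to an isotropic 1.0 mm spacing.
    volume = zoom(volume, zoom=[s / 1.0 for s in spacing], order=1)
    # Rescale to the assumed 128 x 128 x 128 model input.
    volume = zoom(volume, zoom=[128 / d for d in volume.shape], order=1)
    return volume.astype(np.float32)
```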
Based on this, taking the synapse abdominal multi-organ CT segmentation task as an example and for convenience of explanation, please refer to fig. 13; fig. 13 is a schematic comparison of experimental results in the embodiment of the present application. As shown in the figure, the ground-truth (GT) annotations serve as the reference group. The first group is the result of image segmentation using the ASES of the present application as an enhancement of Local Vision-TR, achieving an Average Dice of 0.8170. The second group is the result of image segmentation using a U-shaped Transformer (UNETR), achieving an Average Dice of 0.7919. The third group is the result of image segmentation using Swin-Unet, achieving an Average Dice of 0.7565. The fourth group is the result of image segmentation using a 3D CSWin-Transformer, achieving an Average Dice of 0.7687. The fifth group is the result of image segmentation using TransUnet, achieving an Average Dice of 0.7479. The sixth group is the result of image segmentation using a 3D U-shaped network (3D U-Net), achieving an Average Dice of 0.7519.
For reference, the state-of-the-art (SOTA) model on general 3D vision tasks, CSWin-Transformer, achieves an Average Dice of 0.7681, and UNETR achieves an Average Dice of 0.7838 on the synapse dataset. The enhanced shortcut provided by the present application brings a clear improvement over Local Vision-TR, and the resulting model further exceeds the existing SOTA results.
Secondly, in the embodiment of the present application, a method for performing segmentation processing on a medical image is provided. By the method, the medical segmentation visual task has high precision and small extra calculation amount. The method has potential application value in a medical prediction task using Local Vision-TR as a backbone network, can be integrated into medical image analysis software, and can also be applied to a Vision Transformer used in a general Vision task.
With reference to fig. 14, the image segmentation method in the present application may be executed by a computer device, where the computer device may be a terminal or a server, and the method includes:
210. acquiring original image characteristics corresponding to an original image;
in one or more embodiments, an original image is acquired, where the original image may be a two-dimensional image or a three-dimensional image. The two-dimensional image may be represented as C × H × W, and the three-dimensional image may be represented as C × H × W × D, where C denotes the number of channels, H denotes the image height, W denotes the image width, and D denotes the image depth.
Specifically, the original image is used as the input of the feature extraction module in the image segmentation model, and the original image features are output by the feature extraction module. It is understood that the feature extraction module may obtain the original image features by way of patch embedding based on a convolutional stem or a patchify stem.
220. Based on original image characteristics, acquiring first image characteristics through an encoder in an image segmentation model, wherein the encoder comprises a plurality of local visual modules and a plurality of down-sampling layers, each local visual module comprises a channel enhancement module, each channel enhancement module is used for generating a channel scaling factor based on the input image characteristics, and each channel scaling factor is used for correcting the channel characteristics corresponding to the input image characteristics;
in one or more embodiments, the original image feature is used as an input to an encoder in the image segmentation model, and the first image feature is output by the encoder. The encoder includes a number of local vision modules and a number of downsampling layers, each downsampling layer for performing a reduction process on the image. Each Local visual module consists of two parts, one part is a Local Vision-TR module, and the other part is ASES. Wherein, the ASES at least comprises a channel enhancement module.
In particular, the channel enhancement module included in each local vision module may output a corresponding channel scaling factor based on the input image features. Based on the channel scaling factor, the channel characteristics corresponding to the input image characteristics are corrected. Thereby, the purpose of enhancing the channel characteristics is achieved.
It can be understood that the image segmentation model includes an encoder and a decoder, and the original image features are processed by the encoder and then by the decoder, and finally the image segmentation is realized. Wherein, the encoder can make the image segmentation model understand the content of the image, and as the network layer deepens, the size of the original image features is reduced, and the channels are increased.
It should be noted that the Local Vision-TR module in the present application may use a windowed Vision Transformer, such as CSWin-Transformer, or Shuffle Transformer, or PVTv1, or use other forms of transformers, which is not limited herein.
230. Acquiring second image characteristics through a decoder in an image segmentation model based on the first image characteristics, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
in one or more embodiments, the first image feature is used as an input to a decoder in the image segmentation model, and the second image feature is output by the decoder. The decoder comprises a plurality of local vision modules and a plurality of up-sampling layers, wherein each up-sampling layer is used for amplifying the image. Each Local Vision module includes a Local Vision-TR module and an ASES. The ASES includes at least a spatial enhancement module.
It will be appreciated that the decoder, in conjunction with the encoder's understanding of the image content, recovers the position information of the image; as the network deepens, the size of the image features increases and the number of channels decreases.
240. Splicing the second image characteristic and a third image characteristic output by a first local visual module in an encoder to obtain a target image characteristic, wherein the second image characteristic is an image characteristic output by a last local visual module in a decoder;
in one or more embodiments, the decoder includes a plurality of local visual modules, and the image feature output by the last local visual module is the second image feature. Based on the above, concat can be performed on the second image feature and the third image feature output by the first local vision module in the encoder, so as to obtain the target image feature. And the concat operation is utilized to fuse the image characteristics on the corresponding positions in the up-sampling and down-sampling processes, so that a decoder can acquire more high-resolution information during up-sampling, further, the detail information in the original image is restored more perfectly, and the segmentation precision is improved.
Specifically, the first local visual module in the encoder and the last local visual module in the decoder are in jumping connection, and the characteristic information on the corresponding scale can be introduced into an up-sampling or deconvolution process by adopting the jumping connection, so that multi-scale and multi-level information is provided for subsequent image segmentation, and a finer segmentation effect can be obtained.
250. And generating a target segmentation image according to the characteristics of the target image.
In one or more embodiments, the target image features are mapped to obtain a target segmented image. For example, in one implementation, the mapping may be performed using transposed convolution, and in another implementation, the mapping may be performed using interpolation and convolution. Taking the example of mapping implemented by using the transposed convolution, if the original image is a two-dimensional image, the two-dimensional transposed convolution with kernel size of 2 × 2 and 1 × 1 may be used to map the target image features to the size of the original image. If the original image is a three-dimensional image, the target image features can be mapped to the size of the original image using a convolution of the three-dimensional transpose with a kernel size of 2 x 2 and 1 x 1.
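For illustration only, a minimal sketch of the three-dimensional target projection is given below, assuming a PyTorch implementation in which the target image feature sits at 1/4 of the original resolution so that two stride-2 transposed convolutions restore it; the channel widths and the 14-class output are illustrative.

```python
import torch
import torch.nn as nn

project = nn.Sequential(
    nn.ConvTranspose3d(32, 16, kernel_size=2, stride=2),  # 2 x 2 (x 2) kernel, upsample x2
    nn.ConvTranspose3d(16, 16, kernel_size=2, stride=2),  # upsample x2 again
    nn.Conv3d(16, 14, kernel_size=1),                      # 1 x 1 mapping to output channels
)

x = torch.randn(1, 32, 32, 32, 32)  # target image feature at 1/4 resolution
print(project(x).shape)             # torch.Size([1, 14, 128, 128, 128])
```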
Specifically, for ease of understanding, please refer to fig. 15; fig. 15 is another structural diagram of the local vision module in the embodiment of the present application. As shown in the figure, the ASES includes a channel enhancement module, and the Local Vision-TR module includes a first normalization layer, an attention module, a second normalization layer, and an MLP. The attention module may employ W-MSA, and the first normalization layer and the second normalization layer employ BN. Identity mappings are maintained by residual connections around the attention module and the MLP, respectively. The channel enhancement module acts on the MLP: it learns an adaptive scaling value from the input features, multiplies it by the output of the MLP, and then adds the product to the residual connection of the MLP; that is, it follows a scale-first, then add-bias form.
In an embodiment of the application, a method for image segmentation is provided. By the method, the channel enhancement module is introduced into the local vision module, the channel enhancement module can generate the channel scaling factor based on the input image characteristics, and the channel characteristics corresponding to the image characteristics can be adjusted by using the channel scaling factor, so that the purpose of enhancing the characteristic characterization capability is achieved. Therefore, the channel scaling factor is generated in a self-adaptive mode based on the characteristics of the input image, the configuration difficulty of the image segmentation model is reduced, and the configuration efficiency of the image segmentation model is improved.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiments of the present application, the local vision module further includes a spatial enhancement module, where the spatial enhancement module generates a spatial scaling factor based on the input image feature, and the spatial scaling factor is used to modify the spatial feature corresponding to the input image feature.
In one or more embodiments, a way to incorporate spatial enhancement is presented. As can be seen from the foregoing embodiments, each Local Vision module includes a Local Vision-TR module and an ASES. The ASES may include not only a channel enhancement module but also a spatial enhancement module.
In particular, the spatial enhancement module included in each of the local vision modules may output a corresponding spatial scaling factor based on the input image characteristics. Based on this, the spatial feature corresponding to the input image feature is corrected by the spatial scaling factor. Therefore, the purpose of enhancing the spatial characteristics is achieved.
Secondly, in the embodiment of the present application, a way of adding spatial enhancement is provided. In this way, the spatial enhancement shortcut and the channel enhancement shortcut are added to the Local Vision-TR backbone at the same time, which improves the performance of Local Vision-TR and achieves a better image segmentation effect.
Referring to fig. 16, fig. 16 is a schematic diagram of an embodiment of an image segmentation apparatus in an embodiment of the present application, and the image segmentation apparatus 30 includes:
an obtaining module 310, configured to obtain an original image feature corresponding to an original image;
the obtaining module 310 is further configured to obtain, based on an original image feature, a first image feature through an encoder in an image segmentation model, where the encoder includes a plurality of local visual modules and a plurality of downsampling layers, the local visual module includes a spatial enhancement module, the spatial enhancement module is configured to generate a spatial scaling factor based on an input image feature, and the spatial scaling factor is used to modify a spatial feature corresponding to the input image feature;
the obtaining module 310 is further configured to obtain, based on the first image feature, a second image feature through a decoder in the image segmentation model, where the decoder includes a plurality of local visual modules and a plurality of upsampling layers;
the processing module 320 is configured to splice a second image feature and a third image feature output by a first local vision module in the encoder to obtain a target image feature, where the second image feature is an image feature output by a last local vision module in the decoder;
and the segmentation module 330 is configured to generate a target segmented image according to the target image feature.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the image segmentation apparatus 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to obtain an original image;
carrying out blocking operation on an original image to obtain K image blocks, wherein K is an integer greater than 1;
performing convolution operation on each image block in the K image blocks to obtain K characteristic vectors;
and generating original image features according to the K feature vectors.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the image segmentation apparatus 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to obtain, based on an original image feature, a first intermediate image feature through a local vision module included in an encoder, where the encoder belongs to an image segmentation model;
acquiring a second intermediate image feature through a down-sampling layer included in an encoder based on the first intermediate image feature;
and acquiring the first image characteristic through a residual module included by the encoder based on the second intermediate image characteristic, wherein the residual module included by the encoder comprises at least one local visual module, or the residual module included by the encoder comprises at least one local visual module and at least one down-sampling layer.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the image segmentation apparatus 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to obtain, based on the first image feature, a third intermediate image feature through an upsampling layer included in a decoder, where the decoder belongs to an image segmentation model;
acquiring a fourth intermediate image feature by a local vision module included in the decoder based on the third intermediate image feature;
and acquiring the second image characteristics through a residual module included by the decoder based on the fourth intermediate image characteristics, wherein the residual module included by the decoder comprises at least one local visual module, or the residual module included by the decoder comprises at least one local visual module and at least one upsampling layer.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the image segmentation apparatus 30 provided in the embodiment of the present application,
the segmentation module 330 is specifically configured to perform convolution operation on the original image to obtain a first to-be-processed image feature;
performing convolution operation on the target image characteristic to obtain a second image characteristic to be processed;
adding the first image feature to be processed and the second image feature to be processed to obtain a third image feature to be processed;
and performing convolution operation on the third image feature to be processed to obtain a target segmentation image.
Optionally, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the image segmentation apparatus 30 provided in the embodiment of the present application, the image segmentation apparatus 30 further includes a training module 340;
the obtaining module 310 is further configured to obtain sample image features corresponding to a sample image, where the sample image is an image labeled by a region;
an obtaining module 310, configured to obtain, by an encoder in an image segmentation model, a first sample image feature based on the sample image feature;
an obtaining module 310, further configured to obtain, by a decoder in the image segmentation model, a second sample image feature based on the first sample image feature;
the processing module 320 is further configured to splice the second sample image feature and a third sample image feature output by the first local vision module in the encoder to obtain a target sample image feature;
the segmentation module 330 is further configured to generate a target segmentation sample image according to the sample image and the target sample image feature;
the training module 340 is configured to update the model parameters of the image segmentation model according to the target segmentation sample image and the labeled region of the sample image.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the image segmentation apparatus 30 provided in the embodiment of the present application,
an obtaining module 310, configured to obtain an original medical image feature corresponding to an original medical image, where the original medical image is a two-dimensional image or a three-dimensional image;
the segmentation module 330 is specifically configured to generate and display a target segmented medical image according to the target image feature.
Referring to fig. 17, fig. 17 is a schematic diagram of an embodiment of an image segmentation apparatus according to an embodiment of the present application, and the image segmentation apparatus 40 includes:
an obtaining module 410, configured to obtain an original image feature corresponding to an original image;
the obtaining module 410 is further configured to obtain, based on an original image feature, a first image feature through an encoder in an image segmentation model, where the encoder includes a plurality of local visual modules and a plurality of downsampling layers, the local visual module includes a channel enhancement module, the channel enhancement module is configured to generate a channel scaling factor based on an input image feature, and the channel scaling factor is used to correct a channel feature corresponding to the input image feature;
the obtaining module 410 is further configured to obtain, based on the first image feature, a second image feature through a decoder in the image segmentation model, where the decoder includes a plurality of local visual modules and a plurality of upsampling layers;
the processing module 420 is configured to splice a second image feature and a third image feature output by a first local vision module in the encoder to obtain a target image feature, where the second image feature is an image feature output by a last local vision module in the decoder;
and a segmentation module 430, configured to generate a target segmented image according to the target image feature.
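To show how these modules fit together, the following is a minimal wiring sketch of the pipeline handled by the obtaining, processing, and segmentation modules, assuming a single downsampling/upsampling stage and a user-supplied builder for the local vision block (which would contain the channel enhancement module); the depth, channel counts, and the 1x1 output convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegmentationPipeline(nn.Module):
    """Wiring sketch: encoder (local vision blocks + downsampling), decoder (local vision
    blocks + upsampling), concatenation of the decoder output with the output of the first
    encoder block, and a convolution producing the target segmented image."""

    def __init__(self, block, in_channels=3, ch=(32, 64)):
        super().__init__()
        # `block(cin, cout)` is assumed to build one local vision module, e.g.
        # block = lambda cin, cout: nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())
        self.enc_block1 = block(in_channels, ch[0])          # first local vision module in the encoder
        self.down = nn.MaxPool2d(2)                          # downsampling layer
        self.enc_block2 = block(ch[0], ch[1])                # last local vision module in the encoder
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)  # upsampling layer
        self.dec_block = block(ch[1], ch[0])                 # last local vision module in the decoder
        self.head = nn.Conv2d(ch[0] * 2, 1, kernel_size=1)   # applied after the splice

    def forward(self, original_image_feature):
        third = self.enc_block1(original_image_feature)      # third image feature (skip branch)
        first = self.enc_block2(self.down(third))            # first image feature (encoder output)
        second = self.dec_block(self.up(first))              # second image feature (decoder output)
        target = torch.cat([second, third], dim=1)           # target image feature (splice)
        return self.head(target)                             # target segmented image (logits)
```

A deeper variant would repeat the downsampling and upsampling stages with additional local vision blocks and one skip connection per resolution.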
Fig. 18 is a schematic diagram of a server structure provided by an embodiment of the present application. The server 500 may vary considerably depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 522 (e.g., one or more processors), a memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) for storing applications 542 or data 544. The memory 532 and the storage medium 530 may be transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processor 522 may be configured to communicate with the storage medium 530 and execute, on the server 500, the series of instruction operations stored in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 18.
Fig. 19 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 19, for convenience of description, only the portion related to the embodiment of the present application is shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present application. In the embodiment of the present application, a smartphone is taken as an example of the terminal device for description:
Fig. 19 is a block diagram illustrating a partial structure of the smartphone related to the terminal device provided in an embodiment of the present application. Referring to fig. 19, the smartphone includes: a Radio Frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, an audio circuit 660, a wireless fidelity (WiFi) module 670, a processor 680, and a power supply 690. Those skilled in the art will appreciate that the smartphone structure shown in fig. 19 is not limiting; the smartphone may include more or fewer components than shown, combine some components, or have a different arrangement of components.
The following describes each component of the smartphone in detail with reference to fig. 19:
the RF circuit 610 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information of a base station and then processes the received downlink information to the processor 680; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 610 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), etc.
The memory 620 may be used to store software programs and modules, and the processor 680 executes various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 620. The memory 620 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phone book, etc.) created according to the use of the smartphone, and the like. Further, the memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smartphone. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations of a user on or near it (e.g., operations performed by the user on or near the touch panel 631 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 680, and can receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. In addition to the touch panel 631, the input unit 630 may include other input devices 632. In particular, the other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a mouse, a joystick, and the like.
The display unit 640 may be used to display information input by or provided to the user and various menus of the smartphone. The display unit 640 may include a display panel 641, and optionally, the display panel 641 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 631 may cover the display panel 641, and when the touch panel 631 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 680 to determine the type of the touch event, and then the processor 680 provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 19, the touch panel 631 and the display panel 641 are two separate components to implement the input and output functions of the smart phone, in some embodiments, the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the smart phone.
The smartphone may also include at least one sensor 650, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 641 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 641 and/or the backlight when the smartphone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the attitude of the smartphone (such as switching between landscape and portrait modes, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tap detection); other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the smartphone, and are not described here.
The audio circuit 660, the speaker 661, and the microphone 662 can provide an audio interface between the user and the smartphone. The audio circuit 660 may transmit an electrical signal converted from received audio data to the speaker 661, and the speaker 661 converts the electrical signal into a sound signal for output; on the other hand, the microphone 662 converts a collected sound signal into an electrical signal, which is received by the audio circuit 660 and converted into audio data; the audio data is processed by the processor 680 and then sent through the RF circuit 610 to, for example, another smartphone, or output to the memory 620 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 670, the smartphone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 19 shows the WiFi module 670, it is understood that the WiFi module is not an essential part of the smartphone and may be omitted as needed without changing the essence of the invention.
The processor 680 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, and performs various functions of the smart phone and processes data by operating or executing software programs and/or modules stored in the memory 620 and calling data stored in the memory 620. Optionally, processor 680 may include one or more processing units; optionally, the processor 680 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 680.
The smartphone also includes a power supply 690 (e.g., a battery) that provides power to the various components. Optionally, the power supply may be logically connected to the processor 680 via a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented via the power management system.
Although not shown, the smart phone may further include a camera, a bluetooth module, and the like, which are not described herein.
The steps performed by the terminal device in the above-described embodiment may be based on the terminal device configuration shown in fig. 19.
The embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the processor implements the steps of the methods described in the foregoing embodiments.
The embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the methods described in the foregoing embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the methods described in the foregoing embodiments.
It should be understood that the specific implementation of the present application involves related data such as original medical images. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application essentially, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (20)

1. A method of image segmentation, comprising:
acquiring original image characteristics corresponding to an original image;
based on the original image features, obtaining first image features through an encoder in an image segmentation model, wherein the encoder comprises a plurality of local visual modules and a plurality of down-sampling layers, each local visual module comprises a spatial enhancement module, the spatial enhancement module is used for generating a spatial scaling factor based on input image features, and the spatial scaling factor is used for correcting the spatial features corresponding to the input image features;
acquiring second image features through a decoder in the image segmentation model based on the first image features, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
splicing the second image feature and a third image feature output by a first local vision module in the encoder to obtain a target image feature, wherein the second image feature is the image feature output by the last local vision module in the decoder;
and generating a target segmentation image according to the target image characteristics.
2. The method of claim 1, wherein the local vision module further comprises a first normalization layer and an attention module;
the first normalization layer is used for performing normalization processing on the first input features to obtain first normalized features;
the attention module is used for extracting features of the first normalized features to obtain spatial features of the first normalized features;
the space enhancement module is used for performing maximum pooling operation and average pooling operation on the first normalized features to obtain a target merging result, wherein the target merging result comprises a maximum pooling result and an average pooling result;
the space enhancement module is also used for carrying out convolution operation on the target merging result to obtain a convolution result;
the space enhancement module is further used for calculating the convolution result by adopting an activation function to obtain a target space scaling factor.
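For illustration only, the spatial enhancement path described in claim 2 can be sketched in PyTorch style as follows, assuming that the maximum and average pooling are taken along the channel dimension so that the resulting scaling factor has one value per spatial position, and assuming a 7x7 convolution and a sigmoid activation; these specifics are assumptions rather than limitations recited in the claim.

```python
import torch
import torch.nn as nn

class SpatialEnhancement(nn.Module):
    """Sketch: max-pool and average-pool the input along the channel dimension, merge the
    two maps, convolve, and activate to obtain the target spatial scaling factor."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                # x: (N, C, H, W), e.g. the first normalized feature
        max_pool = x.max(dim=1, keepdim=True).values     # maximum pooling result, (N, 1, H, W)
        avg_pool = x.mean(dim=1, keepdim=True)           # average pooling result, (N, 1, H, W)
        merged = torch.cat([max_pool, avg_pool], dim=1)  # target merging result
        scale = torch.sigmoid(self.conv(merged))         # target spatial scaling factor in (0, 1)
        return x * scale                                 # spatial feature corrected by the scaling factor
```

In the variant of claim 3 below, the same computation is applied to the spatial features of the first input feature, and the normalization layer then operates on the corrected result.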
3. The method of claim 1, wherein the local vision module further comprises a first normalization layer and an attention module;
the attention module is used for carrying out feature extraction on a first input feature to obtain a spatial feature of the first input feature;
the spatial enhancement module is configured to perform maximum pooling operation and average pooling operation on the spatial features of the first input feature to obtain a target merging result, where the target merging result includes a maximum pooling result and an average pooling result;
the space enhancement module is also used for carrying out convolution operation on the target merging result to obtain a convolution result;
the space enhancement module is also used for calculating the convolution result by adopting an activation function to obtain a target space scaling factor;
the first normalization layer is configured to perform normalization processing on the corrected spatial feature, where the corrected spatial feature is obtained by correcting the spatial feature of the first input feature by using the target spatial scaling factor.
4. The method of claim 1, wherein the local vision module further comprises a channel enhancement module that generates a channel scaling factor based on the input image feature, the channel scaling factor being used to modify the channel feature corresponding to the input image feature.
5. The method of claim 4, wherein the local vision module further comprises a second normalization layer and a multi-layered perceptron;
the second normalization layer is used for performing normalization processing on a second input feature to obtain a second normalized feature;
the multilayer perceptron is used for carrying out feature extraction on the second normalized feature to obtain a channel feature of the second normalized feature;
the channel enhancement module is used for performing maximum pooling operation on the channel characteristics of the second normalized characteristics to obtain a maximum pooling result, and performing average pooling operation on the channel characteristics of the second normalized characteristics to obtain an average pooling result;
the channel enhancement module is further used for performing convolution operation on the maximum pooling result to obtain a first convolution result, and performing convolution operation on the average pooling result to obtain a second convolution result;
the channel enhancement module is further used for determining a target convolution result according to the first convolution result and the second convolution result;
and the channel enhancement module is also used for calculating the target convolution result by adopting an activation function to obtain a target channel scaling factor.
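As an illustration of the channel path in claim 5, the following PyTorch-style sketch assumes global maximum and average pooling over the spatial dimensions, a shared 1x1-convolution bottleneck standing in for the two convolution operations, addition as the way the target convolution result is determined, and a sigmoid activation; all of these specifics are assumptions.

```python
import torch
import torch.nn as nn

class ChannelEnhancement(nn.Module):
    """Sketch: pool over the spatial dimensions, convolve each pooled result, combine the
    two results, and activate to obtain the target channel scaling factor (one per channel)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(                             # shared bottleneck (an assumption)
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):                                    # x: (N, C, H, W)
        max_pool = torch.amax(x, dim=(2, 3), keepdim=True)   # maximum pooling result, (N, C, 1, 1)
        avg_pool = torch.mean(x, dim=(2, 3), keepdim=True)   # average pooling result, (N, C, 1, 1)
        first_conv = self.fc(max_pool)                       # first convolution result
        second_conv = self.fc(avg_pool)                      # second convolution result
        target_conv = first_conv + second_conv               # target convolution result (addition assumed)
        scale = torch.sigmoid(target_conv)                   # target channel scaling factor
        return x * scale                                     # channel feature corrected by the scaling factor
```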
6. The method of claim 4, wherein the local vision module further comprises a second normalization layer and a multi-layered perceptron;
the multilayer perceptron is used for carrying out feature extraction on a second input feature to obtain a channel feature of the second input feature;
the channel enhancement module is used for performing maximum pooling operation on the channel characteristics of the second input characteristics to obtain a maximum pooling result, and performing average pooling operation on the channel characteristics of the second input characteristics to obtain an average pooling result;
the channel enhancement module is further used for performing convolution operation on the maximum pooling result to obtain a first convolution result, and performing convolution operation on the average pooling result to obtain a second convolution result;
the channel enhancement module is further used for determining a target convolution result according to the first convolution result and the second convolution result;
the channel enhancement module is also used for calculating the target convolution result by adopting an activation function to obtain a target channel scaling factor;
the second normalization layer is configured to perform normalization processing on the corrected channel characteristics, where the corrected channel characteristics are obtained by correcting the channel characteristics of the second input characteristics by using the target channel scaling factor.
7. The method according to any one of claims 1 to 6, wherein the encoder comprises M of the local visual modules and N of the downsampling layers, the decoder comprises (M-1) of the local visual modules and N of the upsampling layers, and the (M-1) local visual modules comprised by the decoder are skip-connected to the corresponding (M-1) local visual modules in the encoder, wherein M is an integer greater than 1, and N is an integer greater than or equal to 1;
the image feature output by a first one of the local vision modules in the encoder is used as the image feature input by a first one of the downsampling layers in the encoder;
the image feature output by the last downsampling layer in the encoder is used as the image feature input by the last local vision module in the encoder;
the image feature output by the last local vision module in the encoder is used as the image feature input by the first upsampling layer in the decoder;
the image features output by a first one of the upsampling layers in the decoder are used as the image features input by a first one of the local vision modules in the decoder;
and the image feature output by the last upsampling layer in the decoder is used as the image feature input by the last local vision module in the decoder.
8. The method according to claim 1, wherein the obtaining of the original image feature corresponding to the original image comprises:
acquiring the original image;
carrying out blocking operation on the original image to obtain K image blocks, wherein K is an integer greater than 1;
performing convolution operation on each image block in the K image blocks to obtain K feature vectors;
and generating the original image features according to the K feature vectors.
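The blocking operation of claim 8 can be illustrated with the following PyTorch-style sketch, which realizes the per-block convolution as a single strided convolution (a common equivalent formulation); the patch size and embedding width are assumptions.

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch: split the original image into K non-overlapping blocks and convolve each
    block into one feature vector; the K vectors form the original image features."""

    def __init__(self, in_channels=3, embed_dim=96, patch_size=4):
        super().__init__()
        # one strided convolution applies the same kernel independently to every image block
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, original_image):            # (N, C, H, W), with H and W divisible by patch_size
        x = self.proj(original_image)             # (N, embed_dim, H / patch_size, W / patch_size)
        x = x.flatten(2).transpose(1, 2)          # (N, K, embed_dim): K feature vectors
        return x                                  # original image features built from the K vectors
```

Here K equals (H / patch_size) x (W / patch_size), and each of the K rows of the output is the feature vector of one image block.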
9. The method of claim 1, wherein obtaining the first image feature by an encoder in an image segmentation model based on the original image feature comprises:
obtaining, by the local vision module included in the encoder, a first intermediate image feature based on the original image feature, wherein the encoder belongs to the image segmentation model;
obtaining, by the downsampling layer included by the encoder, a second intermediate image feature based on the first intermediate image feature;
obtaining the first image feature by a residual module included in the encoder based on the second intermediate image feature, wherein the residual module included in the encoder includes at least one local visual module, or the residual module included in the encoder includes at least one local visual module and at least one downsampling layer.
10. The method of claim 1, wherein obtaining, by a decoder in the image segmentation model, a second image feature based on the first image feature comprises:
obtaining a third intermediate image feature through the upsampling layer included in the decoder based on the first image feature, wherein the decoder belongs to the image segmentation model;
obtaining, by the local vision module included in the decoder, a fourth intermediate image feature based on the third intermediate image feature;
and obtaining the second image feature through a residual module included in the decoder based on the fourth intermediate image feature, wherein the residual module included in the decoder includes at least one local visual module, or the residual module included in the decoder includes at least one local visual module and at least one upsampling layer.
11. The method of claim 1, wherein generating a target segmented image based on the target image features comprises:
performing convolution operation on the original image to obtain a first image feature to be processed;
performing convolution operation on the target image characteristic to obtain a second image characteristic to be processed;
adding the first image feature to be processed and the second image feature to be processed to obtain a third image feature to be processed;
and performing convolution operation on the third image feature to be processed to obtain the target segmentation image.
12. The method of claim 1, further comprising:
acquiring sample image characteristics corresponding to a sample image, wherein the sample image is an image subjected to region labeling;
obtaining, by an encoder in the image segmentation model, a first sample image feature based on the sample image feature;
acquiring a second sample image characteristic through a decoder in the image segmentation model based on the first sample image characteristic;
splicing the second sample image characteristic and a third sample image characteristic output by a first local vision module in the encoder to obtain a target sample image characteristic;
generating a target segmentation sample image according to the sample image and the target sample image characteristics;
and updating the model parameters of the image segmentation model according to the target segmentation sample image and the labeling area of the sample image.
13. The method according to claim 1, wherein the obtaining of the original image feature corresponding to the original image comprises:
acquiring original medical image characteristics corresponding to an original medical image, wherein the original medical image is a two-dimensional image or a three-dimensional image;
the generating a target segmentation image according to the target image feature comprises:
and generating and displaying a target segmentation medical image according to the target image characteristics.
14. A method of image segmentation, comprising:
acquiring original image characteristics corresponding to an original image;
based on the original image features, obtaining first image features through an encoder in an image segmentation model, wherein the encoder comprises a plurality of local visual modules and a plurality of down-sampling layers, each local visual module comprises a channel enhancement module, the channel enhancement module is used for generating a channel scaling factor based on input image features, and the channel scaling factor is used for correcting the channel features corresponding to the input image features;
acquiring second image features through a decoder in the image segmentation model based on the first image features, wherein the decoder comprises a plurality of local visual modules and a plurality of upsampling layers;
splicing the second image feature and a third image feature output by a first local vision module in the encoder to obtain a target image feature, wherein the second image feature is the image feature output by the last local vision module in the decoder;
and generating a target segmentation image according to the target image characteristics.
15. The method of claim 14, wherein the local vision module further comprises a spatial enhancement module that generates a spatial scaling factor based on the input image feature, the spatial scaling factor being used to modify the spatial feature to which the input image feature corresponds.
16. An image segmentation apparatus, comprising:
the acquisition module is used for acquiring the original image characteristics corresponding to the original image;
the obtaining module is further configured to obtain a first image feature through an encoder in an image segmentation model based on the original image feature, where the encoder includes a plurality of local visual modules and a plurality of downsampling layers, the local visual module includes a spatial enhancement module, the spatial enhancement module is configured to generate a spatial scaling factor based on an input image feature, and the spatial scaling factor is used to correct a spatial feature corresponding to the input image feature;
the acquisition module is further configured to acquire, based on the first image feature, a second image feature through a decoder in the image segmentation model, where the decoder includes a plurality of the local visual modules and a plurality of upsampling layers;
the processing module is used for splicing the second image feature and a third image feature output by a first local vision module in the encoder to obtain a target image feature, wherein the second image feature is the image feature output by the last local vision module in the decoder;
and the segmentation module is used for generating a target segmentation image according to the target image feature.
17. An image segmentation apparatus, comprising:
the acquisition module is used for acquiring the original image characteristics corresponding to the original image;
the obtaining module is further configured to obtain, based on the original image feature, a first image feature through an encoder in an image segmentation model, where the encoder includes a plurality of local visual modules and a plurality of downsampling layers, the local visual module includes a channel enhancement module, the channel enhancement module is configured to generate a channel scaling factor based on an input image feature, and the channel scaling factor is used to correct a channel feature corresponding to the input image feature;
the acquisition module is further configured to acquire, based on the first image feature, a second image feature through a decoder in the image segmentation model, where the decoder includes a plurality of the local visual modules and a plurality of upsampling layers;
the processing module is used for splicing the second image feature and a third image feature output by a first local vision module in the encoder to obtain a target image feature, wherein the second image feature is the image feature output by the last local vision module in the decoder;
and the segmentation module is used for generating a target segmentation image according to the target image feature.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 13, or implements the steps of the method of any one of claims 14 to 15.
19. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 13 or carries out the steps of the method of any one of claims 14 to 15.
20. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 13 or carries out the steps of the method of any one of claims 14 to 15.
CN202210181434.0A 2022-02-25 2022-02-25 Image segmentation method, related device, equipment and storage medium Pending CN114549556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210181434.0A CN114549556A (en) 2022-02-25 2022-02-25 Image segmentation method, related device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210181434.0A CN114549556A (en) 2022-02-25 2022-02-25 Image segmentation method, related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114549556A true CN114549556A (en) 2022-05-27

Family

ID=81679411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210181434.0A Pending CN114549556A (en) 2022-02-25 2022-02-25 Image segmentation method, related device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114549556A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953175A (en) * 2024-03-26 2024-04-30 湖南速子文化科技有限公司 Method, system, equipment and medium for constructing virtual world data model
CN117953175B (en) * 2024-03-26 2024-06-11 湖南速子文化科技有限公司 Method, system, equipment and medium for constructing virtual world data model

Similar Documents

Publication Publication Date Title
Yap et al. Deep learning in diabetic foot ulcers detection: A comprehensive evaluation
CN110348543B (en) Fundus image recognition method and device, computer equipment and storage medium
CN111476306B (en) Object detection method, device, equipment and storage medium based on artificial intelligence
EP3951654A1 (en) Image classification model training method, and image processing method and device
CN111091576B (en) Image segmentation method, device, equipment and storage medium
CN112162930B (en) Control identification method, related device, equipment and storage medium
CN110414631B (en) Medical image-based focus detection method, model training method and device
JP2022518745A (en) Target position acquisition method, equipment, computer equipment and computer program
CN111091521B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN113256529B (en) Image processing method, image processing device, computer equipment and storage medium
CN112115900B (en) Image processing method, device, equipment and storage medium
CN111932463B (en) Image processing method, device, equipment and storage medium
CN114332530A (en) Image classification method and device, computer equipment and storage medium
CN112419326B (en) Image segmentation data processing method, device, equipment and storage medium
CN114418069A (en) Method and device for training encoder and storage medium
CN114359289A (en) Image processing method and related device
CN115082490B (en) Abnormity prediction method, and abnormity prediction model training method, device and equipment
CN111556337B (en) Media content implantation method, model training method and related device
CN113822427A (en) Model training method, image matching device and storage medium
CN117058517A (en) Helmet detection method, device and medium based on YOLOv5 optimization model
CN114549556A (en) Image segmentation method, related device, equipment and storage medium
CN112037305B (en) Method, device and storage medium for reconstructing tree-like organization in image
CN114332553A (en) Image processing method, device, equipment and storage medium
CN111914106B (en) Texture and normal library construction method, texture and normal map generation method and device
CN113516665A (en) Training method of image segmentation model, image segmentation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination