CN113065551A - Method for performing image segmentation using a deep neural network model - Google Patents

Method for performing image segmentation using a deep neural network model

Info

Publication number
CN113065551A
CN113065551A
Authority
CN
China
Prior art keywords
semantic features
scale
decoding
layer
result
Prior art date
Legal status
Granted
Application number
CN202110294862.XA
Other languages
Chinese (zh)
Other versions
CN113065551B (en)
Inventor
杨林
亢宇鑫
李涵生
崔磊
费达
付士军
徐黎
杨海英
Current Assignee
Hangzhou Diyingjia Technology Co ltd
AstraZeneca Investment China Co Ltd
Original Assignee
Hangzhou Diyingjia Technology Co ltd
AstraZeneca Investment China Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Diyingjia Technology Co ltd, AstraZeneca Investment China Co Ltd filed Critical Hangzhou Diyingjia Technology Co ltd
Priority to CN202110294862.XA priority Critical patent/CN113065551B/en
Publication of CN113065551A publication Critical patent/CN113065551A/en
Application granted granted Critical
Publication of CN113065551B publication Critical patent/CN113065551B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20092 - Interactive image processing based on input by user
    • G06T2207/20104 - Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides a method, apparatus, device, and medium for performing image segmentation using a deep neural network model. The deep neural network model comprises an encoder and a decoder, and the method comprises the following steps: acquiring an input image containing a region of interest; extracting semantic features of the input image at a plurality of different scales using a plurality of encoding layers of the encoder; decoding the semantic features of the plurality of scales using a plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale; integrating the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and outputting an output image in which the region of interest is segmented from the input image according to the final prediction result. The image segmentation method provided by the disclosure improves the overall continuity of the region of interest segmented in the image and thereby improves the accuracy of the segmentation result.

Description

Method for performing image segmentation using a deep neural network model
Technical Field
The present disclosure relates to the field of image segmentation, and more particularly, to a method of performing image segmentation using a deep neural network model.
Background
Image segmentation, as an image processing technique, plays an important role in many fields of real life; for example, in the medical field, a region of interest (e.g., a tumor region) often needs to be segmented. Tumor region segmentation has traditionally been performed by manual identification and delineation, but manual identification and segmentation of tumor regions faces many obstacles: (1) the volume of pathology slides increases by more than 15% per year, greatly increasing the daily diagnostic workload of doctors; (2) the morphological differences between tumor regions and normal regions are small, making interpretation difficult for pathologists; (3) the accuracy of clinical pathological diagnosis is directly related to a pathologist's day-to-day learning, research experience, and the number of slides read, and studies show that different pathologists often differ subjectively when interpreting the same section; (4) medical resources are unevenly distributed in China, and because pathologists in many regions lack experience, many patients have to travel thousands of miles to seek renowned doctors, which consumes substantial resources and further limits the training of pathologists in large numbers. Over time, regional differences in medical resources and medical conditions widen further, and the Matthew effect concentrates ever more patients on a few top doctors, generating more doctor-patient, regional, and social tensions. For this reason, an automated tumor region segmentation method is needed to assist pathologists in routine pathological section analysis.
For example, in the routine interpretation of immunohistochemical membrane staining, accurate quantitative statistics of tumor cells are of great importance for the grading diagnosis of cancer. Tumor region segmentation can help a pathologist first lock onto a region of interest, within which accurate quantitative statistics can then be computed to finally reach a pathological diagnosis. Accurate and reliable segmentation of tumor regions is therefore key to pathological diagnosis.
At present, although some automatic image segmentation methods exist, the accuracy of region-of-interest segmentation still cannot meet the requirements of practical applications. The core idea of conventional automatic segmentation methods is to classify each pixel of an image and then output a mask to complete segmentation of the whole image. However, such pixel-based segmentation loses the continuity of whole objects in the image, resulting in low accuracy of the segmentation result.
Therefore, an image segmentation method that can segment a region of interest more accurately is required.
Disclosure of Invention
In view of the above problems, the present disclosure provides a method for performing image segmentation using a deep neural network model, which can obtain an accurate region-of-interest segmentation result by extracting multi-scale semantic features of an image and integrating prediction results corresponding to the respective scale semantic features.
The embodiment of the present disclosure provides a method for performing image segmentation using a deep neural network model, wherein the deep neural network model comprises an encoder and a decoder, and the method comprises: acquiring an input image containing a region of interest; extracting semantic features of the input image at a plurality of different scales using a plurality of encoding layers of the encoder; decoding the semantic features of the plurality of scales using a plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale; integrating the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and outputting an output image in which the region of interest is segmented from the input image according to the final prediction result.
According to the embodiment of the disclosure, the plurality of coding layers comprise N coding layers, wherein the size of the semantic features of the nth scale extracted by the nth coding layer is smaller than the size of the semantic features of the (n-1)th scale extracted by the (n-1)th coding layer, the Nth scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and n is less than or equal to N and greater than or equal to 2.
According to an embodiment of the present disclosure, wherein the coding layer includes a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer.
According to the embodiment of the present disclosure, where N is 4, the extracting semantic features of the input image at a plurality of different scales using the plurality of coding layers of the encoder includes: extracting semantic features of the input image at a first scale using a first coding layer of the plurality of coding layers; extracting semantic features of the input image at a second scale using a second coding layer of the plurality of coding layers based on the semantic features of the first scale; extracting semantic features of the input image at a third scale using a third coding layer of the plurality of coding layers based on the semantic features of the second scale; and extracting semantic features of the input image at a fourth scale using a fourth coding layer of the plurality of coding layers based on the semantic features of the third scale.
According to the embodiment of the present disclosure, after the Nth coding layer extracts the semantic features of the Nth scale, the method further includes: inputting the semantic features of the Nth scale together with the input image into a spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure includes: a plurality of first convolution layers with different dilation rates, configured to perform atrous (dilated) convolution on the semantic features of the Nth scale to further capture multi-scale information; a second convolution layer, configured to further convolve the semantic features of the Nth scale to enhance the coupling between semantic feature channels; and a pooling layer, configured to pool the input image to obtain image-level semantic features.
According to the embodiment of the present disclosure, the decoding layers include N decoding layers, and the decoding the semantic features of the plurality of scales by using the decoding layers of the decoder includes: and decoding by using an nth decoding layer corresponding to the nth coding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale based on the semantic features of the nth scale extracted by the nth coding layer.
According to the embodiment of the present disclosure, where N is 4, the decoding with the nth decoding layer corresponding to the nth coding layer to obtain the nth prediction result based on the semantic feature of the nth scale extracted by the nth coding layer includes: decoding by using a first decoding layer corresponding to the first coding layer to obtain a first prediction result based on semantic features of a first scale extracted by the first coding layer; decoding by using a second decoding layer corresponding to the second coding layer to obtain a second prediction result based on the semantic features of the second scale extracted by the second coding layer; decoding with a third decoding layer corresponding to the third coding layer to obtain a third prediction result based on the semantic features of the third scale extracted by the third coding layer; and decoding by using a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result based on the semantic features of the fourth scale extracted by the fourth coding layer.
According to an embodiment of the present disclosure, decoding the semantic features of the plurality of scales using the plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale includes: fusing the semantic features of the nth scale with the semantic features of the (n-1)th scale, and decoding with the (n-1)th decoding layer based on the fused semantic features to obtain the (n-1)th prediction result.
According to the embodiment of the disclosure, fusing the semantic features of the nth scale with the semantic features of the (n-1)th scale and decoding with the (n-1)th decoding layer based on the fused semantic features to obtain the (n-1)th prediction result comprises: upsampling the semantic features of the nth scale, splicing the upsampled result with the semantic features of the (n-1)th scale, upsampling the spliced result, and decoding with the (n-1)th decoding layer based on the upsampled semantic features to obtain the (n-1)th prediction result.
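For illustration only, a minimal PyTorch-style sketch of this upsample-splice-upsample-decode step is given below; the function and argument names are assumptions for readability, not part of the disclosure:

```python
import torch
import torch.nn.functional as F

def fuse_and_decode(feat_n, feat_prev, decoding_layer_prev):
    # Upsample the nth-scale semantic features to the (n-1)th scale.
    up = F.interpolate(feat_n, size=feat_prev.shape[2:],
                       mode="bilinear", align_corners=False)
    # Splice (concatenate) the upsampled result with the (n-1)th-scale
    # semantic features along the channel dimension.
    spliced = torch.cat([up, feat_prev], dim=1)
    # Upsample the spliced result, then decode it with the (n-1)th
    # decoding layer to obtain the (n-1)th prediction result.
    spliced_up = F.interpolate(spliced, scale_factor=2,
                               mode="bilinear", align_corners=False)
    return decoding_layer_prev(spliced_up)
```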
According to the embodiment of the present disclosure, where N is 4, decoding the semantic features of the plurality of scales by using a plurality of decoding layers of a decoder to obtain a prediction result corresponding to the semantic features of each scale includes: the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result; the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the third scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a third decoding layer is used for decoding based on the up-sampled semantic features to obtain a third prediction result; the first splicing result is subjected to upsampling, the upsampled result is spliced with the semantic features of the second scale to obtain a second splicing result, the second splicing result is subjected to upsampling, and decoding is carried out by utilizing a second decoding layer based on the upsampled semantic features to obtain a second prediction result; and upsampling the second splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a third splicing result, upsampling the third splicing result, and decoding by using the first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to an embodiment of the present disclosure, decoding the semantic features of the plurality of scales with the plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale includes: fusing the semantic features of the mth scale with the semantic features of one of the 1st to (m-2)th scales, and decoding with the decoding layer corresponding to that one scale based on the fused semantic features to obtain a prediction result corresponding to that one scale, wherein m is less than or equal to N and greater than 2.
According to the embodiment of the present disclosure, the fusing of the semantic features of the mth scale with the semantic features of one of the 1st to (m-2)th scales and the decoding with the decoding layer corresponding to that one scale based on the fused semantic features to obtain the prediction result corresponding to that one scale includes: upsampling the semantic features of the mth scale, splicing the upsampled result with the semantic features of that one scale, upsampling the spliced result, and decoding with the decoding layer corresponding to that one scale based on the upsampled semantic features to obtain the prediction result corresponding to that one scale.
According to the embodiment of the present disclosure, where N is 4, decoding the semantic features of the plurality of scales by using a plurality of decoding layers of a decoder to obtain a prediction result corresponding to the semantic features of each scale includes: the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result; the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the second scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a second decoding layer is used for decoding based on the up-sampled semantic features to obtain a second prediction result; the semantic features of the third scale are subjected to up-sampling, and decoding is carried out by utilizing a third decoding layer on the basis of the up-sampled semantic features to obtain a third prediction result; and performing upsampling on the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, performing upsampling on the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to the embodiment of the present disclosure, the decoding, by using a plurality of decoding layers of a decoder, the semantic features of the plurality of scales to obtain a prediction result corresponding to the semantic feature of each scale, further includes: and performing upsampling after splicing the further extracted semantic features, and obtaining an Nth prediction result corresponding to the semantic features of the Nth scale by utilizing an Nth decoding layer based on the upsampled semantic features.
According to the embodiment of the present disclosure, the integrating the prediction results corresponding to the semantic features of each scale includes: and integrating the prediction results of the multiple decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer and the prediction results of the adjacent upper layer to obtain the final prediction result.
According to the embodiment of the present disclosure, the integrating of the prediction results of the multiple decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, sequentially integrating the prediction result of a lower layer with the prediction result of the adjacent upper layer to obtain the final prediction result, includes: in the case where n is greater than 2, integrating the nth prediction result with the (n-1)th prediction result, and integrating the integrated result with the (n-2)th prediction result; in the case where n is equal to 2, integrating the nth prediction result with the (n-1)th prediction result.
According to the embodiment of the present disclosure, where N is 4, the integrating the prediction results corresponding to the semantic features of each scale includes: and integrating the fourth prediction result with the third prediction result, further integrating the integrated result with the second prediction result, integrating the further integrated result with the first prediction result, and taking the integrated result as a final prediction result.
According to the embodiment of the present disclosure, the integrating includes comparing the two objects to be integrated at the pixel level and taking, for each pixel, the maximum predicted value as the final predicted result.
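As a non-authoritative sketch, this pixel-level integration amounts to an element-wise maximum over prediction maps of identical shape:

```python
import torch

def integrate(pred_a: torch.Tensor, pred_b: torch.Tensor) -> torch.Tensor:
    # Compare the two objects to be integrated at the pixel level and
    # keep the maximum predicted value for each pixel.
    return torch.maximum(pred_a, pred_b)

# Bottom-up integration for N = 4 (p4 is the smallest-scale prediction):
# final = integrate(integrate(integrate(p4, p3), p2), p1)
```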
According to the embodiment of the present disclosure, the deep neural network model is trained by using a deep supervision manner, wherein in the deep neural network model training process, a backward-propagated gradient is obtained by calculating a prediction loss of each of the plurality of decoding layers of the decoder so as to update parameters of each encoding layer in the encoder.
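A hedged sketch of the deep-supervision training step follows; the loss function and optimizer are assumptions, since the source only states that each decoding layer's prediction loss contributes to the back-propagated gradient:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # assumed loss; not specified in the source

def training_step(model, optimizer, image, mask):
    # The model is assumed to return one prediction map per decoding layer.
    preds = model(image)  # e.g. [p1, p2, p3, p4]
    # Deep supervision: sum the prediction loss of every decoding layer so
    # that gradients flow back through, and update, every encoding layer.
    loss = sum(criterion(p, mask) for p in preds)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```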
The embodiment of the present disclosure provides an apparatus for performing image segmentation by using a deep neural network model, wherein the deep neural network model comprises an encoder and a decoder, and the apparatus comprises: an acquisition module configured to acquire an input image containing a region of interest; an extraction module configured to extract semantic features of the input image at different scales respectively by using a plurality of encoding layers of an encoder; a decoding module configured to decode the semantic features of the plurality of scales respectively by using a plurality of decoding layers of a decoder to obtain a prediction result corresponding to the semantic features of each scale; an integration module configured to integrate the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and an output module configured to output an output image in which the region of interest is segmented from the input image according to the final prediction result.
According to the embodiment of the disclosure, the plurality of coding layers comprise N coding layers, wherein the size of the semantic features of the nth scale extracted by the nth coding layer is smaller than the size of the semantic features of the (n-1)th scale extracted by the (n-1)th coding layer, the Nth scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and n is less than or equal to N and greater than or equal to 2.
According to an embodiment of the present disclosure, wherein the coding layer includes a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer.
According to an embodiment of the present disclosure, where N is 4, the extraction module is configured to: extract semantic features of the input image at a first scale using a first coding layer of the plurality of coding layers; extract semantic features of the input image at a second scale using a second coding layer of the plurality of coding layers based on the semantic features of the first scale; extract semantic features of the input image at a third scale using a third coding layer of the plurality of coding layers based on the semantic features of the second scale; and extract semantic features of the input image at a fourth scale using a fourth coding layer of the plurality of coding layers based on the semantic features of the third scale.
According to the embodiment of the present disclosure, after the Nth coding layer extracts the semantic features of the Nth scale, the extraction module is further configured to input the semantic features of the Nth scale together with the input image into a spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure includes: a plurality of first convolution layers with different dilation rates, configured to perform atrous (dilated) convolution on the semantic features of the Nth scale to further capture multi-scale information; a second convolution layer, configured to further convolve the semantic features of the Nth scale to enhance the coupling between semantic feature channels; and a pooling layer, configured to pool the input image to obtain image-level semantic features.
According to the embodiment of the present disclosure, the decoding layers include N decoding layers, and the decoding the semantic features of the plurality of scales by using the decoding layers of the decoder includes: and decoding by using an nth decoding layer corresponding to the nth coding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale based on the semantic features of the nth scale extracted by the nth coding layer.
According to the embodiment of the present disclosure, where N is 4, the decoding with the nth decoding layer corresponding to the nth coding layer to obtain the nth prediction result based on the semantic feature of the nth scale extracted by the nth coding layer includes: decoding by using a first decoding layer corresponding to the first coding layer to obtain a first prediction result based on semantic features of a first scale extracted by the first coding layer; decoding by using a second decoding layer corresponding to the second coding layer to obtain a second prediction result based on the semantic features of the second scale extracted by the second coding layer; decoding with a third decoding layer corresponding to the third coding layer to obtain a third prediction result based on the semantic features of the third scale extracted by the third coding layer; and decoding by using a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result based on the semantic features of the fourth scale extracted by the fourth coding layer.
According to the embodiment of the disclosure, the decoding module is configured to: fuse the semantic features of the nth scale with the semantic features of the (n-1)th scale, and decode with the (n-1)th decoding layer based on the fused semantic features to obtain the (n-1)th prediction result.
According to the embodiment of the disclosure, fusing the semantic features of the nth scale with the semantic features of the (n-1)th scale and decoding with the (n-1)th decoding layer based on the fused semantic features to obtain the (n-1)th prediction result comprises: upsampling the semantic features of the nth scale, splicing the upsampled result with the semantic features of the (n-1)th scale, upsampling the spliced result, and decoding with the (n-1)th decoding layer based on the upsampled semantic features to obtain the (n-1)th prediction result.
According to an embodiment of the present disclosure, where N is 4, the decoding module includes: the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result; the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the third scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a third decoding layer is used for decoding based on the up-sampled semantic features to obtain a third prediction result; the first splicing result is subjected to upsampling, the upsampled result is spliced with the semantic features of the second scale to obtain a second splicing result, the second splicing result is subjected to upsampling, and decoding is carried out by utilizing a second decoding layer based on the upsampled semantic features to obtain a second prediction result; and upsampling the second splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a third splicing result, upsampling the third splicing result, and decoding by using the first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to the embodiment of the disclosure, the decoding module is configured to: fuse the semantic features of the mth scale with the semantic features of one of the 1st to (m-2)th scales, and decode with the decoding layer corresponding to that one scale based on the fused semantic features to obtain a prediction result corresponding to that one scale, wherein m is less than or equal to N and greater than 2.
According to the embodiment of the present disclosure, the fusing of the semantic features of the mth scale with the semantic features of one of the 1st to (m-2)th scales and the decoding with the decoding layer corresponding to that one scale based on the fused semantic features to obtain the prediction result corresponding to that one scale includes: upsampling the semantic features of the mth scale, splicing the upsampled result with the semantic features of that one scale, upsampling the spliced result, and decoding with the decoding layer corresponding to that one scale based on the upsampled semantic features to obtain the prediction result corresponding to that one scale.
According to an embodiment of the present disclosure, where N is 4, the decoding module includes: the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result; the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the second scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a second decoding layer is used for decoding based on the up-sampled semantic features to obtain a second prediction result; the semantic features of the third scale are subjected to up-sampling, and decoding is carried out by utilizing a third decoding layer on the basis of the up-sampled semantic features to obtain a third prediction result; and performing upsampling on the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, performing upsampling on the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to the embodiment of the present disclosure, the decoding module further includes: and performing upsampling after splicing the further extracted semantic features, and obtaining an Nth prediction result corresponding to the semantic features of the Nth scale by utilizing an Nth decoding layer based on the upsampled semantic features.
According to the embodiment of the present disclosure, the integrating the prediction results corresponding to the semantic features of each scale includes: and integrating the prediction results of the multiple decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer and the prediction results of the adjacent upper layer to obtain the final prediction result.
According to the embodiment of the present disclosure, the integrating of the prediction results of the multiple decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, sequentially integrating the prediction result of a lower layer with the prediction result of the adjacent upper layer to obtain the final prediction result, includes: in the case where n is greater than 2, integrating the nth prediction result with the (n-1)th prediction result, and integrating the integrated result with the (n-2)th prediction result; in the case where n is equal to 2, integrating the nth prediction result with the (n-1)th prediction result.
According to the embodiment of the present disclosure, where N is 4, the integrating the prediction results corresponding to the semantic features of each scale includes: and integrating the fourth prediction result with the third prediction result, further integrating the integrated result with the second prediction result, integrating the further integrated result with the first prediction result, and taking the integrated result as a final prediction result.
According to the embodiment of the present disclosure, the integrating includes comparing the two objects to be integrated at the pixel level and taking, for each pixel, the maximum predicted value as the final predicted result.
According to the embodiment of the present disclosure, the deep neural network model is trained by using a deep supervision manner, wherein in the deep neural network model training process, a backward-propagated gradient is obtained by calculating a prediction loss of each of the plurality of decoding layers of the decoder so as to update parameters of each encoding layer in the encoder.
The embodiment of the present disclosure provides an apparatus for performing image segmentation by using a deep neural network model, including: a processor, and a memory storing computer-executable instructions that, when executed by the processor, cause the processor to perform the above-described method.
The disclosed embodiments provide a computer-readable recording medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, cause the processor to perform the above-mentioned method.
Embodiments of the present disclosure provide a method, apparatus, device, and medium for performing image segmentation using a deep neural network model. According to the method for performing image segmentation by using the deep neural network model, provided by the disclosure, the continuity of the whole region of interest segmented in the image is improved by extracting the multi-scale semantic features of the image and integrating the prediction results corresponding to the semantic features of each scale, so that the accuracy of the segmentation result is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly introduced below. It is apparent that the drawings in the following description are only exemplary embodiments of the disclosure, and that other drawings may be derived from those drawings by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart illustrating an image segmentation method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating an image segmentation method according to a first embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating an image segmentation method according to a second embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating an image segmentation method according to a third embodiment of the present disclosure;
fig. 5 is a diagram illustrating an example of a segmented image obtained using an image segmentation method according to an embodiment of the present disclosure;
fig. 6 is a block diagram illustrating an image segmentation apparatus according to an embodiment of the present disclosure;
fig. 7 is a structural diagram illustrating an image segmentation apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
In the present specification and the drawings, substantially the same or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
In the specification and drawings, elements are described in singular or plural according to embodiments. However, the singular and plural forms are appropriately selected for the proposed cases only for convenience of explanation and are not intended to limit the present disclosure thereto. Thus, the singular may include the plural and the plural may also include the singular, unless the context clearly dictates otherwise.
Currently, image semantic segmentation algorithms can be used for image segmentation. Image semantic segmentation methods mainly comprise methods based on hand-crafted semantic features and methods based on deep semantic features. The former manually extracts semantic features such as texture, gray level, and edges from an image, and then performs pixel-level classification using thresholding, pixel-clustering-based segmentation, or graph-cut segmentation. The latter mainly uses deep learning to automatically extract semantic features from the image, decodes the deep semantic features by means of deconvolution, upsampling, activation, and the like, and finally classifies the pixels to obtain a segmentation result.
However, such conventional pixel-based segmentation methods lose the continuity of whole objects in the image, and the accuracy of the segmentation result is not high.
To solve the above problem, the present disclosure provides a method of performing image segmentation using a deep neural network model. According to the method for performing image segmentation by using the deep neural network model, provided by the disclosure, the continuity of the whole region of interest segmented in the image is improved by extracting the multi-scale semantic features of the image and integrating the prediction results corresponding to the semantic features of each scale, so that the accuracy of the segmentation result is improved.
The image segmentation method provided by the present disclosure described above will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating an image segmentation method according to an embodiment of the present disclosure.
Referring to fig. 1, in step S110, an input image including a region of interest is acquired.
Here, the region of interest may be any target region of interest to the user, for example, the region of interest may be a tumor region, but is not limited thereto. It should be noted that, according to the different application fields of the image segmentation method of the present disclosure, the input image and the region of interest may be changed accordingly. In addition, the present disclosure is also not limited to the form of the input image, and for example, the input image may be a grayscale image or a color image. In addition, the present disclosure also has no limitation on the manner of acquiring the input image, and for example, the input image including the region of interest may be captured in response to a user request, a previously captured input image may be directly acquired from an external device, or the like. In addition, the input image may be an image of various resolutions, for example, an image of 512 × 512 resolution.
After the input image is acquired, image segmentation may be performed using a deep neural network model. Here, the deep neural network model may include an encoder and a decoder. The encoder may include a plurality of encoding layers and the decoder may include a plurality of decoding layers.
Deep learning based on convolutional neural networks is difficult to apply in practice because it lacks interpretability. Existing methods for increasing interpretability mainly visualize feature heat maps to show the image regions the network attends to; however, such methods still differ greatly from human cognition and are not accepted by actual users. The present disclosure finds that combining deep learning with cognitive psychology theory can further improve the interpretability of deep-learning-based image segmentation while improving its accuracy. Feature integration theory focuses on early visual processing and can be divided into two stages.
(1) Feature registration stage: feature registration helps a person perform a guided search of the surrounding environment. Two processes occur in this stage: feature extraction and feature encoding. First, the visual system extracts features from the visual scene, a parallel and automatic process. At this stage only individual features can be detected, including color, size, orientation, contrast, slope, curvature, and the endpoints of line segments; motion and differences in depth may also be included. These features are free-floating (not bound to the objects to which they belong), and their positions are subjectively uncertain. The perceptual system then encodes the features of each dimension independently, and the result of this encoding is called a feature map. From the viewpoint of deep learning, the feature registration stage of feature integration theory is similar to shallow convolution operations, which extract basic features.
(2) Feature integration stage: the perceptual system correctly connects different features (feature representations) together to obtain a representation that can describe the object. At this stage, feature localization, i.e., determining the boundaries of the feature space, is required first. Next, focused attention places more attention on features at particular locations. Finally, the original features are integrated. In deep learning, the feature decoding process corresponds to this feature integration stage.
In cognitive psychology, psychologists have found the human visual cognitive process to be a bottom-up process. In neuroscience, information from higher cortical areas also flows to lower cortical areas, and this higher-level information guides the lower cortex to focus on specific regions. What feature integration theory, the visual pathway model of neuroscience, and cognitive psychology have in common is that they all summarize and demonstrate the effectiveness and validity of both a bottom-up process and a top-down process. Inspired by these theories, the present disclosure combines the human visual cognitive process with deep learning technology, which can effectively achieve accurate segmentation of the region of interest.
Specifically, in step S120, semantic features of different scales of the input image can be respectively extracted by using a plurality of encoding layers of an encoder.
For example, the plurality of encoding layers may include N encoding layers, wherein the size of the semantic features of the nth scale extracted by the nth encoding layer is smaller than the size of the semantic features of the (n-1)th scale extracted by the (n-1)th encoding layer, the Nth scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and n is less than or equal to N and greater than or equal to 2.
According to an embodiment of the present disclosure, the encoding layer may include a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer. Since these layers are well known to those skilled in the art, their details are not repeated here.
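For concreteness, one plausible composition of such a coding layer is sketched below; the kernel size and pooling stride are illustrative assumptions:

```python
import torch.nn as nn

def coding_layer(in_ch: int, out_ch: int) -> nn.Sequential:
    # Convolution -> batch normalization -> activation -> pooling,
    # halving the spatial size while increasing the channel count.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )
```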
As an example, N may be 4. In this case, as shown in fig. 2, a semantic feature of the input image at a first scale may be first extracted using a first encoding layer (denoted as "encoding layer 1" in fig. 2) of the plurality of encoding layers; secondly, extracting semantic features of a second scale of the input image by using a second coding layer (represented as 'coding layer 2' in fig. 2) in the plurality of coding layers based on the semantic features of the first scale; next, extracting semantic features of a third scale of the input image by using a third coding layer (denoted as "coding layer 3" in fig. 2) of the plurality of coding layers based on the semantic features of the second scale; finally, based on the semantic features of the third scale, the semantic features of the fourth scale of the input image are extracted by using a fourth coding layer (represented as "coding layer 4" in fig. 2) of the plurality of coding layers.
As shown in fig. 2, for example, in the case where the input image is "512 × 512 × 1" (where 512 × 512 is the resolution (i.e., size) of the input image and 1 is the number of its channels), the semantic features of the first scale may be 256 × 256 × 64 semantic features, where 256 × 256 is the resolution (i.e., size) of the semantic features and 64 is the number of semantic feature channels; the semantic features of the second scale may be 128 × 128 × 128 semantic features, where 128 × 128 is the resolution and 128 is the number of channels; the semantic features of the third scale may be 64 × 64 × 256 semantic features, where 64 × 64 is the resolution and 256 is the number of channels; and the semantic features of the fourth scale may be 32 × 32 × 512 semantic features, where 32 × 32 is the resolution and 512 is the number of channels. The size of the semantic features extracted by the coding layers thus decreases progressively from top to bottom, so that semantic features of multiple scales of the input image are extracted.
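The scale progression above can be reproduced with the illustrative layer sketched earlier (a stand-in, not the patented encoder):

```python
import torch

# Four illustrative coding layers: 1 -> 64 -> 128 -> 256 -> 512 channels.
layers = [coding_layer(1, 64), coding_layer(64, 128),
          coding_layer(128, 256), coding_layer(256, 512)]

x = torch.randn(1, 1, 512, 512)  # a 512 x 512 single-channel input image
for i, layer in enumerate(layers, start=1):
    x = layer(x)
    print(f"scale {i}: {tuple(x.shape)}")
# Prints (1, 64, 256, 256), (1, 128, 128, 128),
#        (1, 256, 64, 64), (1, 512, 32, 32).
```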
According to the embodiment of the present disclosure, after the Nth coding layer extracts the semantic features of the Nth scale, the method may further include: inputting the semantic features of the Nth scale together with the input image into a spatial pyramid structure for further semantic feature extraction.
As an example, when N is 4, referring to fig. 2, after the fourth coding layer extracts the semantic features of the fourth scale, the semantic features of the fourth scale (32 × 32 × 512) and the input image (512 × 512 × 1) may be further input into the spatial pyramid structure for further extraction of semantic features.
According to an embodiment of the present disclosure, the spatial pyramid structure may include a plurality of first convolution layers with different dilation rates, a second convolution layer, and a pooling layer. Specifically, the plurality of first convolution layers with different dilation rates may be used to perform atrous (dilated) convolution on the semantic features of the Nth scale to further capture multi-scale information; the second convolution layer may be used to further convolve the semantic features of the Nth scale to enhance the coupling between semantic feature channels; and the pooling layer may be used to pool the input image to obtain image-level semantic features. The spatial pyramid structure is described later with reference to fig. 4 and is not detailed here.
The process of extracting multi-scale features from top to bottom can simulate the visual cognitive process: lower-level neurons extract low-level features and continuously pass information to higher levels, and the higher levels re-extract and recombine the low-level features. This process resembles a commonly used convolutional neural network (CNN); thus, as an example, it can be simulated with a residual network (ResNet).
The residual network may be composed of four residual blocks, each of which may consist of a convolutional layer, a batch normalization layer, an activation function layer, a dropout layer, and a fully connected layer. Those skilled in the art know how to extract semantic features of an image using a residual network, so the details are not repeated here.
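A generic residual block along these lines might be sketched as follows; the exact arrangement is an assumption, and the fully connected layer mentioned above is omitted in this convolutional variant:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Convolution + batch normalization + activation with dropout and a
    # skip connection, in the spirit of the description above.
    def __init__(self, channels: int, p_drop: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Add the input back to the transformed features (the residual).
        return self.act(x + self.body(x))
```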
Referring back to fig. 1, after the semantic features of the input image at the plurality of different scales are obtained, in step S130, the semantic features of the plurality of scales may be decoded using a plurality of decoding layers of a decoder, respectively, to obtain a prediction result corresponding to the semantic features of each scale.
According to an embodiment of the present disclosure, the plurality of decoding layers may include N decoding layers. Specifically, in step S130, based on the semantic features of the nth scale extracted by the nth coding layer, the nth decoding layer corresponding to the nth coding layer may be used for decoding to obtain the nth prediction result corresponding to the semantic features of the nth scale.
As an example, N may be 4. In this case, referring again to fig. 2, a first prediction result may be obtained by first decoding with a first decoding layer (denoted as "decoding layer 1" in fig. 2) corresponding to the first coding layer based on the semantic features of the first scale extracted by the first coding layer; secondly, a second prediction result may be obtained by decoding with a second decoding layer (denoted as "decoding layer 2" in fig. 2) corresponding to the second coding layer based on the semantic features of the second scale extracted by the second coding layer; next, a third prediction result may be obtained by decoding with a third decoding layer (denoted as "decoding layer 3" in fig. 2) corresponding to the third encoding layer based on the semantic features of the third scale extracted by the third encoding layer; finally, a fourth prediction result may be obtained by decoding with a fourth decoding layer (denoted as "decoding layer 4" in fig. 2) corresponding to the fourth encoding layer based on the semantic features of the fourth scale extracted by the fourth encoding layer.
Specifically, the prediction result corresponding to the semantic features of each scale may be obtained by decoding the semantic features of the plurality of scales in the following manner: and fusing the semantic features of the nth scale and the semantic features of the (n-1) th scale, and decoding by using an (n-1) th decoding layer based on the fused semantic features to obtain an (n-1) th prediction result. For example, the semantic features of the nth scale may be first up-sampled, then the up-sampled result may be concatenated with the semantic features of the nth-1 scale, then the concatenated result may be up-sampled, and the n-1 prediction result may be obtained by decoding the n-1 decoding layer based on the up-sampled semantic features.
As an example, the splicing may be performed along the channel dimension.
As an example, a low-resolution semantic feature (i.e., a small-scale semantic feature) may be upsampled to a high-resolution semantic feature (i.e., a large-scale semantic feature) by linear interpolation (e.g., bilinear interpolation) and spliced with the semantic features of the same scale. The above fusion process can be represented by the following formula:
X = f(X1) + Bilinear(f(X2))
where f(·) is the convolution operation performed by the coding layer (in this context, the number of convolution kernels may be, for example, 256), X1 and X2 are coding-layer outputs, Bilinear(·) is bilinear interpolation, f(X1) is the semantic feature extracted by performing the convolution operation on X1, and f(X2) is the semantic feature extracted by performing the convolution operation on X2.
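In code, the fusion formula might read as below, with f modeled as a 1 × 1 convolution with 256 kernels per the example in the text; this is a sketch under stated assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    # X = f(X1) + Bilinear(f(X2)): convolve both coding-layer outputs,
    # bilinearly upsample the small-scale branch, then combine.
    def __init__(self, ch1: int, ch2: int, out_ch: int = 256):
        super().__init__()
        self.f1 = nn.Conv2d(ch1, out_ch, kernel_size=1)
        self.f2 = nn.Conv2d(ch2, out_ch, kernel_size=1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(self.f2(x2), size=x1.shape[2:],
                           mode="bilinear", align_corners=False)
        return self.f1(x1) + up
```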
In this fusion mode, the small-scale semantic features and the large-scale semantic features are fused in sequence according to the semantic feature scale, and the fused semantic features are used to obtain the corresponding prediction result. For example, N may be 4; in this case, referring to fig. 2, first, the semantic features of the fourth scale may be upsampled, and decoding may be performed by the fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result (denoted as "512 × 512 × 2" in fig. 2);
secondly, the semantic features of the fourth scale may be upsampled, the upsampled result may be spliced with the semantic features of the third scale to obtain a first splicing result (denoted as "64 × 64 × 768" in fig. 2), the first splicing result may be upsampled, and decoding may be performed by the third decoding layer based on the upsampled semantic features to obtain a third prediction result (denoted as "512 × 512 × 2" in fig. 2);
then, the first splicing result may be upsampled, the upsampled result may be spliced with the semantic features of the second scale to obtain a second splicing result (denoted as "128 × 128 × 896" in fig. 2), the second splicing result may be upsampled, and decoding may be performed by the second decoding layer based on the upsampled semantic features to obtain a second prediction result (denoted as "512 × 512 × 2" in fig. 2);
finally, the second splicing result may be upsampled, the upsampled result may be spliced with the semantic features of the first scale to obtain a third splicing result (denoted as "256 × 256 × 960" in fig. 2), the third splicing result may be upsampled, and decoding may be performed by the first decoding layer based on the upsampled semantic features to obtain a first prediction result (denoted as "512 × 512 × 2" in fig. 2).
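Putting the four steps together, the decoding chain of fig. 2 can be sketched as follows; each decoding layer decX is assumed to map its input to a 512 × 512 × 2 prediction map:

```python
import torch
import torch.nn.functional as F

def up2(x):
    # Bilinear 2x upsampling helper.
    return F.interpolate(x, scale_factor=2, mode="bilinear",
                         align_corners=False)

def decode_all(f1, f2, f3, f4, dec1, dec2, dec3, dec4):
    # f1..f4: semantic features from 256x256x64 down to 32x32x512.
    p4 = dec4(up2(f4))                    # fourth prediction result
    s1 = torch.cat([up2(f4), f3], dim=1)  # first splicing result, 64x64x768
    p3 = dec3(up2(s1))                    # third prediction result
    s2 = torch.cat([up2(s1), f2], dim=1)  # second splicing result, 128x128x896
    p2 = dec2(up2(s2))                    # second prediction result
    s3 = torch.cat([up2(s2), f1], dim=1)  # third splicing result, 256x256x960
    p1 = dec1(up2(s3))                    # first prediction result
    return p1, p2, p3, p4
```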
Optionally, according to another embodiment of the present disclosure, the prediction result corresponding to the semantic features of each scale may also be obtained by decoding the semantic features of the plurality of scales in the following manner: fusing the semantic features of the mth scale with the semantic features of one of the 1st to (m-2)th scales, and decoding with the decoding layer corresponding to that one scale based on the fused semantic features to obtain a prediction result corresponding to that one scale, wherein m is less than or equal to N and greater than 2.
For example, the semantic features of the mth scale may first be up-sampled, the up-sampled result may then be spliced with the semantic features of that one scale, the splicing result may be up-sampled, and the decoding layer corresponding to that one scale may be used to decode based on the up-sampled semantic features to obtain the prediction result corresponding to that one scale.
In this alternative, the small-scale and large-scale semantic features need not be fused strictly in order of semantic feature scale; skip connections may exist instead, which reduces time complexity, saves computational resources, and helps obtain the prediction results quickly.
Specifically, for example, in the case where N is 4, referring to fig. 3, and with the input image being "512×512×1" (where 512×512 is the resolution (i.e., size) of the input image and 1 is the number of its channels), first, the semantic features of the fourth scale may be up-sampled, and decoding may be performed by the fourth decoding layer based on the up-sampled semantic features to obtain a fourth prediction result (denoted as "512×512×2" in fig. 3);
secondly, the semantic features of the fourth scale may be up-sampled, the up-sampled result may be spliced with the semantic features of the second scale to obtain a first splicing result (denoted as "128×128×640" in fig. 3), the first splicing result may be up-sampled, and the second decoding layer may decode based on the up-sampled semantic features to obtain a second prediction result (denoted as "512×512×2" in fig. 3);
then, the semantic features of the third scale are up-sampled, and decoding is performed by the third decoding layer based on the up-sampled semantic features to obtain a third prediction result (denoted as "512×512×2" in fig. 3);
finally, the first splicing result is up-sampled, the up-sampled result is spliced with the semantic features of the first scale to obtain a second splicing result (denoted as "256×256×704" in fig. 3), the second splicing result is up-sampled, and decoding is performed by the first decoding layer based on the up-sampled semantic features to obtain a first prediction result (denoted as "512×512×2" in fig. 3).
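For comparison, a sketch of this skip-connected decode order under the same assumptions as the previous sketch might look as follows; note that the fourth-scale features are up-sampled by a factor of 4 to reach the second scale directly, and the third scale is decoded independently.

# Sketch of the Fig. 3-style skip-connected decode order; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def up(x, factor):
    return F.interpolate(x, scale_factor=factor, mode="bilinear", align_corners=False)

class SkipDecoder(nn.Module):
    def __init__(self, chans=(64, 128, 256, 512), n_classes=2):
        super().__init__()
        c1, c2, c3, c4 = chans
        self.head4 = nn.Conv2d(c4, n_classes, 1)
        self.head3 = nn.Conv2d(c3, n_classes, 1)
        self.head2 = nn.Conv2d(c4 + c2, n_classes, 1)        # 512+128 = 640
        self.head1 = nn.Conv2d(c4 + c2 + c1, n_classes, 1)   # 640+64  = 704

    def forward(self, f1, f2, f3, f4, out_size=(512, 512)):
        p4 = self.head4(up(f4, 2))                 # 4th prediction
        cat1 = torch.cat([up(f4, 4), f2], dim=1)   # 128x128x640 splicing result
        p2 = self.head2(up(cat1, 2))               # 2nd prediction (skip connection)
        p3 = self.head3(up(f3, 2))                 # 3rd prediction, decoded alone
        cat2 = torch.cat([up(cat1, 2), f1], dim=1) # 256x256x704 splicing result
        p1 = self.head1(up(cat2, 2))               # 1st prediction
        return [F.interpolate(p, size=out_size, mode="bilinear", align_corners=False)
                for p in (p1, p2, p3, p4)]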
In both of the above ways of obtaining the prediction results, the semantic features of the Nth scale (i.e., the smallest-scale semantic features) are directly up-sampled, and the Nth decoding layer decodes based on the up-sampled semantic features to obtain the Nth prediction result (i.e., the prediction result of the highest layer). The disclosure, however, is not limited thereto.
Optionally, as described above, the semantic features of the Nth scale and the input image may first be input into the spatial pyramid structure for further semantic feature extraction; the further extracted semantic features are then spliced and up-sampled, and the Nth prediction result corresponding to the semantic features of the Nth scale is obtained by the Nth decoding layer based on the up-sampled semantic features.
For example, as shown in fig. 4, with the input image being "512×512×1" (where 512×512 is the resolution (i.e., size) of the input image and 1 is the number of its channels), the semantic features at the fourth scale (denoted as "32×32×512" in fig. 4) extracted by the fourth encoding layer (denoted as "encoding layer 4" in fig. 4) may be input into the spatial pyramid structure together with the input image. In the spatial pyramid structure, the input image may be pooled by a pooling layer (e.g., average pooling, weighted average pooling, etc.) to obtain image-level semantic features (denoted as "32×32×1" in fig. 4). In addition, the semantic features at the fourth scale may be convolved by a second convolution layer (e.g., a 1×1 convolution) to enhance the coupling of the semantic feature channels, yielding the features denoted "32×32×512" in fig. 4, and the semantic features at the fourth scale may each be hole-convolved by three first convolution layers (e.g., 3×3 convolutions) with different expansion rates to further capture multi-scale information. Here, as an example, the expansion rates may be 6, 12, and 18, respectively. In fig. 4, the semantic features obtained after each first convolution layer are all "32×32×512"; the semantic features obtained from the three first convolution layers, the second convolution layer, and the pooling layer are then spliced and dimensionality-reduced to obtain the semantic feature "32×32×512" shown in fig. 4.
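A minimal sketch of such a spatial pyramid structure, assuming PyTorch and the expansion rates 6, 12, and 18 named above, might be as follows; the branch definitions and the final reduction convolution are assumptions for the sketch, not the disclosure's reference implementation.

# Sketch of the spatial pyramid structure (in the spirit of an ASPP module).
import torch
import torch.nn as nn

class SpatialPyramid(nn.Module):
    def __init__(self, in_ch=512, img_ch=1, out_ch=512, rates=(6, 12, 18)):
        super().__init__()
        # First convolution layers: hole (dilated) convolutions at several rates
        self.dilated = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        # Second convolution layer: 1x1 convolution to couple the channels
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        # Pooling branch on the raw input image -> image-level features (32x32x1)
        self.pool = nn.AdaptiveAvgPool2d(32)
        # Dimensionality reduction after channel-wise splicing
        self.reduce = nn.Conv2d(out_ch * (len(rates) + 1) + img_ch, out_ch, 1)

    def forward(self, feat, image):
        # feat: fourth-scale features (e.g. 32x32x512); image: 512x512x1 input
        branches = [conv(feat) for conv in self.dilated]
        branches.append(self.conv1x1(feat))
        branches.append(self.pool(image))   # image-level branch at 32x32
        return self.reduce(torch.cat(branches, dim=1))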
Referring back to fig. 1, after obtaining the prediction results corresponding to the semantic features of each scale, the prediction results corresponding to the semantic features of each scale may be integrated to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest at step S140.
Specifically, according to an exemplary embodiment, in step S140, the prediction results of the multiple decoding layers may be integrated from bottom to top according to the scale of the semantic features corresponding to the prediction results, and the prediction result of the lower layer and the prediction result of the adjacent upper layer are sequentially integrated to obtain a final prediction result.
For example, in the case where n is greater than 2, the nth prediction result may be integrated with the (n-1)th prediction result, and the integrated result may then be integrated with the (n-2)th prediction result; in the case where n is equal to 2, the nth prediction result may be integrated with the (n-1)th prediction result.
As shown in fig. 2, 3, or 4, in the case where N is 4, the above integration process may include: integrating the fourth prediction result with the third prediction result, further integrating the integrated result with the second prediction result, integrating the further integrated result with the first prediction result, and taking the integrated result as the final prediction result.
As described above, following the top-down path, the semantic features of the various scales extracted by each coding layer carry strong semantics, so each of them can be used for independent prediction, and the predictions made from the semantic features of each scale serve different purposes. The small-scale (low-resolution, high-level) semantic features have a large receptive field covering the whole original image, so they can provide a large-range segmentation result; but that result is coarse, because the low resolution loses much spatial detail and the edge segmentation is not ideal. Conversely, the large-scale (high-resolution, low-level) semantic features have a smaller receptive field and yield better edge segmentation, but the continuity of their segmentation result is not as strong as that of the high-level result. The independent predictions based on the semantic features of the various scales are therefore complementary, and by comprehensively utilizing them with the semantic feature integration operation, a more refined segmentation result can be obtained.
The core idea of the semantic feature integration operation is as follows: starting from the prediction result corresponding to the smallest-scale semantic features, the prediction results corresponding to progressively larger-scale semantic features are integrated in turn so as to correct the prediction, where the prediction result corresponding to the smallest-scale semantic features is a coarse-granularity prediction and the prediction results corresponding to larger-scale semantic features are finer-granularity predictions.
According to an exemplary embodiment, the above integration process may be: comparing the two objects to be integrated at the pixel level and taking the maximum predicted value at each pixel as the integrated result. For example, the two objects to be integrated may be compared at the pixel level using a non-maximum suppression method to find the maximum predicted value corresponding to each pixel. Non-maximum suppression, i.e., suppressing elements that are not maxima, can be understood as a local maximum search.
For example, in the examples of fig. 2, 3, or 4, the four independent predictions of different scales may be fed simultaneously into a layer of the decoder that performs the integration operation; in that layer, the nth prediction result is compared with the (n-1)th prediction result at the pixel level using non-maximum suppression to obtain a local optimum, and the above steps are repeated until the first prediction result has been integrated, yielding the final prediction result.
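As a sketch only, the per-pixel integration could be written as follows, assuming the predictions are probability maps of identical shape; torch.maximum here plays the role of the local maximum search described above, and the function name is an assumption.

# Keep the per-pixel maximum of two prediction maps (shapes (H, W) or (C, H, W)).
import torch

def integrate(pred_small_scale: torch.Tensor, pred_large_scale: torch.Tensor) -> torch.Tensor:
    # Suppress the non-maximal value at every pixel
    return torch.maximum(pred_small_scale, pred_large_scale)

# Integrating four predictions, starting from the smallest-scale result p4:
# final = integrate(integrate(integrate(p4, p3), p2), p1)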
Referring back to fig. 1, after the final prediction result is obtained, an output image in which the region of interest is segmented from the input image according to the final prediction result may be output in step S150.
As an example, the decoder may further include an output layer, and a final output image is obtained after the final prediction result passes through the output layer. As shown in fig. 2, fig. 3 and fig. 4, the region of interest in the input image can be effectively segmented by using the image segmentation method, wherein the dark region in the output image is the region of interest.
The image segmentation method according to the exemplary embodiments of the present disclosure may be applied to various fields related to image processing. For example, in the field of medical image processing, the above-described image segmentation methods of the present disclosure may be utilized to segment tumor regions in immunohistochemical membrane stained section images. Fig. 5 is a diagram illustrating an example of a segmented image obtained by the image segmentation method according to the embodiment of the present disclosure. As shown in fig. 5, tumor regions in immunohistochemically stained sections can be effectively segmented, wherein dark regions represent tumor regions and light regions represent non-tumor regions.
As described above, the image segmentation method according to various exemplary embodiments of the present disclosure performs image segmentation using a deep neural network model, which requires training to be performed in advance before being used for image segmentation.
As an example, the deep neural network model may be trained in a deeply supervised manner: during training, a back-propagated gradient may be obtained by calculating the prediction loss of each of the plurality of decoding layers of the decoder so as to update the parameters of each encoding layer in the encoder.
Deep supervision refers to supervising the backbone network by adding auxiliary classifiers that make independent predictions at some intermediate layers of a deep neural network. As a deep neural network grows deeper, it becomes prone to vanishing gradients and slow convergence during training; the deep supervision method effectively accelerates the updating of the parameters of each coding layer in the encoder during back-propagation, so that more useful semantic and spatial features can be extracted. Here, independent prediction results are obtained by predicting independently from the extracted semantic features of multiple scales, and deep supervision is used to compute a loss for each prediction result and obtain the back-propagated gradient, so that the parameters of each coding layer can be updated rapidly.
As an example, when the numbers of coding layers and decoding layers (which may also be referred to as prediction layers) are both four, the deep supervision process may proceed as follows: the four semantic features of different scales obtained from coding layer 1, coding layer 2, coding layer 3, and coding layer 4 (the last further encoded by the spatial pyramid structure) are convolved, up-sampled, and passed through a softmax activation to obtain four prediction results of size 512×512×2; a cross-entropy loss is then computed for each of the four prediction results obtained through the four decoding layers; the losses of the four decoding layers are added to the loss of the final output layer to obtain the final total loss value L; and gradient computation and back-propagation are then performed. The formula is as follows:
L = \sum_{b=1}^{B} \left[ \sum_{i=1}^{N} \ell_{ce}\left(\hat{Y}_b^{(i)}, Y_b\right) + \ell_{ce}\left(\hat{Y}_b^{(out)}, Y_b\right) \right]

In the above formula, B denotes the number of images in a batch, b denotes the b-th image, N denotes the number of deeply supervised prediction layers, i denotes the i-th prediction layer, \hat{Y}_b^{(i)} denotes the i-th prediction probability map obtained for the b-th image, \hat{Y}_b^{(out)} denotes the prediction probability map output by the final output layer, \ell_{ce} denotes the cross-entropy loss, and Y_b denotes the manual annotation of the b-th image.
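A hedged sketch of this deeply supervised loss, assuming PyTorch's nn.CrossEntropyLoss and logits of size 512×512×2, might be as follows; the reduction choice and tensor shapes are assumptions, since the disclosure does not specify the exact normalization.

# Cross-entropy on each side prediction plus the final output, summed.
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()  # reduction="mean" over batch and pixels (assumption)

def deep_supervision_loss(side_preds, final_pred, target):
    # side_preds: list of N logits tensors shaped (B, 2, 512, 512)
    # final_pred: logits of the final output layer, (B, 2, 512, 512)
    # target: manual annotations, (B, 512, 512) with class indices
    loss = ce(final_pred, target)
    for p in side_preds:
        loss = loss + ce(p, target)
    return loss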
During training, the model updates its parameters through continuous iteration. The model and parameters of each generation are verified on the held-out validation set, and suitable evaluation indices are selected to evaluate model performance. For example, when the region of interest is a tumor region, the evaluation indices may include the intersection-over-union (IoU) of the tumor region segmentation and the average pixel classification accuracy; the closer these two indices are to 1, the better the current model and parameters perform. Finally, the model with the highest tumor region segmentation IoU and the highest average pixel classification accuracy is taken as the optimal model.
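For illustration, the two evaluation indices might be computed as follows for binary masks; the function names and the per-class averaging convention are assumptions for this sketch.

# IoU of the tumor region and average per-class pixel accuracy for binary masks.
import torch

def tumor_iou(pred: torch.Tensor, label: torch.Tensor) -> float:
    # pred, label: boolean masks where True marks the tumor region
    inter = (pred & label).sum().item()
    union = (pred | label).sum().item()
    return inter / union if union > 0 else 1.0

def mean_pixel_accuracy(pred: torch.Tensor, label: torch.Tensor) -> float:
    # Average the per-class pixel accuracies over the two classes
    accs = []
    for cls in (False, True):
        mask = label == cls
        if mask.any():
            accs.append(((pred == cls) & mask).sum().item() / mask.sum().item())
    return sum(accs) / len(accs)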
The image segmentation method according to the embodiments of the present disclosure has been described above with reference to fig. 1 to 5. By extracting multi-scale semantic features of an image and integrating the prediction results corresponding to the semantic features of each scale, the method can improve the overall continuity of the segmented region of interest and thereby the accuracy of the image segmentation result. In addition, by combining the human visual cognition process with deep learning technology, the embodiments of the present disclosure not only effectively address the accurate segmentation of the region of interest but also improve the interpretability of deep-learning-based image segmentation.
The present disclosure provides, in addition to the above-mentioned image segmentation method, a corresponding image segmentation apparatus and device, which will be described with reference to fig. 6 and 7.
Fig. 6 is a block diagram illustrating an image segmentation apparatus 600 according to an embodiment of the present disclosure.
Referring to fig. 6, the image segmentation apparatus may include: an acquisition module 610, an extraction module 620, a decoding module 630, an integration module 640, and an output module 650.
The acquisition module 610 may be configured to acquire an input image containing a region of interest.
Here, the region of interest may be any target region of interest to the user, for example, the region of interest may be a tumor region, but is not limited thereto. It should be noted that, according to the different application fields of the image segmentation method of the present disclosure, the input image and the region of interest may be changed accordingly. In addition, the present disclosure is also not limited to the form of the input image, and for example, the input image may be a grayscale image or a color image. In addition, the present disclosure also has no limitation on the manner of acquiring the input image, and for example, the input image including the region of interest may be captured in response to a user request, a previously captured input image may be directly acquired from an external device, or the like. In addition, the input image may be an image of various resolutions.
After the input image is acquired, image segmentation may be performed using a deep neural network model. Here, the deep neural network model may include an encoder and a decoder. The encoder may include a plurality of encoding layers and the decoder may include a plurality of decoding layers.
The extraction module 620 may be configured to extract semantic features of the input image at different scales using a plurality of encoding layers of an encoder, respectively.
According to the embodiment of the disclosure, the plurality of coding layers may include N coding layers, wherein the size of the semantic feature of the nth scale extracted by the nth coding layer is smaller than the size of the semantic feature of the nth-1 st scale extracted by the nth coding layer, the nth scale is the smallest scale in the plurality of scales, N is a positive integer greater than or equal to 2, and N is less than or equal to N and greater than or equal to 2.
According to an embodiment of the present disclosure, the encoding layer may include a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer. Since those skilled in the art know the above mentioned convolution layer, pooling layer, batch normalization layer and active layer, details of the above mentioned convolution layer, pooling layer, batch normalization layer and active layer are not described herein again.
As an example, N may be 4. In this case, as shown in fig. 2 above, a semantic feature of the input image at a first scale may be first extracted by using a first encoding layer (denoted as "encoding layer 1" in fig. 2) of the plurality of encoding layers; secondly, extracting semantic features of a second scale of the input image by using a second coding layer (represented as 'coding layer 2' in fig. 2) in the plurality of coding layers based on the semantic features of the first scale; next, extracting semantic features of a third scale of the input image by using a third coding layer (denoted as "coding layer 3" in fig. 2) of the plurality of coding layers based on the semantic features of the second scale; finally, based on the semantic features of the third scale, the semantic features of the fourth scale of the input image are extracted by using a fourth coding layer (represented as "coding layer 4" in fig. 2) of the plurality of coding layers.
According to an embodiment of the present disclosure, after the semantic features of the Nth scale are extracted by the Nth coding layer (i.e., coding layer 4), the semantic features of the Nth scale and the input image may further be input into the spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure may include a plurality of first convolution layers, second convolution layers, and pooling layers having different expansion rates. Specifically, a plurality of first convolution layers with different expansion rates may be used to perform hole convolution on the nth scale semantic features respectively to further capture multi-scale information; the second convolution layer can be used for further convolution of the semantic features of the Nth scale so as to enhance the coupling of semantic feature channels; the pooling layer may be configured to pool the input images to obtain image-level semantic features.
The decoding module 630 may be configured to decode the semantic features of the plurality of scales by using a plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale.
According to an embodiment of the present disclosure, the plurality of decoding layers may include N decoding layers. Specifically, in the decoding module 630, based on the semantic features of the nth scale extracted by the nth coding layer, the nth decoding layer corresponding to the nth coding layer may be used for decoding to obtain the nth prediction result corresponding to the semantic features of the nth scale.
As an example, N may be 4. In this case, referring again to fig. 2, a first prediction result may be obtained by first decoding with a first decoding layer (denoted as "decoding layer 1" in fig. 2) corresponding to the first coding layer based on the semantic features of the first scale extracted by the first coding layer; secondly, a second prediction result may be obtained by decoding with a second decoding layer (denoted as "decoding layer 2" in fig. 2) corresponding to the second coding layer based on the semantic features of the second scale extracted by the second coding layer; next, a third prediction result may be obtained by decoding with a third decoding layer (denoted as "decoding layer 3" in fig. 2) corresponding to the third encoding layer based on the semantic features of the third scale extracted by the third encoding layer; finally, a fourth prediction result may be obtained by decoding with a fourth decoding layer (denoted as "decoding layer 4" in fig. 2) corresponding to the fourth encoding layer based on the semantic features of the fourth scale extracted by the fourth encoding layer.
Specifically, the prediction result corresponding to the semantic features of each scale may be obtained by decoding the semantic features of the plurality of scales in the following manner: the semantic features of the nth scale are fused with the semantic features of the (n-1)th scale, and decoding is performed by the (n-1)th decoding layer based on the fused semantic features to obtain the (n-1)th prediction result. For example, the semantic features of the nth scale may first be up-sampled, the up-sampled result may then be spliced with the semantic features of the (n-1)th scale, the splicing result may be up-sampled, and the (n-1)th prediction result may be obtained by decoding with the (n-1)th decoding layer based on the up-sampled semantic features, where the splicing may be performed along the channel dimension.
The integration module 640 may be configured to integrate the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether the respective pixels in the input image belong to the region of interest.
According to an exemplary embodiment, in the integration module 640, the prediction results of multiple decoding layers may be integrated from bottom to top according to the scale of the semantic features corresponding to the prediction results, and the prediction result of a lower layer and the prediction result of an adjacent upper layer are sequentially integrated to obtain a final prediction result.
For example, in the case where n is greater than 2, the nth prediction result may be integrated with the (n-1)th prediction result, and the integrated result may then be integrated with the (n-2)th prediction result; in the case where n is equal to 2, the nth prediction result may be integrated with the (n-1)th prediction result.
As shown in fig. 2, 3, or 4 above, in the case where N is 4, the integration process may include: integrating the fourth prediction result with the third prediction result, further integrating the integrated result with the second prediction result, integrating the further integrated result with the first prediction result, and taking the integrated result as the final prediction result.
The output module 650 may be configured to output an output image in which the region of interest is segmented from the input image according to the final prediction result.
Since details of the above operations have been introduced in the process of describing the image segmentation method according to the present disclosure, the details are not repeated here for brevity, and the related details can refer to the above description about fig. 1 to 5.
Fig. 7 is a block diagram illustrating an image segmentation apparatus 700 according to an embodiment of the present disclosure.
Referring to fig. 7, an image segmentation apparatus 700 may include a processor 701 and a memory 702. The processor 701 and the memory 702 may both be connected by a bus 703.
The processor 701 may perform various actions and processes according to programs stored in the memory 702. In particular, the processor 701 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 or ARM architecture.
The memory 702 stores computer instructions that, when executed by the processor 701, implement the image segmentation method described above. The memory 702 may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
The present disclosure also provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, may implement the method described above. Similarly, computer-readable storage media in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the computer-readable storage media described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention, as set forth in detail above, are intended to be illustrative, not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof may be made without departing from the principles and spirit of the invention, and that such modifications are intended to be within the scope of the invention.

Claims (42)

1. A method of performing image segmentation using a deep neural network model, wherein the deep neural network model comprises an encoder and a decoder, the method comprising:
acquiring an input image containing a region of interest;
utilizing a plurality of coding layers of a coder to respectively extract semantic features of different scales of the input image;
respectively decoding the semantic features of the multiple scales by utilizing a plurality of decoding layers of a decoder to obtain a prediction result corresponding to the semantic feature of each scale;
integrating the prediction results corresponding to the semantic features of each scale to obtain a final prediction result about whether each pixel in the input image belongs to the region of interest; and
and outputting an output image of the region of interest segmented from the input image according to the final prediction result.
2. The method of claim 1, wherein,
the plurality of coding layers comprise N coding layers, wherein the size of the semantic features of the nth scale extracted by the nth coding layer is smaller than the size of the semantic features of the nth-1 scale extracted by the nth coding layer, the nth scale is the smallest scale in the plurality of scales, N is a positive integer greater than or equal to 2, and N is less than or equal to N and greater than or equal to 2.
3. The method of claim 2, wherein the encoding layers include a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer.
4. The method of claim 2, wherein N is 4, the extracting semantic features of the input image at different scales using a plurality of encoding layers of an encoder, respectively, comprising:
extracting semantic features of the input image at a first scale by using a first coding layer of the plurality of coding layers;
extracting semantic features of the input image at a second scale by using a second coding layer in the plurality of coding layers based on the semantic features at the first scale;
extracting semantic features of the input image at a third scale by using a third coding layer in the plurality of coding layers based on the semantic features at the second scale;
and extracting the semantic features of the input image at a fourth scale by utilizing a fourth coding layer in the plurality of coding layers based on the semantic features at the third scale.
5. The method as claimed in claim 4, wherein after the Nth scale semantic features extracted by the Nth coding layer, the method further comprises:
and performing further semantic feature extraction on the semantic features of the Nth scale and the input image input space pyramid structure.
6. The method of claim 5, wherein the spatial pyramid structure comprises:
the first convolution layers with different expansion rates are used for respectively carrying out cavity convolution on the semantic features of the Nth scale so as to further capture multi-scale information;
the second convolution layer is used for further convolving the semantic features of the Nth scale so as to enhance the coupling of the semantic feature channel;
and the pooling layer is used for pooling the input images to obtain semantic features of image levels.
7. The method of claim 2, wherein the plurality of decoding layers comprises N decoding layers, wherein the decoding the semantic features of the plurality of scales with the plurality of decoding layers of the decoder, respectively, comprises:
and decoding by using an nth decoding layer corresponding to the nth coding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale based on the semantic features of the nth scale extracted by the nth coding layer.
8. The method as claimed in claim 7, wherein N is 4, and the decoding with the nth decoding layer corresponding to the nth coding layer based on the semantic features of the nth scale extracted by the nth coding layer to obtain the nth prediction result comprises:
decoding by using a first decoding layer corresponding to the first coding layer to obtain a first prediction result based on semantic features of a first scale extracted by the first coding layer;
decoding by using a second decoding layer corresponding to the second coding layer to obtain a second prediction result based on the semantic features of the second scale extracted by the second coding layer;
decoding with a third decoding layer corresponding to the third coding layer to obtain a third prediction result based on the semantic features of the third scale extracted by the third coding layer;
and decoding by using a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result based on the semantic features of the fourth scale extracted by the fourth coding layer.
9. The method of claim 5, wherein the decoding the semantic features of the plurality of scales with a plurality of decoding layers of a decoder to obtain the prediction result corresponding to the semantic features of each scale comprises:
and fusing the semantic features of the nth scale and the semantic features of the (n-1) th scale, and decoding by using an (n-1) th decoding layer based on the fused semantic features to obtain an (n-1) th prediction result.
10. The method of claim 9, wherein the fusing the semantic features of the nth scale and the semantic features of the (n-1) th scale and decoding with the (n-1) th decoding layer based on the fused semantic features to obtain the (n-1) th prediction result comprises:
the semantic features of the nth scale are up-sampled, the up-sampled result is spliced with the semantic features of the (n-1)th scale, the splicing result is up-sampled, and the (n-1)th decoding layer is used for decoding based on the up-sampled semantic features to obtain the (n-1)th prediction result.
11. The method of claim 9, wherein N is 4, the decoding the semantic features of the plurality of scales with a plurality of decoding layers of a decoder to obtain the prediction results corresponding to the semantic features of each scale, comprising:
the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result;
the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the third scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a third decoding layer is used for decoding based on the up-sampled semantic features to obtain a third prediction result;
the first splicing result is subjected to upsampling, the upsampled result is spliced with the semantic features of the second scale to obtain a second splicing result, the second splicing result is subjected to upsampling, and decoding is carried out by utilizing a second decoding layer based on the upsampled semantic features to obtain a second prediction result;
and upsampling the second splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a third splicing result, upsampling the third splicing result, and decoding by using the first decoding layer based on the upsampled semantic features to obtain a first prediction result.
12. The method of claim 5, wherein the decoding the semantic features of the plurality of scales with a plurality of decoding layers of a decoder to obtain the prediction result corresponding to the semantic features of each scale comprises:
and fusing the semantic features of the mth scale and the semantic features of one of the 1st to (m-2)th scales, and decoding by using a decoding layer corresponding to the one scale based on the fused semantic features to obtain a prediction result corresponding to the one scale, wherein m is less than or equal to N and greater than 2.
13. The method of claim 12, wherein the fusing the semantic features of the m-th scale with the semantic features of one of the 1 st to m-2 nd scales and decoding with a decoding layer corresponding to the one scale based on the fused semantic features to obtain the prediction result corresponding to the one scale comprises:
the semantic features of the m-th scale are subjected to up-sampling, the up-sampled result is spliced with the semantic features of one scale, the spliced result is subjected to up-sampling, and decoding is performed by utilizing a decoding layer corresponding to the one scale on the basis of the up-sampled semantic features to obtain a prediction result corresponding to the one scale.
14. The method of claim 12, wherein N is 4, the decoding the semantic features of the plurality of scales with a plurality of decoding layers of a decoder to obtain the prediction results corresponding to the semantic features of each scale, comprising:
the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result;
the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the second scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a second decoding layer is used for decoding based on the up-sampled semantic features to obtain a second prediction result;
the semantic features of the third scale are subjected to up-sampling, and decoding is carried out by utilizing a third decoding layer on the basis of the up-sampled semantic features to obtain a third prediction result;
and performing upsampling on the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, performing upsampling on the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
15. The method of claim 9 or 12, wherein the decoding the semantic features of the plurality of scales with a plurality of decoding layers of a decoder to obtain the prediction result corresponding to the semantic features of each scale, further comprises:
and performing upsampling after splicing the further extracted semantic features, and obtaining an Nth prediction result corresponding to the semantic features of the Nth scale by utilizing an Nth decoding layer based on the upsampled semantic features.
16. The method of claim 7, wherein the integrating the predicted results corresponding to the semantic features of each scale comprises:
and integrating the prediction results of the multiple decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer and the prediction results of the adjacent upper layer to obtain the final prediction result.
17. The method of claim 16, wherein the integrating the prediction results of the multiple decoding layers from bottom to top according to the size of the semantic features corresponding to the prediction results, and sequentially integrating the prediction result of the lower layer with the prediction result of the adjacent upper layer to obtain the final prediction result comprises:
under the condition that n is greater than 2, integrating the nth prediction result with the (n-1)th prediction result, and integrating the integrated result with the (n-2)th prediction result;
in the case where n is equal to 2, integrating the nth prediction result with the (n-1)th prediction result.
18. The method of claim 17, wherein N is 4, and the integrating the predicted results corresponding to the semantic features of each scale comprises:
and integrating the fourth prediction result with the third prediction result, further integrating the integrated result with the second prediction result, integrating the further integrated result with the first prediction result, and taking the integrated result as a final prediction result.
19. The method of claim 17, wherein the integrating comprises comparing the two objects to be integrated at a pixel level, finding a maximum predicted value corresponding to each pixel as a final predicted result.
20. The method of claim 1, wherein the deep neural network model is trained using deep supervised mode,
wherein, in the deep neural network model training process, a backward-propagated gradient is obtained by calculating a prediction loss of each of the plurality of decoding layers of the decoder to update parameters of each encoding layer in the encoder.
21. An apparatus for performing image segmentation using a deep neural network model, wherein the deep neural network model includes an encoder and a decoder, the apparatus comprising:
an acquisition module configured to acquire an input image containing a region of interest;
an extraction module configured to extract semantic features of the input image at different scales respectively by using a plurality of encoding layers of an encoder;
a decoding module configured to decode the semantic features of the plurality of scales respectively by using a plurality of decoding layers of a decoder to obtain a prediction result corresponding to the semantic features of each scale;
an integration module configured to integrate the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and
an output module configured to output an output image in which the region of interest is segmented from an input image according to the final prediction result.
22. The apparatus of claim 21, wherein,
the plurality of coding layers comprise N coding layers, wherein the size of the semantic features of the nth scale extracted by the nth coding layer is smaller than the size of the semantic features of the nth-1 scale extracted by the nth coding layer, the nth scale is the smallest scale in the plurality of scales, N is a positive integer greater than or equal to 2, and N is less than or equal to N and greater than or equal to 2.
23. The apparatus of claim 22, wherein the encoding layer comprises a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer.
24. The apparatus of claim 22, wherein N is 4, the extraction module comprising:
extracting semantic features of the input image at a first scale by using a first coding layer of the plurality of coding layers;
extracting semantic features of the input image at a second scale by using a second coding layer in the plurality of coding layers based on the semantic features at the first scale;
extracting semantic features of the input image at a third scale by using a third coding layer in the plurality of coding layers based on the semantic features at the second scale;
and extracting the semantic features of the input image at a fourth scale by utilizing a fourth coding layer in the plurality of coding layers based on the semantic features at the third scale.
25. The apparatus of claim 24, wherein the nth scale semantic features extracted by the nth coding layer are followed by:
and performing further semantic feature extraction on the semantic features of the Nth scale and the input image input space pyramid structure.
26. The apparatus of claim 25, wherein the spatial pyramid structure comprises:
the first convolution layers with different expansion rates are used for respectively carrying out cavity convolution on the semantic features of the Nth scale so as to further capture multi-scale information;
the second convolution layer is used for further convolving the semantic features of the Nth scale so as to enhance the coupling of the semantic feature channel;
and the pooling layer is used for pooling the input images to obtain semantic features of image levels.
27. The apparatus of claim 22, wherein the plurality of decoding layers comprises N decoding layers, wherein the decoding the semantic features of the plurality of scales with the plurality of decoding layers of the decoder, respectively, comprises:
and decoding by using an nth decoding layer corresponding to the nth coding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale based on the semantic features of the nth scale extracted by the nth coding layer.
28. The apparatus of claim 27, wherein N is 4, and the decoding with the nth decoding layer corresponding to the nth coding layer based on the semantic features of the nth scale extracted by the nth coding layer to obtain the nth prediction result comprises:
decoding by using a first decoding layer corresponding to the first coding layer to obtain a first prediction result based on semantic features of a first scale extracted by the first coding layer;
decoding by using a second decoding layer corresponding to the second coding layer to obtain a second prediction result based on the semantic features of the second scale extracted by the second coding layer;
decoding with a third decoding layer corresponding to the third coding layer to obtain a third prediction result based on the semantic features of the third scale extracted by the third coding layer;
and decoding by using a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result based on the semantic features of the fourth scale extracted by the fourth coding layer.
29. The apparatus of claim 22, wherein the decoding module comprises:
and fusing the semantic features of the nth scale and the semantic features of the (n-1) th scale, and decoding by using an (n-1) th decoding layer based on the fused semantic features to obtain an (n-1) th prediction result.
30. The apparatus of claim 29, wherein the fusing the semantic features of the nth scale and the semantic features of the (n-1)th scale and decoding with the (n-1)th decoding layer based on the fused semantic features to obtain the (n-1)th prediction result comprises:
the semantic features of the nth scale are up-sampled, the up-sampled result is spliced with the semantic features of the (n-1)th scale, the splicing result is up-sampled, and the (n-1)th decoding layer is used for decoding based on the up-sampled semantic features to obtain the (n-1)th prediction result.
31. The apparatus of claim 29, wherein N is 4, the decoding module comprising:
the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result;
the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the third scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a third decoding layer is used for decoding based on the up-sampled semantic features to obtain a third prediction result;
the first splicing result is subjected to upsampling, the upsampled result is spliced with the semantic features of the second scale to obtain a second splicing result, the second splicing result is subjected to upsampling, and decoding is carried out by utilizing a second decoding layer based on the upsampled semantic features to obtain a second prediction result;
and upsampling the second splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a third splicing result, upsampling the third splicing result, and decoding by using the first decoding layer based on the upsampled semantic features to obtain a first prediction result.
32. The apparatus of claim 22, wherein the decoding module comprises:
and fusing the semantic features of the mth scale and the semantic features of one of the 1st to (m-2)th scales, and decoding by using a decoding layer corresponding to the one scale based on the fused semantic features to obtain a prediction result corresponding to the one scale, wherein m is less than or equal to N and greater than 2.
33. The apparatus of claim 32, wherein the fusing the semantic features of the m-th scale with the semantic features of one of the 1 st to m-2 nd scales and decoding with a decoding layer corresponding to the one scale based on the fused semantic features to obtain the prediction result corresponding to the one scale comprises:
the semantic features of the m-th scale are subjected to up-sampling, the up-sampled result is spliced with the semantic features of one scale, the spliced result is subjected to up-sampling, and decoding is performed by utilizing a decoding layer corresponding to the one scale on the basis of the up-sampled semantic features to obtain a prediction result corresponding to the one scale.
34. The apparatus of claim 32, wherein N is 4, the decode module comprising:
the semantic features of the fourth scale are subjected to up-sampling, and decoding is carried out by utilizing a fourth decoding layer on the basis of the up-sampled semantic features to obtain a fourth prediction result;
the semantic features of the fourth scale are subjected to up-sampling, the up-sampled result and the semantic features of the second scale are spliced to obtain a first splicing result, the first splicing result is subjected to up-sampling, and a second decoding layer is used for decoding based on the up-sampled semantic features to obtain a second prediction result;
the semantic features of the third scale are subjected to up-sampling, and decoding is carried out by utilizing a third decoding layer on the basis of the up-sampled semantic features to obtain a third prediction result;
and performing upsampling on the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, performing upsampling on the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
35. The apparatus of claim 29 or 32, wherein the decoding module further comprises:
and performing upsampling after splicing the further extracted semantic features, and obtaining an Nth prediction result corresponding to the semantic features of the Nth scale by utilizing an Nth decoding layer based on the upsampled semantic features.
36. The apparatus of claim 27, wherein the integrating the predicted results corresponding to the semantic features of each scale comprises:
and integrating the prediction results of the multiple decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer and the prediction results of the adjacent upper layer to obtain the final prediction result.
37. The apparatus of claim 36, wherein the integrating the prediction results of the multiple decoding layers from bottom to top according to the size of the semantic features corresponding to the prediction results, and sequentially integrating the prediction result of the lower layer with the prediction result of the adjacent upper layer to obtain the final prediction result comprises:
under the condition that n is greater than 2, integrating the nth prediction result with the (n-1)th prediction result, and integrating the integrated result with the (n-2)th prediction result;
in the case where n is equal to 2, integrating the nth prediction result with the (n-1)th prediction result.
38. The apparatus of claim 37, wherein N is 4, and the integrating the predicted results corresponding to the semantic features of each scale comprises:
and integrating the fourth prediction result with the third prediction result, further integrating the integrated result with the second prediction result, integrating the further integrated result with the first prediction result, and taking the integrated result as a final prediction result.
39. The apparatus of claim 37, wherein the integrating comprises comparing two objects to be integrated at a pixel level, finding a maximum prediction value corresponding to each pixel as a final prediction result.
40. The apparatus of claim 21, wherein the deep neural network model is trained using deep supervision,
wherein, in the deep neural network model training process, a backward-propagated gradient is obtained by calculating a prediction loss of each of the plurality of decoding layers of the decoder to update parameters of each encoding layer in the encoder.
41. An apparatus for performing image segmentation using a deep neural network model, comprising:
a processor, and
a memory storing computer-executable instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 1-20.
42. A computer-readable recording medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, cause the processor to perform the method of any one of claims 1-20.
CN202110294862.XA 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model Active CN113065551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110294862.XA CN113065551B (en) 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110294862.XA CN113065551B (en) 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model

Publications (2)

Publication Number Publication Date
CN113065551A true CN113065551A (en) 2021-07-02
CN113065551B CN113065551B (en) 2023-08-08

Family

ID=76562263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110294862.XA Active CN113065551B (en) 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model

Country Status (1)

Country Link
CN (1) CN113065551B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200097701A1 (en) * 2016-11-16 2020-03-26 Ventana Medical Systems, Inc. Convolutional neural networks for locating objects of interest in images of biological samples
WO2018200493A1 (en) * 2017-04-25 2018-11-01 The Board Of Trustees Of The Leland Stanford Junior University Dose reduction for medical imaging using deep convolutional neural networks
CN109034162A (en) * 2018-07-13 2018-12-18 南京邮电大学 A kind of image, semantic dividing method
WO2020069964A1 (en) * 2018-10-05 2020-04-09 Robert Bosch Gmbh Method, artificial neural network, device, computer program, and machine-readable storage medium for semantically segmenting image data
CN111091524A (en) * 2018-10-08 2020-05-01 天津工业大学 Prostate transrectal ultrasound image segmentation method based on deep convolutional neural network
WO2020093042A1 (en) * 2018-11-02 2020-05-07 Deep Lens, Inc. Neural networks for biomedical image analysis
CN109493350A (en) * 2018-11-09 2019-03-19 重庆中科云丛科技有限公司 Portrait dividing method and device
US20200151497A1 (en) * 2018-11-12 2020-05-14 Sony Corporation Semantic segmentation with soft cross-entropy loss
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110070091A (en) * 2019-04-30 2019-07-30 福州大学 The semantic segmentation method and system rebuild based on dynamic interpolation understood for streetscape
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN111047551A (en) * 2019-11-06 2020-04-21 北京科技大学 Remote sensing image change detection method and system based on U-net improved algorithm
CN111292330A (en) * 2020-02-07 2020-06-16 北京工业大学 Image semantic segmentation method and device based on coder and decoder
CN111429473A (en) * 2020-02-27 2020-07-17 西北大学 Chest film lung field segmentation model establishment and segmentation method based on multi-scale feature fusion
CN112017191A (en) * 2020-08-12 2020-12-01 西北大学 Method for establishing and segmenting liver pathology image segmentation model based on attention mechanism
CN112150428A (en) * 2020-09-18 2020-12-29 青岛大学 Medical image segmentation method based on deep learning
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning
CN112465828A (en) * 2020-12-15 2021-03-09 首都师范大学 Image semantic segmentation method and device, electronic equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DONGYUN LIN et al.: "RefineU-Net: Improved U-Net with progressive global feedbacks and residual attention guided local refinement for medical image segmentation", vol. 138, pages 1 - 11 *
MENGDI YAN et al.: "S3 Net: Trained on a Small Sample Segmentation Network for Biomedical Image Analysis", 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 6 February 2020 (2020-02-06), pages 1402 - 1408 *
XUEYING CHEN et al.: "Feature Fusion Encoder Decoder Network for Automatic Liver Lesion Segmentation", pages 1 - 4 *
LI Daxiang et al.: "Retinal vessel image segmentation algorithm based on improved U-Net", Acta Optica Sinica, vol. 40, no. 10, 25 May 2020 (2020-05-25), pages 64 - 72 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240938A (en) * 2022-02-24 2022-03-25 浙江大学 Segmentation method and device for abnormal tissues in H & E stained section image
CN115294655A (en) * 2022-08-18 2022-11-04 中科天网(广东)科技有限公司 Method, device and equipment for countermeasures generation pedestrian re-recognition based on multilevel module features of non-local mechanism
CN116681668A (en) * 2023-06-01 2023-09-01 北京远舢智能科技有限公司 Appearance defect detection method based on four-layer gradient fusion neural network
CN117292331A (en) * 2023-11-27 2023-12-26 四川发展环境科学技术研究院有限公司 Complex foreign matter detection system and method based on deep learning
CN117292331B (en) * 2023-11-27 2024-02-02 四川发展环境科学技术研究院有限公司 Complex foreign matter detection system and method based on deep learning

Also Published As

Publication number Publication date
CN113065551B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN112017189B (en) Image segmentation method and device, computer equipment and storage medium
CN113065551B (en) Method for performing image segmentation using deep neural network model
CN110428432B (en) Deep neural network algorithm for automatically segmenting colon gland image
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN111369581B (en) Image processing method, device, equipment and storage medium
Abdollahi et al. Improving road semantic segmentation using generative adversarial network
CN112017191A (en) Method for establishing and segmenting liver pathology image segmentation model based on attention mechanism
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111260055A (en) Model training method based on three-dimensional image recognition, storage medium and equipment
CN112446892A (en) Cell nucleus segmentation method based on attention learning
CN112651979A (en) Lung X-ray image segmentation method, system, computer equipment and storage medium
CN112053363B (en) Retina blood vessel segmentation method, retina blood vessel segmentation device and model construction method
CN115661144A (en) Self-adaptive medical image segmentation method based on deformable U-Net
CN113111716B (en) Remote sensing image semiautomatic labeling method and device based on deep learning
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN113780243A (en) Training method, device and equipment of pedestrian image recognition model and storage medium
CN113762396A (en) Two-dimensional image semantic segmentation method
CN114549470A (en) Method for acquiring critical region of hand bone based on convolutional neural network and multi-granularity attention
CN113936235A (en) Video saliency target detection method based on quality evaluation
CN116543162A (en) Image segmentation method and system based on feature difference and context awareness consistency
CN115439493A (en) Method and device for segmenting cancerous region of breast tissue section
CN114496099A (en) Cell function annotation method, device, equipment and medium
CN117422787B (en) Remote sensing image map conversion method integrating discriminant and generative model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant